I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
from gcc's auto-vectorization for the bubble-sort benchmark of John Hennessy's collection of small integer benchmarks.
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
I have seen big slowdowns (factor 5.7 on a Rocket Lake with
gcc-14.2) from gcc's auto-vectorization for the bubble-sort
benchmark of John Hennessy's collection of small integer
benchmarks.
What is the PR number?
By now you should know that I consider gcc bug reports a waste of
time. I last told you that in
<2025Jul15.080403@mips.complang.tuwien.ac.at> and gave PR93811 as an
example where I have wasted my time with creating a PR, and the status
of this PR has not changed in the meantime.
You seem to think that it is worthwhile creating gcc bug reports, so
go ahead and create one yourself. I think the web page contains all
information necessary, but if you miss something, let me know.
- anton
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
from gcc's auto-vectorization for the bubble-sort benchmark of John
Hennessy's collection of small integer benchmarks.
What is the PR number?
By now you should know that I consider gcc bug reports a waste of
time.
I had an above-zero success rate with gcc bug reports related to
compiler pessimization. Maybe 10%, maybe even 15%. I didn't really
count.
Michael S <already5chosen@yahoo.com> schrieb:
I had an above-zero success rate with gcc bug reports related to
compiler pessimization. Maybe 10%, maybe even 15%. I didn't
really count.
The case in point appears to be a regression, and regressions are
supposed to be fixed and receive much higher attention than "normal" bugs.
On Sun, 1 Feb 2026 00:33:15 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Michael S <already5chosen@yahoo.com> schrieb:
I had an above-zero success rate with gcc bug reports related to
compiler pessimization. Maybe 10%, maybe even 15%. I didn't
really count.
The case in point appears to be a regression, and regressions are
supposed to be fixed and receive much higher attention than "normal" bugs.
According to my experience, it could receive higher attention than
an average pessimization case, but there is close to zero chance
that it would be fixed in the end.
The typical scenario for such cases is that they "fall between chairs"
of tree optimization and target code generation and neither party is
taking responsibility.
Michael S <already5chosen@yahoo.com> schrieb:
I had an above-zero success rate with gcc bug reports related to
compiler pessimization. Maybe 10%, maybe even 15%. I didn't really
count.
The case in point appears to be a regression, and regressions are
supposed to be fixed and receive much higher attention than "normal" bugs.
I very rarely submit "missed optimization" bugs.
As far as I am concerned, missed optimization is not a bug, it is
business as usual, with the hope of improvement in the future.
My cases are nearly always "compiler tries to be too smart with
horrible consequences" rather than "compiler is too stupid".
Michael S <already5chosen@yahoo.com> schrieb:
On Sun, 1 Feb 2026 00:33:15 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Michael S <already5chosen@yahoo.com> schrieb:
I had an above-zero success rate with gcc bug reports related to
compiler pessimization. Maybe 10%, maybe even 15%. I didn't
really count.
The case in point appears to be a regression, and regressions are
supposed to be fixed and receive much higher attention than "normal" bugs.
According to my experience, it could receive higher attention than
an average pessimization case, but there is close to zero chance
that it would be fixed in the end.
The typical scenario for such cases is that they "fall between
chairs" of tree optimization and target code generation and neither
party is taking responsibility.
Do you have an example?
On Sun, 1 Feb 2026 09:16:27 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Michael S <already5chosen@yahoo.com> schrieb:
I had an above-zero success rate with gcc bug reports related to
compiler pessimization. Maybe 10%, maybe even 15%. I didn't
really count.
Maybe some figures to put this into perspective.
In 2025, 593 missed-optimization bugs were closed, most of them
marked as fixed, and 528 new ones were submitted. As of today, there
are 3672 missed-optimization bugs open, so we are looking at around
a six-year average turnover.
97 missed-optimization regressions were submitted in 2025 and 174
were closed; with 318 missed-optimization regressions open right
now, that is more of a two-year average turnover (and there seems
to be progress in reducing those).
I chose 2025 because it is easy to search for; it does not
correspond to a gcc release cycle.
I very rarely submit "missed optimization" bugs.
As far as I am concerned, missed optimization is not a bug, it is
business as usual, with the hope of improvement in the future.
My cases are nearly always "compiler tries to be too smart with
horrible consequences" rather than "compiler is too stupid".
Here is a good example
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
By chance, it is remotely related to Anton's case.
Michael S <already5chosen@yahoo.com> schrieb:
Here is a good example
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
By chance, it is remotely related to Anton's case.
That's a good one, and would apparently take quite some work to fix.
I've pinged it, BTW.
By the way, regarding your rate of fixed bugs: I see 15 bugs
submitted, 3 as WONTFIX, 5 as FIXED. If you take out the WONTFIX
(for an architecture which is no longer supported due to lack of
a maintainer), you have a 42% success rate so far, at least for
the e-mail address in the PR, not 10-15% as you estimated.
I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
from gcc's auto-vectorization for the bubble-sort benchmark of John Hennessy's collection of small integer benchmarks.
The reason is that auto-vectorization turns two 4-byte stores into one 8-byte store, and
in the next iteration of bubble-sort two 4-byte loads are
auto-vectorized into an 8-byte load, but this load only partially
overlaps the store. This results in taking a slow path in
store-to-load forwarding. By contrast, without auto-vectorization the
stores and the loads are 4-byte wide, store-to-load forwarding sees a
full overlap, and a fast path is taken.
I found that gcc-14.2 is significantly less aggressive in vectorizing
than gcc-12.2, but still incurs the above-mentioned slowdown. But I
only checked that later. First I wondered whether gcc-14.2 would
still see a slowdown from auto-vectorization, and in which
store-to-load forwarding cases it would happen. You can find the
results at <https://www.complang.tuwien.ac.at/anton/stwlf/>.
For those who want the gist:
* Narrow (8-byte) completely overlapping store-to-load forwarding (all
those cases we see in the -O code) is fast on Zen 3 and Zen 4 in all
measured cases, and on the other microarchitectures in most measured
cases.
* Wide (16-byte) completely overlapping store-to-load forwarding (-O3
code for the wl>ws=>wl case) is significantly slower on those
machines where the non-vectorized counterpart is fast (Zen4, Zen3,
Gracemont), but on a number of uarchs the non-vectorized counterpart
has the same slowdown.
* Narrow-to-wide or partially overlapping wide-to-wide store-to-load
forwarding is very slow and tends to become slower (in cycles) with
newer generations. It is already slow if the dependency chain ends
soon after the wide load, and cases involving recurrences tend to be
even slower.
* Wide-store-to-narrow-load forwarding is cheap.
So it seems that unless the compiler has very good knowledge that the
wide load was not preceded by a recent store to one of the involved
addresses, it is better not to vectorize two narrow loads into a wide
load.
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
from gcc's auto-vectorization for the bubble-sort benchmark of John
Hennessy's collection of small integer benchmarks.
Have you considered how hard it is to track part of a register's
contents {RATHER THAN "In Its Entirety"} ??
in the next iteration of bubble-sort two 4-byte loads are
auto-vectorized into an 8-byte load, but this load only partially
overlaps the store. This results in taking a slow path in
store-to-load forwarding. By contrast, without auto-vectorization the
stores and the loads are 4-byte wide, store-to-load forwarding sees a
full overlap, and a fast path is taken.
Why does anyone run bubble sort these days ??
Testing compilers with bubble sort is akin to taking a model-T to
race against top fuel dragsters.
* Wide (16-byte) completely overlapping store-to-load forwarding (-O3
code for the wl>ws=>wl case) is significantly slower on those
machines where the non-vectorized counterpart is fast (Zen4, Zen3,
Gracemont), but on a number of uarchs the non-vectorized counterpart
has the same slowdown.
This is an argument that all µArchitectures should perform as close
to optimal as possible on a SINGLE code sequence. Compiler output
code should be near-optimal on every implementation. That it is not
has the fickle finger of fate pointing at ISA and µArch instead of
compiler or source code.
* Narrow-to-wide or partially overlapping wide-to-wide store-to-load
forwarding is very slow and tends to become slower (in cycles) with
newer generations. It is already slow if the dependency chain ends
soon after the wide load, and cases involving recurrences tend to be
even slower.
This is a FUNDAMENTALLY HARD problem. It is (and will remain)
faster to never do this--it is the data-path equivalent of false
sharing of cache lines in ATOMIC events.
Forwarding of full registers is already a BigO(n^2) problem
Forwarding of aligned partial registers a BigO(n^3) problem
Forwarding of random partial registers a BigO(n^4) problem
It is NEVER going to be fast (if the rest of the machine is fast).
It is often wise not to vectorize a lot of stuff--especially
those things which are fundamentally hard.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
from gcc's auto-vectorization for the bubble-sort benchmark of John
Hennessy's collection of small integer benchmarks.
Have you considered how hard it is to track part of a register's
contents {RATHER THAN "In Its Entirety"} ??
This is not about registers. This is about memory. The performance
hit for partial store-to-load forwarding seems to be even higher than
for partial register stalls.
However, the Pentium II seems to have reduced the penalty for
partial-register stalls compared to the Pentium Pro. What did they do?
Also, concerning partial-register stalls, the zero-extending semantics
of AMD64 prevented that for 32-bit writes followed by 64-bit loads of
GPRs. But the AVX and AVX-512 semantics merge a partial-register
write (e.g., of an XMM register on AVX-256) with the current content.
I think several microarchitectures have an optimization of that if the
high bits are 0.
in the next iteration of bubble-sort two 4-byte loads are
auto-vectorized into an 8-byte load, but this load only partially
overlaps the store. This results in taking a slow path in
store-to-load forwarding. By contrast, without auto-vectorization the
stores and the loads are 4-byte wide, store-to-load forwarding sees a
full overlap, and a fast path is taken.
Why does anyone run bubble sort these days ??
Because John Hennessy <https://en.wikipedia.org/wiki/John_L._Hennessy>
used it as a benchmark in the early 1980s, and it has been translated
to Forth by Marty Fraeman a few years later, which gives us a
benchmark where we can compare C compilers and Forth compilers. Why
did Hennessy choose bubble-sort? Quicksort and mergesort were
well-known at that time. He apparently collected small integer
benchmarks that were available at the time, and apparently
bubble-sort satisfied his requirements even though one would use
other sorting algorithms unless the data has very specific properties.
In this case bubble-sort highlights a particular performance weakness
of a particular compiler on current microarchitectures, but I expect
that similar things happen in other code, even if they do not happen
frequently enough to stick out like a sore thumb the way they do for
bubble-sort.
E.g., in stack-based interpreters you will often have two writes to
the stack followed by loads of two stack items. If gcc
auto-vectorizes the two loads into a single wide load, the wide load
will probably see similar slowdowns.
Testing compilers with bubble sort is akin to taking a model-T to
race against top fuel dragsters.
Shouldn't a competent compiler compile bubble-sort to fast code, too?
gcc -O certainly does. clang -O3 does, too. gcc -O3
-fno-tree-vectorize does, too. But gcc -O3 falls into the partial
store-to-load forwarding pitfall.
* Wide (16-byte) completely overlapping store-to-load forwarding (-O3
code for the wl>ws=>wl case) is significantly slower on those
machines where the non-vectorized counterpart is fast (Zen4, Zen3,
Gracemont), but on a number of uarchs the non-vectorized counterpart
has the same slowdown.
This is an argument that all µArchitectures should perform as close
to optimal as possible on a SINGLE code sequence. Compiler output
code should be near-optimal on every implementation. That it is not
has the fickle finger of fate pointing at ISA and µArch instead of
compiler or source code.
What problem do you see in the ISA?
How would you prevent it?
Concerning the uArch, sure it would be cool if the heroic
optimizations that make 8-byte fully overlapping store-to-load
forwarding fast on Zen3 and Zen4 and, in many cases, also on other
uArchs, would also apply to 16-byte fully overlapping store-to-load forwarding, but if you ask me whether I prefer Zen3/4 behaviour for
the wl>ws=>wl case to the Rocket Lake case, I actually prefer the
Zen3/4 behaviour:
 Zen 4    | Zen 3    |Rocket Lake|
 -O   -O3 | -O   -O3 | -O    -O3 | dependencies
3.09 8.80 |3.18 8.87 |5.12  5.04 | wl>ws=>wl (recurrence), nl>ns (no recurrence)
Zen3/4 gives us a way to make this case very fast, and has no slow
cases for narrow full-width forwarding, at least not in the cases
looked at in this microbenchmark. By contrast, Rocket Lake has more
cases where narrow store-to-load forwarding performs
worse, but at least its wide-to-wide forwarding is not any worse.
* Narrow-to-wide or partially overlapping wide-to-wide store-to-load
forwarding is very slow and tends to become slower (in cycles) with
newer generations. It is already slow if the dependency chain ends
soon after the wide load, and cases involving recurrences tend to be
even slower.
This is a FUNDAMENTALLY HARD problem. It is (and will remain)
faster to never do this--it is the data-path equivalent of false
sharing of cache lines in ATOMIC events.
Forwarding of full registers is already a BigO(n^2) problem
Forwarding of aligned partial registers a BigO(n^3) problem
Forwarding of random partial registers a BigO(n^4) problem
It is NEVER going to be fast (if the rest of the machine is fast).
How would one achieve similar benefits for narrow-to-wide forwarding
or partial wide-to-wide forwarding?
It is often wise not to vectorize a lot of stuff--especially
those things which are fundamentally hard.
Exactly.
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
from gcc's auto-vectorization for the bubble-sort benchmark of John
Hennessy's collection of small integer benchmarks.
Have you considered how hard it is to track part of a register's
contents {RATHER THAN "In Its Entirety"} ??
Consider a string of x86 instructions that write to different bytes
of the same register.
a) do you want to blow-up the forwarding path by 4× ??
b) do you want each forwarding path-portion to select between
4 places in any result ??
My guess is no. Thus, quit using partial registers and get on with life.