• store to wide load forwarding

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jan 31 11:33:30 2026
    From Newsgroup: comp.arch

    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks. The reason is that auto-vectorization turns two 4-byte stores into one 8-byte store, and
    in the next iteration of bubble-sort two 4-byte loads are
    auto-vectorized into an 8-byte load, but this load only partially
    overlaps the store. This results in taking a slow path in
    store-to-load forwarding. By contrast, without auto-vectorization the
    stores and the loads are 4-byte wide, store-to-load forwarding sees a
    full overlap, and a fast path is taken.

    I found that gcc-14.2 is significantly less aggressive in vectorizing
    than gcc-12.2, but still incurs the above-mentioned slowdown. But I
    only checked that later. First I wondered whether gcc-14.2 would
    still see a slowdown from auto-vectorization, and in which
    store-to-load forwarding cases it would happen. You can find the
    results at <https://www.complang.tuwien.ac.at/anton/stwlf/>.

    For those who want the gist:

    * Narrow (8-byte) completely overlapping store-to-load forwarding (all
    those cases we see in the -O code) is fast on Zen 3 and Zen 4 in all
    measured cases, and on the other microarchitectures in most measured
    cases.

    * Wide (16-byte) completely overlapping store-to-load forwarding (-O3
    code for the wl>ws=>wl case) is significantly slower on those
    machines where the non-vectorized counterpart is fast (Zen4, Zen3,
    Gracemont), but on a number of uarchs the non-vectorized counterpart
    has the same slowdown.

    * Narrow-to-wide or partially overlapping wide-to-wide store-to-load
    forwarding is very slow and tends to become slower (in cycles) with
    newer generations. It is already slow if the dependency chain ends
    soon after the wide load, and cases involving recurrences tend to be
    even slower.

    * Wide-store-to-narrow-load forwarding is cheap.

    So it seems that unless the compiler has very good knowledge that the
    wide load was not preceded by a recent store to one of the involved
    addresses, it is better not to vectorize two narrow loads into a wide
    load.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Jan 31 12:27:40 2026
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John Hennessy's collection of small integer benchmarks.

    What is the PR number?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Jan 31 21:11:24 2026
    From Newsgroup: comp.arch

    On Sat, 31 Jan 2026 18:42:49 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I have seen big slowdowns (factor 5.7 on a Rocket Lake with
    gcc-14.2) from gcc's auto-vectorization for the bubble-sort
    benchmark of John Hennessy's collection of small integer
    benchmarks.

    What is the PR number?

    By now you should know that I consider gcc bug reports a waste of
    time. I last told you that in
    <2025Jul15.080403@mips.complang.tuwien.ac.at> and gave PR93811 as an
    example where I have wasted my time with creating a PR, and the status
    of this PR has not changed in the meantime.

    You seem to think that it is worthwhile creating gcc bug reports, so
    go ahead and create one yourself. I think the web page contains all information necessary, but if you miss something, let me know.

    - anton

    I had an above-zero success rate with gcc bug reports related to
    compiler pessimization. Maybe 10%. Maybe even 15%. I didn't really
    count.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Jan 31 21:21:42 2026
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks.

    What is the PR number?

    By now you should know that I consider gcc bug reports a waste of
    time.

    Posting to this newsgroup certainly is, at least as far as actually accomplishing anything is concerned. Otherwise, you would have
    at least a chance of having this fixed, especially if it is
    a regression.

    But let me qualify the above statement: Make a self-contained, small
    test case, and I'll submit a PR for you.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 1 00:33:15 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't really
    count.

    The case in point appears to be a regression; regressions are supposed
    to be fixed, and receive much higher attention than "normal" bugs.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 1 11:06:29 2026
    From Newsgroup: comp.arch

    On Sun, 1 Feb 2026 00:33:15 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't
    really count.

    The case in point appears to be a regression, which are supposed to be
    fixed, and receive much higher attention than "normal" bugs.


    According to my experience, it could receive higher attention than
    average pessimization case, but there is close to zero chance
    that it would be fixed at the end.
    The typical scenario for such cases is that they "fall between chairs"
    of tree optimization and target code generation and neither party is
    taking responsibility.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 1 09:16:27 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't really
    count.

    Maybe some figures to put this into perspective.

    In 2025, 593 missed-optimization bugs were closed, most of them
    marked as fixed, 528 new ones were submitted. As of today, there
    are 3672 missed-optimization bugs open, so we are looking at around
    a 6-year average turnover.

    97 missed-optimization regressions were submitted in 2025 and 174
    were closed; 318 missed-optimization regressions are open
    right now, so it is more of a two-year average turnover (and
    there seems to be progress in reducing those).

    I chose 2025 because it is easy to search for; it does not
    correspond to a gcc release cycle.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 1 09:17:11 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 1 Feb 2026 00:33:15 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't
    really count.

    The case in point appears to be a regression, which are supposed to be
    fixed, and receive much higher attention than "normal" bugs.


    According to my experience, it could receive higher attention than
    average pessimization case, but there is close to zero chance
    that it would be fixed at the end.
    The typical scenario for such cases is that they "fall between chairs"
    of tree optimization and target code generation and neither party is
    taking responsibility.

    Do you have an example?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sun Feb 1 11:50:29 2026
    From Newsgroup: comp.arch

    On 01/02/2026 01:33, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't really
    count.

    As have I.

    There is also the fact that not all changes or improvements that might
    have been inspired by bug reports lead to comments or closure on the
    bug report itself. Like all large development projects, there is a
    mismatch between people's interest in doing the technical work of
    improving the program and their interest in paperwork and bureaucracy.
    That is particularly true for things that involve larger, more
    structural or algorithmic changes, rather than individual small patches.


    The case in point appears to be a regression, which are supposed to be
    fixed, and receive much higher attention than "normal" bugs.


    Yes, that's the idea. You still won't get a 100% hit rate, but it will
    be higher than for general "missed optimisation" reports.

    And of course with any major changes to code generation, there are
    likely to be some regressions - if it's 3 steps forward and 1 step back,
    it can be a positive improvement in general even though there are
    regressions. If that's the case, then the answer is likely to be a
    compiler flag or tuneable for helping the particular code.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 1 11:31:56 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    I very rarely submit "missed optimization" bugs.
    As far as I am concerned, missed optimization is not a bug, it is
    business as usual, with the hope of improvement in the future.

    It still makes sense to submit a PR (which stands for Problem
    Report) so people can look at it when they want to improve
    code generation. I have submitted quite a few of these - of the
    733 PRs I have submitted so far, 120 were missed-optimization.

    My cases are nearly always "compiler tries to be too smart with
    horrible consequences" rather than "compiler is too stupid".

    Keyword wrong-code then, or suboptimal code generation? The latter
    is also classified as missed-optimization.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 1 13:35:16 2026
    From Newsgroup: comp.arch

    On Sun, 1 Feb 2026 09:17:11 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 1 Feb 2026 00:33:15 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't
    really count.

    The case in point appears to be a regression, which are supposed
    to be fixed, and receive much higher attention than "normal" bugs.


    According to my experience, it could receive higher attention than
    average pessimization case, but there is close to zero chance
    that it would be fixed at the end.
    The typical scenario for such cases is that they "fall between
    chairs" of tree optimization and target code generation and neither
    party is taking responsibility.

    Do you have an example?

    Here is a good example
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
    By chance, it is remotely related to Anton's case.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 1 13:54:11 2026
    From Newsgroup: comp.arch

    On Sun, 1 Feb 2026 13:18:17 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Sun, 1 Feb 2026 09:16:27 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't
    really count.

    Maybe some figures to put this into perspective.

    In 2025, 593 missed-optimization bugs were closed, most of them
    marked as fixed, 528 new ones were submitted. As of today, there
    are 3672 missed-optimization bugs open, so we are looking at arount
    a 6 year average turnover.

    97 missed-optimization regressions were submitted in 2025, with
    174 of them closed, with 318 missed-optimization regressions open
    right now, so it is more of a two-year average turnover (and
    there seems to be progress in reducing those).

    I chose 2025 because it is easy to search for; it does not
    correspond to a gcc release cycle.


    I very rarely submit "missed optimization" bugs.
    As far as I am concerned, missed optimization is not a bug, it is
    business as usual, with the hope of improvement in the future.
    My cases are nearly always "compiler tries to be too smart with
    horrible consequences" rather than "compiler is too stupid".


    I just re-checked one of very few cases where what I reported could be
    properly called "missed optimization". https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86975

    It was fixed in gcc14, 5 or 6 years later.

    https://godbolt.org/z/T1o34ean8

    "The mills of Gcc maintanance grind slow, but they grind exceedingly
    fine"

    Unfortunately, only on MIPS, for which I care very little.
    It was not fixed on Nios2, the architecture that I really care about,
    because Nios2 is no longer supported by gcc.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 1 11:58:58 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    Here is a good example
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
    By chance, it is remotely related to Anton's case.

    That's a good one, and would apparently take quite some work to fix.
    I've pinged it, BTW.

    By the way, regarding your rate of fixed bugs: I see 15 bugs
    submitted, 3 as WONTFIX, 5 as FIXED. If you take out the WONTFIX
    (for an architecture which is no longer supported due to lack of
    a maintainer), you have a 42% success rate so far, at least for
    the e-mail address in the PR, not 10-15% as you estimated.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 1 14:11:18 2026
    From Newsgroup: comp.arch

    On Sun, 1 Feb 2026 11:58:58 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    Here is a good example
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
    By chance, it is remotely related to Anton's case.

    That's a good one, and would apparently quite some work to fix.
    I've pinged it, BTW.

    By the way, regarding your quota of fixed bugs: I see 15 bugs
    submitted, 3 as WONTFIX, 5 as FIXED. If you take out the WONTFIX
    (for an architecture which is no longer supported due to lack of
    a maintainer), you have a 42% success quota so far, at least for
    the e-mail address in the PR, not 10-15% as you estimated.


    Here is one case that I did not submit, not just out of laziness but
    also because I was not sure whether it is a bug or a feature.
    https://www.realworldtech.com/forum/?threadid=226267&curpostid=226267
    Although the observation made by Freddie causes me to believe that the
    change of behavior was not intentional.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 5 01:48:35 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John Hennessy's collection of small integer benchmarks.

    Have you considered how hard it is to track part of a register's
    contents {RATHER THAN "In Its Entirety"} ??

    Consider a string of x86 instructions that write to different bytes
    of the same register.
    a) do you want to blow up the forwarding path by 4x ??
    b) do you want each forwarding path-portion to select between
    4 places in any result ??

    My guess is no. Thus, quit using partial registers and get on with life.

    The reason is that auto-vectorization turns two 4-byte stores into one 8-byte store, and

    With established SIMD semantics, auto-vectorization is dangerous.
    With better semantics--maybe not quite as bad.

    in the next iteration of bubble-sort two 4-byte loads are
    auto-vectorized into an 8-byte load, but this load only partially
    overlaps the store. This results in taking a slow path in
    store-to-load forwarding. By contrast, without auto-vectorization the
    stores and the loads are 4-byte wide, store-to-load forwarding sees a
    full overlap, and a fast path is taken.

    Why does anyone run bubble sort these days ??
    Use Merge sort or quicksort depending on properties of your data.

    Testing compilers with bubble sort is akin to taking a model-T to
    race against top fuel dragsters.

    I found that gcc-14.2 is significantly less aggressive in vectorizing
    than gcc-12.2, but still incurs the above-mentioned slowdown. But I
    only checked that later. First I wondered whether gcc-14.2 would
    still see a slowdown from auto-vectorization, and in which
    store-to-load forwarding cases it would happen. You can find the
    results at <https://www.complang.tuwien.ac.at/anton/stwlf/>.

    For those who want the gist:

    * Narrow (8-byte) completely overlapping store-to-load forwarding (all
    those cases we see in the -O code) is fast on Zen 3 and Zen 4 in all
    measured cases, and on the other microarchitectures in most measured
    cases.

    * Wide (16-byte) completely overlapping store-to-load forwarding (-O3
    code for the wl>ws=>wl case) is significantly slower on those
    machines where the non-vectorized counterpart is fast (Zen4, Zen3,
    Gracemont), but on a number of uarchs the non-vectorized counterpart
    has the same slowdown.

    This is an argument that all µArchitectures should perform as close
    to optimal as possible on a SINGLE code sequence. Compiler output
    code should be near-optimal on every implementation. That it is not
    has the fickle finger of fate pointing at ISA and µArch instead of
    compiler or source code.

    * Narrow-to-wide or partially overlapping wide-to-wide store-to-load
    forwarding is very slow and tends to become slower (in cycles) with
    newer generations. It is already slow if the dependency chain ends
    soon after the wide load, and cases involving recurrences tend to be
    even slower.

    This is a FUNDAMENTALLY HARD problem. It is (and will remain being)
    faster to never do this--it is the data-path equivalent of false
    sharing of cache lines in ATOMIC events.

    Forwarding of full registers is already a BigO(n^2) problem
    Forwarding of aligned partial registers a BigO(n^3) problem
    Forwarding of random partial registers a BigO(n^4) problem

    It is NEVER going to be fast (if the rest of the machine is fast).

    * Wide-store-to-narrow-load forwarding is cheap.

    So it seems that unless the compiler has very good knowledge that the
    wide load was not preceded by a recent store to one of the involved addresses, it is better not to vectorize two narrow loads into a wide
    load.

    It is often wise not to vectorize a lot of stuff--especially
    those things which are fundamentally hard.

    - anton
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Feb 5 15:26:22 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks.

    Have you considered how hard it is to track part of a register's
    contents {RATHER THAN "In Its Entirety"} ??

    This is not about registers. This is about memory. The performance
    hit for partial store-to-load forwarding seems to be even higher than
    for partial register stalls.

    However, the Pentium II seems to have reduced the penalty for
    partial-register stalls compared to the Pentium Pro. What did they
    do?

    Also, concerning partial-register stalls, the zero-extending semantics
    of AMD64 prevented that for 32-bit writes followed by 64-bit loads of
    GPRs. But the AVX and AVX-512 semantics merge a partial-register
    write (e.g., of an XMM register on AVX-256) with the current content.
    I think several microarchitectures have an optimization of that if the
    high bits are 0.

    in the next iteration of bubble-sort two 4-byte loads are
    auto-vectorized into an 8-byte load, but this load only partially
    overlaps the store. This results in taking a slow path in
    store-to-load forwarding. By contrast, without auto-vectorization the
    stores and the loads are 4-byte wide, store-to-load forwarding sees a
    full overlap, and a fast path is taken.

    Why does anyone run bubble sort these days ??

    Because John Hennessy <https://en.wikipedia.org/wiki/John_L._Hennessy>
    used it as a benchmark in the early 1980s, and it has been translated
    to Forth by Marty Fraeman a few years later, which gives us a
    benchmark where we can compare C compilers and Forth compilers. Why
    did Hennessy choose bubble-sort? Quicksort and mergesort were
    well-known at that time. He apparently collected small integer
    benchmarks that were available at the time, and apparently bubble-sort satisfied his requirements even though one would use other sorting
    algorithms unless the data has very specific properties.

    In this case bubble-sort highlights a particular performance weakness
    of a particular compiler on current microarchitectures, but I expect
    that similar things happen in other code, even if they do not happen
    frequently enough to stick out like a sore thumb as it happens for
    bubble-sort.

    E.g., in stack-based interpreters you will often have two writes to
    the stack followed by loads of two stack items. If gcc
    auto-vectorizes the two loads into a single wide load, the wide load
    will probably see similar slowdowns.

    Testing compilers with bubble sort is akin to taking a model-T to
    race against top fuel dragsters.

    Shouldn't a competent compiler compile bubble-sort to fast code, too?
    gcc -O certainly does. clang -O3 does, too. gcc -O3
    -fno-tree-vectorize does, too. But gcc -O3 falls into the partial store-to-load forwarding pitfall.

    * Wide (16-byte) completely overlapping store-to-load forwarding (-O3
    code for the wl>ws=>wl case) is significantly slower on those
    machines where the non-vectorized counterpart is fast (Zen4, Zen3,
    Gracemont), but on a number of uarchs the non-vectorized counterpart
    has the same slowdown.

    This is an argument that all µArchitectures should perform as close
    to optimal as possible on a SINGLE code sequence. Compiler output
    code should be near-optimal on every implementation. That it is not
    has the fickle finger of fate pointing at ISA and µArch instead of
    compiler or source code.

    What problem do you see in the ISA? How would you prevent it?

    Concerning the uArch, sure it would be cool if the heroic
    optimizations that make 8-byte fully overlapping store-to-load
    forwarding fast on Zen3 and Zen4 and, in many cases, also on other
    uArchs, would also apply to 16-byte fully overlapping store-to-load
    forwarding, but if you ask me whether I prefer Zen3/4 behaviour for
    the wl>ws=>wl case to the Rocket Lake case, I actually prefer the
    Zen3/4 behaviour:

    Zen 4      | Zen 3     |Rocket Lake|
     -O   -O3  |  -O   -O3 |  -O   -O3 | dependencies
    3.09  8.80 |3.18  8.87 |5.12  5.04 | wl>ws=>wl (recurrence), nl>ns (no recurrence)

    Zen3/4 gives us a way to make this case very fast, and has no slow
    cases for narrow full-width forwarding, at least not in the cases
    looked at in this microbenchmark. By contrast, Rocket Lake has more
    cases where narrow store-to-load forwarding performs
    worse, but at least its wide-to-wide forwarding is not any worse.

    * Narrow-to-wide or partially overlapping wide-to-wide store-to-load
    forwarding is very slow and tends to become slower (in cycles) with
    newer generations. It is already slow if the dependency chain ends
    soon after the wide load, and cases involving recurrences tend to be
    even slower.

    This is a FUNDAMENTALLY HARD problem. It is (and will remain being)
    faster to never do this--it is the data-path equivalence to false
    sharing of cache line in ATOMIC events.

    Forwarding of full registers is already a BigO(n^2) problem
    Forwarding of aligned partial registers a BigO(n^3) problem
    Forwarding of random partial registers a BigO(n^4) problem

    It is NEVER going to be fast (if the rest of the machine is fast).

    How would one achieve similar benefits for narrow-to-wide forwarding
    or partial wide-to-wide forwarding? One possibility would be to split
    the wide load into several loads (but that has lots of possible cases:
    Consider two byte stores to non-adjacent addresses followed by a
    512-bit load that includes both of these byte store addresses),
    possibly using a history-based predictor for determining whether the
    load is going to overlap a recently-stored-to address range; there
    already is (or was in an Intel processor, at one time, around Haswell
    or so) such a predictor to allow alias detection for full-width
    store-to-load forwarding, maybe that could be enhanced for the partially-overlapping case. Still it seems hard to achieve that, and
    the benefit is probably small on most programs.

    OTOH, the compiler has a hard time knowing whether some recent store
    has been to overlapping memory, so a hardware solution would certainly
    be ideal, just as it has been in other cases where the compiler has insufficient knowledge (e.g., branch prediction).

    But given what should a compiler that targets present hardware do? I
    think it should be more reluctant in applying this kind of
    vectorization, which I have found to be called SLP (superword-level parallelism) and can be disabled in gcc with -fno-tree-slp-vectorize.

    It is often wise not to vectorize a lot of stuff--especailly
    those things which are fundamentally hard.

    Exactly.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 5 19:20:02 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 3317: '\xC2'" while decoding:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks.

    Have you considered how hard it is to track part of a register's
    contents {RATHER THAN "In Its Entirety"} ??

    This is not about registers. This is about memory. The performance
    hit for partial store-to-load forwarding seems to be even higher than
    for partial register stalls.

    Perhaps there is a rule in the memory model that individual portions
    of a cache line have to be updated (written) in program order, and
    the delay you see is the ST queue waiting for data to arrive.

    However, the Pentium II seems to have reduced the penalty for partial-register stalls compared to the Pentium Pro. What did they
    do?

    Went BigO(n^3) on forwarding.

    Also, concerning partial-register stalls, the zero-extending semantics
    of AMD64 prevented that for 32-bit writes followed by 64-bit loads of
    GPRs. But the AVX and AVX-512 semantics merge a partial-register
    write (e.g., of an XMM register on AVX-256) with the current content.
    I think several microarchitectures have an optimization of that if the
    high bits are 0.

    in the next iteration of bubble-sort two 4-byte loads are
    auto-vectorized into an 8-byte load, but this load only partially
    overlaps the store. This results in taking a slow path in
    store-to-load forwarding. By contrast, without auto-vectorization the
    stores and the loads are 4-byte wide, store-to-load forwarding sees a
    full overlap, and a fast path is taken.

    Why does anyone run bubble sort these days ??

    Because John Hennessy <https://en.wikipedia.org/wiki/John_L._Hennessy>
    used it as a benchmark in the early 1980s, and it has been translated
    to Forth by Marty Fraeman a few years later, which gives us a
    benchmark where we can compare C compilers and Forth compilers. Why
    did Hennessy choose bubble-sort? Quicksort and mergesort were
    well-known at that time. He apparently collected small integer
    benchmarks that were available at the time, and apparently
    bubble-sort satisfied his requirements, even though one would use
    other sorting algorithms unless the data has very specific
    properties.

    In this case bubble-sort highlights a particular performance weakness
    of a particular compiler on current microarchitectures, but I expect
    that similar things happen in other code, even if they do not happen
    frequently enough to stick out like a sore thumb, as they do for
    bubble-sort.

    E.g., in stack-based interpreters you will often have two writes to
    the stack followed by loads of two stack items. If gcc
    auto-vectorizes the two loads into a single wide load, the wide load
    will probably see similar slowdowns.

    Here is where HW is a better auto-vectorizer than SW. HW can run the
    2 accesses to the stack as their RAW hazards resolve, whereas a
    single wider read has to wait for both portions to be written.
    Thus, the data-flow dependencies are easier without throwing SIMD
    at the problem(s).

    Testing compilers with bubble sort is akin to taking a model-T to
    race against top fuel dragsters.

    Shouldn't a competent compiler compile bubble-sort to fast code, too?

    A fast BigO(n^2) algorithm can never compete with a mediocre
    BigO(n*ln(n)) algorithm.

    gcc -O certainly does. clang -O3 does, too. gcc -O3
    -fno-tree-vectorize does, too. But gcc -O3 falls into the partial
    store-to-load forwarding pitfall.

    What happens in the quicksort() compiled code ??

    * Wide (16-byte) completely overlapping store-to-load forwarding (-O3
    code for the wl>ws=>wl case) is significantly slower on those
    machines where the non-vectorized counterpart is fast (Zen4, Zen3,
    Gracemont), but on a number of uarchs the non-vectorized counterpart
    has the same slowdown.

    This is an argument that all μArchitectures should perform as close
    to optimal as possible on a SINGLE code sequence. Compiler output
    code should be near-optimal on every implementation. That it is not
    has the fickle finger of fate pointing at ISA and μArch instead of
    compiler or source code.

    What problem do you see in the ISA?

    Too many additions over the decades make it impossible for the compiler
    to guess correctly all the time, and so many mostly-compatible additions
    have made it wearisome for the programmer to understand it all just so
    he can properly inform the compiler of this idiosyncrasy here and that
    one there.

    How would you prevent it?

    not use x86....

    Concerning the uArch, sure it would be cool if the heroic
    optimizations that make 8-byte fully overlapping store-to-load
    forwarding fast on Zen3 and Zen4 and, in many cases, also on other
    uArchs, would also apply to 16-byte fully overlapping store-to-load
    forwarding. But if you ask me whether I prefer the Zen3/4 behaviour
    for the wl>ws=>wl case to the Rocket Lake behaviour, I actually
    prefer the Zen3/4 behaviour:

    Zen 4        | Zen 3        | Rocket Lake  |
    -O      -O3  | -O      -O3  | -O      -O3  | dependencies
    3.09    8.80 | 3.18    8.87 | 5.12    5.04 | wl>ws=>wl (recurrence), nl>ns (no recurrence)

    Zen3/4 gives us a way to make this case very fast, and has no slow
    cases for narrow full-width forwarding, at least not in the cases
    looked at in this microbenchmark. By contrast, Rocket Lake has more
    cases where narrow store-to-load forwarding performs
    worse, but at least its wide-to-wide forwarding is not any worse.

    * Narrow-to-wide or partially overlapping wide-to-wide store-to-load
    forwarding is very slow and tends to become slower (in cycles) with
    newer generations. It is already slow if the dependency chain ends
    soon after the wide load, and cases involving recurrences tend to be
    even slower.

    This is a FUNDAMENTALLY HARD problem. It is (and will remain)
    faster to never do this--it is the data-path equivalent of false
    sharing of a cache line in ATOMIC events.

    Forwarding of full registers is already a BigO(n^2) problem
    Forwarding of aligned partial registers a BigO(n^3) problem
    Forwarding of random partial registers a BigO(n^4) problem

    It is NEVER going to be fast (if the rest of the machine is fast).

    How would one achieve similar benefits for narrow-to-wide forwarding
    or partial wide-to-wide forwarding?

    Remove it from the compiler, and make HW do something about it.

    It is often wise not to vectorize a lot of stuff--especially
    those things which are fundamentally hard.

    Exactly.

    - anton
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jan 31 18:42:49 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks.

    What is the PR number?

    By now you should know that I consider gcc bug reports a waste of
    time. I last told you that in
    <2025Jul15.080403@mips.complang.tuwien.ac.at> and gave PR93811 as an
    example where I have wasted my time with creating a PR, and the status
    of this PR has not changed in the meantime.

    You seem to think that it is worthwhile creating gcc bug reports, so
    go ahead and create one yourself. I think the web page contains all information necessary, but if you miss something, let me know.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 1 13:18:17 2026
    From Newsgroup: comp.arch

    On Sun, 1 Feb 2026 09:16:27 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't
    really count.

    Maybe some figures to put this into perspective.

    In 2025, 593 missed-optimization bugs were closed, most of them
    marked as fixed, 528 new ones were submitted. As of today, there
    are 3672 missed-optimization bugs open, so we are looking at around
    a 6 year average turnover.

    97 missed-optimization regressions were submitted in 2025, with
    174 of them closed, with 318 missed-optimization regressions open
    right now, so it is more of a two-year average turnover (and
    there seems to be progress in reducing those).

    I chose 2025 because it is easy to search for; it does not
    correspond to a gcc release cycle.


    I very rarely submit "missed optimization" bugs.
    As far as I am concerned, missed optimization is not a bug, it is
    business as usual, with the hope of improvement in the future.
    My cases are nearly always "compiler tries to be too smart with
    horrible consequences" rather than "compiler is too stupid".

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 15:11:44 2026
    From Newsgroup: comp.arch

    On 2/4/26 8:48 PM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks.

    Have you considered how hard it is to track part of a register's
    contents {RATHER THAN "In Its Entirety"} ??

    If half-register concatenation is not supported, I think one
    could avoid a significant amount of complexity. Even supporting
    concatenation only for stores might be similar to ARM's store
    pair support. If a RAT is used, one would double the number of
    entries. If full-sized values are allocated two distinct half-
    sized physical registers (to reduce physical size), they would
    be similar to paired register operations. If half-sized values
    are allocated to full-sized registers, a little simplification
    might be possible. Allocating and freeing twice as many physical
    registers per cycle seems rather painful.

    With full-sized registers, the lower of a pair of RAT entries
    could be used for full-sized and lower-half operands allowing
    concatenation errors to be detected when a full-sized operand's
    lower-half RAT entry does not match the upper half (or the upper
    half if not zero depending on whether one writes full values by
    replication or "zero extension").

    If the point is to provide more named values, concatenation does
    not seem to be important.

    Consider a string of x86 instructions that write to different bytes
    of the same register.
    a) do you want to blow-up the forwarding path by 4× ??
    b) do you want each forwarding path-portion to select between
    4 places in any result ??

    If one limits supported operations to add/subtract and the
    bitwise logical operations, it seems that one could handle the
    operations that only use upper or lower halves for sources and
    destinations without much extra logic. Operations with upper and
    lower sources and an upper destination might only need to
    convert the operations to a shift-by-32-and-operate of the lower
    operand and perhaps some of the time for carry propagation could
    be taken to cover the shift latency? Mixed inputs and lower
    output would not be able to hide extra shift latency; not
    providing such operations or having such have two-cycle latency
    to allow a post-calculation 32-bit shift might be acceptable.

    (Variable shift might be allowed if the variable is in a lower
    half.)

    My guess is no. Thus, quit using partial registers and get on with life.

    I have certainly not thought deeply about the costs and probably
    lack sufficient knowledge of hardware design to make reasonable
    estimates. Yet for z/Architecture IBM chose to support half-
    sized operations to increase the number of independent values
    that can be in registers, admittedly because of the encoding
    constraint of a legacy architecture. If I recall correctly, AMD
    also chose to support double-"native" width of SIMD by using
    twice as many operands, which is a purely microarchitectural
    choice prioritizing resources for "native" width while
    supporting instructions with twice the SIMD width.
    --- Synchronet 3.21b-Linux NewsLink 1.2