• store to wide load forwarding

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jan 31 11:33:30 2026
    From Newsgroup: comp.arch

    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks. The reason is that auto-vectorization turns two 4-byte stores into one 8-byte store, and
    in the next iteration of bubble-sort two 4-byte loads are
    auto-vectorized into an 8-byte load, but this load only partially
    overlaps the store. This results in taking a slow path in
    store-to-load forwarding. By contrast, without auto-vectorization the
    stores and the loads are 4-byte wide, store-to-load forwarding sees a
    full overlap, and a fast path is taken.

    I found that gcc-14.2 is significantly less aggressive in vectorizing
    than gcc-12.2, but still incurs the above-mentioned slowdown. But I
    only checked that later. First I wondered whether gcc-14.2 would
    still see a slowdown from auto-vectorization, and in which
    store-to-load forwarding cases it would happen. You can find the
    results at <https://www.complang.tuwien.ac.at/anton/stwlf/>.

    For those who want the gist:

    * Narrow (8-byte) completely overlapping store-to-load forwarding (all
    those cases we see in the -O code) is fast on Zen 3 and Zen 4 in all
    measured cases, and on the other microarchitectures in most measured
    cases.

    * Wide (16-byte) completely overlapping store-to-load forwarding (-O3
    code for the wl>ws=>wl case) is significantly slower on those
    machines where the non-vectorized counterpart is fast (Zen4, Zen3,
    Gracemont), but on a number of uarchs the non-vectorized counterpart
    has the same slowdown.

    * Narrow-to-wide or partially overlapping wide-to-wide store-to-load
    forwarding is very slow and tends to become slower (in cycles) with
    newer generations. It is already slow if the dependency chain ends
    soon after the wide load, and cases involving recurrences tend to be
    even slower.

    * Wide-store-to-narrow-load forwarding is cheap.

    So it seems that unless the compiler has very good knowledge that the
    wide load was not preceded by a recent store to one of the involved
    addresses, it is better not to vectorize two narrow loads into a wide
    load.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Jan 31 12:27:40 2026
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John Hennessy's collection of small integer benchmarks.

    What is the PR number?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Jan 31 21:11:24 2026
    From Newsgroup: comp.arch

    On Sat, 31 Jan 2026 18:42:49 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I have seen big slowdowns (factor 5.7 on a Rocket Lake with
    gcc-14.2) from gcc's auto-vectorization for the bubble-sort
    benchmark of John Hennessy's collection of small integer
    benchmarks.

    What is the PR number?

    By now you should know that I consider gcc bug reports a waste of
    time. I last told you that in
    <2025Jul15.080403@mips.complang.tuwien.ac.at> and gave PR93811 as an
    example where I have wasted my time with creating a PR, and the status
    of this PR has not changed in the meantime.

    You seem to think that it is worthwhile creating gcc bug reports, so
    go ahead and create one yourself. I think the web page contains all information necessary, but if you miss something, let me know.

    - anton

    I had an above-zero success rate with gcc bug reports related to
    compiler pessimization. Maybe 10%. Maybe even 15%. I didn't really
    count.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Jan 31 21:21:42 2026
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks.

    What is the PR number?

    By now you should know that I consider gcc bug reports a waste of
    time.

    Posting to this newsgroup certainly is, at least as far as actually accomplishing anything is concerned. Otherwise, you would have
    at least a chance of having this fixed, especially if it is
    a regression.

    But let me qualify the above statement: Make a self-contained, small
    test case, and I'll submit a PR for you.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 1 00:33:15 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't really
    count.

    The case in point appears to be a regression; regressions are supposed
    to be fixed, and receive much higher attention than "normal" bugs.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 1 11:06:29 2026
    From Newsgroup: comp.arch

    On Sun, 1 Feb 2026 00:33:15 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't
    really count.

    The case in point appears to be a regression, which are supposed to be
    fixed, and receive much higher attention than "normal" bugs.


    According to my experience, it could receive higher attention than
    average pessimization case, but there is close to zero chance
    that it would be fixed at the end.
    The typical scenario for such cases is that they "fall between chairs"
    of tree optimization and target code generation and neither party is
    taking responsibility.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 1 09:16:27 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't really
    count.

    Maybe some figures to put this into perspective.

    In 2025, 593 missed-optimization bugs were closed, most of them
    marked as fixed, 528 new ones were submitted. As of today, there
    are 3672 missed-optimization bugs open, so we are looking at around
    a 6-year average turnover.

    97 missed-optimization regressions were submitted in 2025 and 174
    were closed; 318 missed-optimization regressions are open
    right now, so it is more of a two-year average turnover (and
    there seems to be progress in reducing those).

    I chose 2025 because it is easy to search for; it does not
    correspond to a gcc release cycle.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 1 09:17:11 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 1 Feb 2026 00:33:15 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't
    really count.

    The case in point appears to be a regression, which are supposed to be
    fixed, and receive much higher attention than "normal" bugs.


    According to my experience, it could receive higher attention than
    average pessimization case, but there is close to zero chance
    that it would be fixed at the end.
    The typical scenario for such cases is that they "fall between chairs"
    of tree optimization and target code generation and neither party is
    taking responsibility.

    Do you have an example?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sun Feb 1 11:50:29 2026
    From Newsgroup: comp.arch

    On 01/02/2026 01:33, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't really
    count.

    As have I.

    There is also the fact that not all changes or improvements that might
    have been inspired by bug reports lead to comments or closure on the
    bug report itself. Like all large development projects, there is a
    mismatch between people's interest in doing the technical work of
    improving the program and their interest in paperwork and bureaucracy.
    That is particularly true for things that involve larger, more
    structural or algorithmic changes, rather than individual small patches.


    The case in point appears to be a regression, which are supposed to be
    fixed, and receive much higher attention than "normal" bugs.


    Yes, that's the idea. You still won't get a 100% hit rate, but it will
    be higher than for general "missed optimisation" reports.

    And of course with any major changes to code generation, there are
    likely to be some regressions - if it's 3 steps forward and 1 step back,
    it can be a positive improvement in general even though there are
    regressions. If that's the case, then the answer is likely to be a
    compiler flag or tuneable for helping the particular code.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 1 11:31:56 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    I very rarely submit "missed optimization" bugs.
    As far as I am concerned, missed optimization is not a bug, it is
    business as usual, with the hope of improvement in the future.

    It still makes sense to submit a PR (which stands for Problem
    Report) so people can look at it when they want to improve
    code generation. I have submitted quite a few of these - of the
    733 PRs I have submitted so far, 120 were missed-optimization.

    My cases are nearly always "compiler tries to be too smart with
    horrible consequences" rather than "compiler is too stupid".

    Keyword wrong-code then, or suboptimal code generation? The latter
    is also classified as missed-optimization.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 1 13:35:16 2026
    From Newsgroup: comp.arch

    On Sun, 1 Feb 2026 09:17:11 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 1 Feb 2026 00:33:15 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't
    really count.

    The case in point appears to be a regression, which are supposed
    to be fixed, and receive much higher attention than "normal" bugs.


    According to my experience, it could receive higher attention than
    average pessimization case, but there is close to zero chance
    that it would be fixed at the end.
    The typical scenario for such cases is that they "fall between
    chairs" of tree optimization and target code generation and neither
    party is taking responsibility.

    Do you have an example?

    Here is a good example
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
    By chance, it is remotely related to Anton's case.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 1 13:54:11 2026
    From Newsgroup: comp.arch

    On Sun, 1 Feb 2026 13:18:17 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Sun, 1 Feb 2026 09:16:27 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't
    really count.

    Maybe some figures to put this into perspective.

    In 2025, 593 missed-optimization bugs were closed, most of them
    marked as fixed, 528 new ones were submitted. As of today, there
    are 3672 missed-optimization bugs open, so we are looking at arount
    a 6 year average turnover.

    97 missed-optimization regressions were submitted in 2025, with
    174 of them closed, with 318 missed-optimization regressions open
    right now, so it is more of a two-year average turnover (and
    there seems to be progress in reducing those).

    I chose 2025 because it is easy to search for; it does not
    correspond to a gcc release cycle.


    I very rarely submit "missed optimization" bugs.
    As far as I am concerned, missed optimization is not a bug, it is
    business as usual, with the hope of improvement in the future.
    My cases are nearly always "compiler tries to be too smart with
    horrible consequences" rather than "compiler is too stupid".


    I just re-checked one of very few cases where what I reported could be
    properly called "missed optimization". https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86975

    It was fixed in gcc14, 5 or 6 years later.

    https://godbolt.org/z/T1o34ean8

    "The mills of Gcc maintanance grind slow, but they grind exceedingly
    fine"

    Unfortunately, only on MIPS, for which I care very little.
    It was not fixed on Nios2, the architecture that I really care about,
    because Nios2 is no longer supported by gcc.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Feb 1 11:58:58 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    Here is a good example
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
    By chance, it is remotely related to Anton's case.

    That's a good one, and would apparently take quite some work to fix.
    I've pinged it, BTW.

    By the way, regarding your rate of fixed bugs: I see 15 bugs
    submitted, 3 as WONTFIX, 5 as FIXED. If you take out the WONTFIX
    (for an architecture which is no longer supported due to lack of
    a maintainer), you have a 42% success rate so far, at least for
    the e-mail address in the PR, not 10-15% as you estimated.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 1 14:11:18 2026
    From Newsgroup: comp.arch

    On Sun, 1 Feb 2026 11:58:58 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    Here is a good example
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
    By chance, it is remotely related to Anton's case.

    That's a good one, and would apparently quite some work to fix.
    I've pinged it, BTW.

    By the way, regarding your quota of fixed bugs: I see 15 bugs
    submitted, 3 as WONTFIX, 5 as FIXED. If you take out the WONTFIX
    (for an architecture which is no longer supported due to lack of
    a maintainer), you have a 42% success quota so far, at least for
    the e-mail address in the PR, not 10-15% as you estimated.


    Here is one case that I did not submit, not just out of laziness but
    also because I was not sure whether it is a bug or a feature.
    https://www.realworldtech.com/forum/?threadid=226267&curpostid=226267
    Although the observation made by Freddie causes me to believe that the
    change of behavior was not intentional.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 5 01:48:35 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John Hennessy's collection of small integer benchmarks.

    Have you considered how hard it is to track part of a register's
    contents {RATHER THAN "In Its Entirety"} ??

    Consider a string of x86 instructions that write to different bytes
    of the same register.
    a) do you want to blow up the forwarding path by 4x ??
    b) do you want each forwarding path-portion to select between
    4 places in any result ??

    My guess is no. Thus, quit using partial registers and get on with life.

    The reason is that auto-vectorization turns two 4-byte stores into one 8-byte store, and

    With established SIMD semantics, auto-vectorization is dangerous.
    With better semantics--maybe not quite as bad.

    in the next iteration of bubble-sort two 4-byte loads are
    auto-vectorized into an 8-byte load, but this load only partially
    overlaps the store. This results in taking a slow path in
    store-to-load forwarding. By contrast, without auto-vectorization the
    stores and the loads are 4-byte wide, store-to-load forwarding sees a
    full overlap, and a fast path is taken.

    Why does anyone run bubble sort these days ??
    Use Merge sort or quicksort depending on properties of your data.

    Testing compilers with bubble sort is akin to taking a model-T to
    race against top fuel dragsters.

    I found that gcc-14.2 is significantly less aggressive in vectorizing
    than gcc-12.2, but still incurs the above-mentioned slowdown. But I
    only checked that later. First I wondered whether gcc-14.2 would
    still see a slowdown from auto-vectorization, and in which
    store-to-load forwarding cases it would happen. You can find the
    results at <https://www.complang.tuwien.ac.at/anton/stwlf/>.

    For those who want the gist:

    * Narrow (8-byte) completely overlapping store-to-load forwarding (all
    those cases we see in the -O code) is fast on Zen 3 and Zen 4 in all
    measured cases, and on the other microarchitectures in most measured
    cases.

    * Wide (16-byte) completely overlapping store-to-load forwarding (-O3
    code for the wl>ws=>wl case) is significantly slower on those
    machines where the non-vectorized counterpart is fast (Zen4, Zen3,
    Gracemont), but on a number of uarchs the non-vectorized counterpart
    has the same slowdown.

    This is an argument that all µArchitectures should perform as close
    to optimal as possible on a SINGLE code sequence. Compiler output
    code should be near-optimal on every implementation. That it is not
    has the fickle finger of fate pointing at ISA and µArch instead of
    compiler or source code.

    * Narrow-to-wide or partially overlapping wide-to-wide store-to-load
    forwarding is very slow and tends to become slower (in cycles) with
    newer generations. It is already slow if the dependency chain ends
    soon after the wide load, and cases involving recurrences tend to be
    even slower.

    This is a FUNDAMENTALLY HARD problem. It is (and will remain being)
    faster to never do this--it is the data-path equivalent of false
    sharing of cache lines in ATOMIC events.

    Forwarding of full registers is already a BigO(n^2) problem
    Forwarding of aligned partial registers a BigO(n^3) problem
    Forwarding of random partial registers a BigO(n^4) problem

    It is NEVER going to be fast (if the rest of the machine is fast).

    * Wide-store-to-narrow-load forwarding is cheap.

    So it seems that unless the compiler has very good knowledge that the
    wide load was not preceded by a recent store to one of the involved addresses, it is better not to vectorize two narrow loads into a wide
    load.

    It is often wise not to vectorize a lot of stuff--especially
    those things which are fundamentally hard.

    - anton
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Feb 5 15:26:22 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks.

    Have you considered how hard it is to track part of a register's
    contents {RATHER THAN "In Its Entirety"} ??

    This is not about registers. This is about memory. The performance
    hit for partial store-to-load forwarding seems to be even higher than
    for partial register stalls.

    However, the Pentium II seems to have reduced the penalty for
    partial-register stalls compared to the Pentium Pro. What did they
    do?

    Also, concerning partial-register stalls, the zero-extending semantics
    of AMD64 prevented that for 32-bit writes followed by 64-bit loads of
    GPRs. But the AVX and AVX-512 semantics merge a partial-register
    write (e.g., of an XMM register on AVX-256) with the current content.
    I think several microarchitectures have an optimization of that if the
    high bits are 0.

    in the next iteration of bubble-sort two 4-byte loads are
    auto-vectorized into an 8-byte load, but this load only partially
    overlaps the store. This results in taking a slow path in
    store-to-load forwarding. By contrast, without auto-vectorization the
    stores and the loads are 4-byte wide, store-to-load forwarding sees a
    full overlap, and a fast path is taken.

    Why does anyone run bubble sort these days ??

    Because John Hennessy <https://en.wikipedia.org/wiki/John_L._Hennessy>
    used it as a benchmark in the early 1980s, and it has been translated
    to Forth by Marty Fraeman a few years later, which gives us a
    benchmark where we can compare C compilers and Forth compilers. Why
    did Hennessy choose bubble-sort? Quicksort and mergesort were
    well-known at that time. He apparently collected small integer
    benchmarks that were available at the time, and apparently bubble-sort satisfied his requirements even though one would use other sorting
    algorithms unless the data has very specific properties.

    In this case bubble-sort highlights a particular performance weakness
    of a particular compiler on current microarchitectures, but I expect
    that similar things happen in other code, even if they do not happen
    frequently enough to stick out like a sore thumb as it happens for
    bubble-sort.

    E.g., in stack-based interpreters you will often have two writes to
    the stack followed by loads of two stack items. If gcc
    auto-vectorizes the two loads into a single wide load, the wide load
    will probably see similar slowdowns.

    Testing compilers with bubble sort is akin to taking a model-T to
    race against top fuel dragsters.

    Shouldn't a competent compiler compile bubble-sort to fast code, too?
    gcc -O certainly does. clang -O3 does, too. gcc -O3
    -fno-tree-vectorize does, too. But gcc -O3 falls into the partial store-to-load forwarding pitfall.

    * Wide (16-byte) completely overlapping store-to-load forwarding (-O3
    code for the wl>ws=>wl case) is significantly slower on those
    machines where the non-vectorized counterpart is fast (Zen4, Zen3,
    Gracemont), but on a number of uarchs the non-vectorized counterpart
    has the same slowdown.

    This is an argument that all µArchitectures should perform as close
    to optimal as possible on a SINGLE code sequence. Compiler output
    code should be near-optimal on every implementation. That it is not
    has the fickle finger of fate pointing at ISA and µArch instead of
    compiler or source code.

    What problem do you see in the ISA? How would you prevent it?

    Concerning the uArch, sure it would be cool if the heroic
    optimizations that make 8-byte fully overlapping store-to-load
    forwarding fast on Zen3 and Zen4 and, in many cases, also on other
    uArchs, would also apply to 16-byte fully overlapping store-to-load
    forwarding, but if you ask me whether I prefer Zen3/4 behaviour for
    the wl>ws=>wl case to the Rocket Lake case, I actually prefer the
    Zen3/4 behaviour:

    Zen 4      | Zen 3     |Rocket Lake|
     -O   -O3  |  -O   -O3 |  -O   -O3 | dependencies
    3.09  8.80 |3.18  8.87 |5.12  5.04 | wl>ws=>wl (recurrence), nl>ns (no recurrence)

    Zen3/4 gives us a way to make this case very fast, and has no slow
    cases for narrow full-width forwarding, at least not in the cases
    looked at in this microbenchmark. By contrast, Rocket Lake has more
    cases where narrow store-to-load forwarding performs
    worse, but at least its wide-to-wide forwarding is not any worse.

    * Narrow-to-wide or partially overlapping wide-to-wide store-to-load
    forwarding is very slow and tends to become slower (in cycles) with
    newer generations. It is already slow if the dependency chain ends
    soon after the wide load, and cases involving recurrences tend to be
    even slower.

    This is a FUNDAMENTALLY HARD problem. It is (and will remain being)
    faster to never do this--it is the data-path equivalence to false
    sharing of cache line in ATOMIC events.

    Forwarding of full registers is already a BigO(n^2) problem
    Forwarding of aligned partial registers a BigO(n^3) problem
    Forwarding of random partial registers a BigO(n^4) problem

    It is NEVER going to be fast (if the rest of the machine is fast).

    How would one achieve similar benefits for narrow-to-wide forwarding
    or partial wide-to-wide forwarding? One possibility would be to split
    the wide load into several loads (but that has lots of possible cases:
    Consider two byte stores to non-adjacent addresses followed by a
    512-bit load that includes both of these byte store addresses),
    possibly using a history-based predictor for determining whether the
    load is going to overlap a recently-stored-to address range; there
    already is (or was in an Intel processor, at one time, around Haswell
    or so) such a predictor to allow alias detection for full-width
    store-to-load forwarding, maybe that could be enhanced for the partially-overlapping case. Still it seems hard to achieve that, and
    the benefit is probably small on most programs.

    OTOH, the compiler has a hard time knowing whether some recent store
    has been to overlapping memory, so a hardware solution would certainly
    be ideal, just as it has been in other cases where the compiler has insufficient knowledge (e.g., branch prediction).

    But given what should a compiler that targets present hardware do? I
    think it should be more reluctant in applying this kind of
    vectorization, which I have found to be called SLP (superword-level parallelism) and can be disabled in gcc with -fno-tree-slp-vectorize.

    It is often wise not to vectorize a lot of stuff--especailly
    those things which are fundamentally hard.

    Exactly.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 5 19:20:02 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 3317: '\xC2'" while decoding:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks.

    Have you considered how hard it is to track part of a register's
    contents {RATHER THAN "In Its Entirety"} ??

    This is not about registers. This is about memory. The performance
    hit for partial store-to-load forwarding seems to be even higher than
    for partial register stalls.

    Perhaps there is a rule in the memory model that individual portions
    of a cache line have to be updated (written) in program order, and
    the delay you see is the ST queue waiting for data to arrive.

    However, the Pentium II seems to have reduced the penalty for partial-register stalls compared to the Pentium Pro. What did they
    do?

    Went BigO(n^3) on forwarding.

    Also, concerning partial-register stalls, the zero-extending semantics
    of AMD64 prevented that for 32-bit writes followed by 64-bit loads of
    GPRs. But the AVX and AVX-512 semantics merge a partial-register
    write (e.g., of an XMM register on AVX-256) with the current content.
    I think several microarchitectures have an optimization of that if the
    high bits are 0.

    in the next iteration of bubble-sort two 4-byte loads are
    auto-vectorized into an 8-byte load, but this load only partially
    overlaps the store. This results in taking a slow path in
    store-to-load forwarding. By contrast, without auto-vectorization the
    stores and the loads are 4-byte wide, store-to-load forwarding sees a
    full overlap, and a fast path is taken.

    Why does anyone run bubble sort these days ??

    Because John Hennessy <https://en.wikipedia.org/wiki/John_L._Hennessy>
    used it as a benchmark in the early 1980s, and it has been translated
    to Forth by Marty Fraeman a few years later, which gives us a
    benchmark where we can compare C compilers and Forth compilers. Why
    did Hennessy choose bubble-sort? Quicksort and mergesort were
    well-known at that time. He apparently collected small integer
    benchmarks that were available at the time, and apparently
    bubble-sort satisfied his requirements, even though one would use
    other sorting algorithms unless the data has very specific
    properties.

    In this case bubble-sort highlights a particular performance weakness
    of a particular compiler on current microarchitectures, but I expect
    that similar things happen in other code, even if they do not happen
    frequently enough to stick out like a sore thumb, as they do for
    bubble-sort.

    E.g., in stack-based interpreters you will often have two writes to
    the stack followed by loads of two stack items. If gcc
    auto-vectorizes the two loads into a single wide load, the wide load
    will probably see similar slowdowns.

    Here is where HW is a better auto-vectorizer than SW. HW can run the
    2 accesses to the stack as their RAW hazards resolve, whereas a
    single wider read has to wait for both portions to be written.
    Thus, the data-flow dependencies are easier without throwing SIMD
    at the problem(s).

    Testing compilers with bubble sort is akin to taking a model-T to
    race against top fuel dragsters.

    Shouldn't a competent compiler compile bubble-sort to fast code, too?

    A fast BigO(n^2) algorithm can never compete with a mediocre
    BigO(n*ln(n)) algorithm.

    gcc -O certainly does. clang -O3 does, too. gcc -O3
    -fno-tree-vectorize does, too. But gcc -O3 falls into the partial
    store-to-load forwarding pitfall.

    What happens in the quicksort() compiled code ??

    * Wide (16-byte) completely overlapping store-to-load forwarding (-O3
    code for the wl>ws=>wl case) is significantly slower on those
    machines where the non-vectorized counterpart is fast (Zen4, Zen3,
    Gracemont), but on a number of uarchs the non-vectorized counterpart
    has the same slowdown.

    This is an argument that all μArchitectures should perform as close
    to optimal as possible on a SINGLE code sequence. Compiler output
    code should be near-optimal on every implementation. That it is not
    has the fickle finger of fate pointing at ISA and μArch instead of
    compiler or source code.

    What problem do you see in the ISA?

    Too many additions over the decades make it impossible for the compiler
    to guess correctly all the time, and so many mostly-compatible additions
    have made it wearisome for the programmer to understand it all just so
    he can properly inform the compiler of this idiosyncrasy here and that
    one there.

    How would you prevent it?

    not use x86....

    Concerning the uArch, sure it would be cool if the heroic
    optimizations that make 8-byte fully overlapping store-to-load
    forwarding fast on Zen3 and Zen4 and, in many cases, also on other
    uArchs, would also apply to 16-byte fully overlapping store-to-load
    forwarding. But if you ask me whether I prefer the Zen3/4 behaviour
    for the wl>ws=>wl case to the Rocket Lake behaviour, I actually
    prefer the Zen3/4 behaviour:

    Zen 4        | Zen 3        | Rocket Lake  |
    -O      -O3  | -O      -O3  | -O      -O3  | dependencies
    3.09    8.80 | 3.18    8.87 | 5.12    5.04 | wl>ws=>wl (recurrence), nl>ns (no recurrence)

    Zen3/4 gives us a way to make this case very fast, and has no slow
    cases for narrow full-width forwarding, at least not in the cases
    looked at in this microbenchmark. By contrast, Rocket Lake has more
    cases where narrow store-to-load forwarding performs
    worse, but at least its wide-to-wide forwarding is not any worse.

    * Narrow-to-wide or partially overlapping wide-to-wide store-to-load
    forwarding is very slow and tends to become slower (in cycles) with
    newer generations. It is already slow if the dependency chain ends
    soon after the wide load, and cases involving recurrences tend to be
    even slower.

    This is a FUNDAMENTALLY HARD problem. It is (and will remain)
    faster to never do this--it is the data-path equivalent of false
    sharing of a cache line in ATOMIC events.

    Forwarding of full registers is already a BigO(n^2) problem
    Forwarding of aligned partial registers a BigO(n^3) problem
    Forwarding of random partial registers a BigO(n^4) problem

    It is NEVER going to be fast (if the rest of the machine is fast).

    How would one achieve similar benefits for narrow-to-wide forwarding
    or partial wide-to-wide forwarding?

    Remove it from the compiler, and make HW do something about it.

    It is often wise not to vectorize a lot of stuff--especially
    those things which are fundamentally hard.

    Exactly.

    - anton
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jan 31 18:42:49 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks.

    What is the PR number?

    By now you should know that I consider gcc bug reports a waste of
    time. I last told you that in
    <2025Jul15.080403@mips.complang.tuwien.ac.at> and gave PR93811 as an
    example where I have wasted my time with creating a PR, and the status
    of this PR has not changed in the meantime.

    You seem to think that it is worthwhile creating gcc bug reports, so
    go ahead and create one yourself. I think the web page contains all information necessary, but if you miss something, let me know.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 1 13:18:17 2026
    From Newsgroup: comp.arch

    On Sun, 1 Feb 2026 09:16:27 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    I had above-zero success rate with gcc bug reports related to
    compiler pessimization. May be, 10%. May be even 15%. I didn't
    really count.

    Maybe some figures to put this into perspective.

    In 2025, 593 missed-optimization bugs were closed, most of them
    marked as fixed, 528 new ones were submitted. As of today, there
    are 3672 missed-optimization bugs open, so we are looking at around
    a 6 year average turnover.

    97 missed-optimization regressions were submitted in 2025, with
    174 of them closed, with 318 missed-optimization regressions open
    right now, so it is more of a two-year average turnover (and
    there seems to be progress in reducing those).

    I chose 2025 because it is easy to search for; it does not
    correspond to a gcc release cycle.


    I very rarely submit "missed optimization" bugs.
    As far as I am concerned, missed optimization is not a bug, it is
    business as usual, with the hope of improvement in the future.
    My cases are nearly always "compiler tries to be too smart with
    horrible consequences" rather than "compiler is too stupid".

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 15:11:44 2026
    From Newsgroup: comp.arch

    On 2/4/26 8:48 PM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    I have seen big slowdowns (factor 5.7 on a Rocket Lake with gcc-14.2)
    from gcc's auto-vectorization for the bubble-sort benchmark of John
    Hennessy's collection of small integer benchmarks.

    Have you considered how hard it is to track part of a register's
    contents {RATHER THAN "In Its Entirety"} ??

    If half-register concatenation is not supported, I think one
    could avoid a significant amount of complexity. Even supporting
    concatenation only for stores might be similar to ARM's store
    pair support. If a RAT is used, one would double the number of
    entries. If full-sized values are allocated two distinct half-
    sized physical registers (to reduce physical size), they would
    be similar to paired register operations. If half-sized values
    are allocated to full-sized registers, a little simplification
    might be possible. Allocating and freeing twice as many physical
    registers per cycle seems rather painful.

    With full-sized registers, the lower of a pair of RAT entries
    could be used for full-sized and lower-half operands allowing
    concatenation errors to be detected when a full-sized operand's
    lower-half RAT entry does not match the upper half (or the upper
    half if not zero depending on whether one writes full values by
    replication or "zero extension").

    If the point is to provide more named values, concatenation does
    not seem to be important.

    Consider a string of x86 instructions that write to different bytes
    of the same register.
    a) do you want to blow-up the forwarding path by 4× ??
    b) do you want each forwarding path-portion to select between
    4 places in any result ??

    If one limits supported operations to add/subtract and the
    bitwise logical operations, it seems that one could handle the
    operations that only use upper or lower halves for sources and
    destinations without much extra logic. Operations with upper and
    lower sources and an upper destination might only need to
    convert the operations to a shift-by-32-and-operate of the lower
    operand and perhaps some of the time for carry propagation could
    be taken to cover the shift latency? Mixed inputs and lower
    output would not be able to hide extra shift latency; not
    providing such operations or having such have two-cycle latency
    to allow a post-calculation 32-bit shift might be acceptable.

    (Variable shift might be allowed if the variable is in a lower
    half.)

    My guess is no. Thus, quit using partial registers and get on with life.

    I have certainly not thought deeply about the costs and probably
    lack sufficient knowledge of hardware design to make reasonable
    estimates. Yet for z/Architecture IBM chose to support half-
    sized operations to increase the number of independent values
    that can be in registers, admittedly because of the encoding
    constraint of a legacy architecture. If I recall correctly, AMD
    also chose to support double-"native" width of SIMD by using
    twice as many operands, which is a purely microarchitectural
    choice prioritizing resources for "native" width while
    supporting instructions with twice the SIMD width.
    --- Synchronet 3.21b-Linux NewsLink 1.2