• IA-64 (was: Tonights Tradeoff)

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Feb 21 16:18:11 2026
    From Newsgroup: comp.arch

    jgd@cix.co.uk (John Dallman) writes:
    [some unattributed source writes:]
    In a way, it showed that they screwed up the design pretty hard
    that x86-64 ended up being the faster and more efficient option...

    They did. They really did.

    I guess one question is if they had any other particular drawbacks
    other than, say:
    Their code density was one of the worst around;
    128 registers is a little excessive;
    128 predicate register bits is a bit WTF;

    IA-64 certainly had significantly larger code sizes than others, but I
    think that they expected it and found it acceptable. And a
    software-pipelined loop on IA-64 is probably smaller than an
    auto-vectorized loop on AMD64+AVX, thanks to their predication
    features that allowed them to avoid extra code for the ramp-up and
    ramp-down of the software-pipelined loops, whereas auto-vectorized
    code usually requires ramp-down, plus quite a bit of other overhead.
    Clang seems to be significantly better at keeping such loops small
    than gcc.

They had two much bigger problems, though.

They'd correctly understood that the low speed of affordable dynamic RAM
as compared to CPUs running at hundreds of MHz was the biggest barrier
to making code run fast.

They may have "understood" it, as have many others who have read
"Hitting the memory wall", but I disputed already in 2001 that this is
correct in general:
    https://www.complang.tuwien.ac.at/anton/memory-wall.html

    [Looking at that text again, I see the claims "On current
    high-performance machines this time [for compulsory cache misses] is
    limited to a few seconds if the process stays in physical RAM" and
    "Memory bandwidth tends to grow with memory size, so this won't change
    much"; as it happens, I recently measured the memory bandwidth of my
    PC with a Ryzen 8700G and 64GB RAM (whereas my machine in 2001 had
    192MB): it is 64GB/s when read sequentially, i.e., it can read all its
    RAM in 1s. For random accesses, performance is significantly worse,
    however.]

McKee wrote a retrospective of the paper in 2004 <http://svmoore.pbworks.com/w/file/fetch/59055930/p162-mckee.pdf>, but
did not cite my reaction (though she apparently had a wealth of
reactions to select from, so that's excusable).

Anyway, my point is that there are many programs that are not
memory-bound, and the convenience benchmarks I usually use (a LaTeX run
and Gforth's small benchmarks) are among them. Itanium II sucks on
    them compared to its contemporary (or even older) competition; numbers
    are times in seconds (lower means faster):

LaTeX <https://www.complang.tuwien.ac.at/franz/latex-bench>
- HP workstation 900MHz Itanium II, Debian Linux:                  3.528
- Athlon (Thunderbird) 1200C, VIA KT133A, PC133 SDRAM, RedHat 7.1: 1.68
- Pentium 4 2.66GHz, 512KB L2, Debian 3.1:                         1.31
- Athlon 64 3200+, 2000MHz, 1MB L2, Fedora Core 1 (64-bit):        0.76

Gforth <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/Benchres>:
sieve bubble matrix fib   fft  ; release ; CPU; gcc
0.708 0.780 0.484 1.028 0.552  ; 20250201; Itanium II 900MHz (HP rx2600); gcc-4.3.2
0.192 0.276 0.108 0.360        ; 0.7.0   ; K8 2GHz (Opteron 270); gcc-4.3.1

Their solution was to have the compiler schedule loads
well in advance. They assumed, without evidence, that a compiler with
plenty of time to think could schedule loads better than hardware doing
it dynamically. It's an appealing idea, but it's wrong.

    Actually, other architectures also added prefetching instructions for
    dealing with that problem. All I have read about that was that there
    were many disappointments when using these instructions. I don't know
    if there were any successes, and how frequent they were compared to
    the disappointments. So I don't see that IA-64 was any different from
    other architectures in that respect.

    My explanations for the disappointments are:

* Hardware prefetchers (typically stride-based) already cover a lot of
the cases where prefetch instructions would otherwise help. In that
case memory-bound programs are limited by bandwidth, not by latency.

* When dealing with pointer-chasing in linked lists and such, unless
the linked list entries usually happen to be a fixed stride apart
(where the stride predictor helps), the pointer-chasing limits the
performance in any case; for an in-order uarch, prefetching or
scheduling the fetch of the next pointer ahead may save a few cycles,
but typically few compared to the memory access latency; for OoO
uarchs, OoO usually results in a bunch of instructions (among
them the load of the next pointer) waiting for the result of the RAM
access.
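For illustration, a minimal C sketch of such pointer-chasing
prefetching (a hypothetical `struct node`, and GCC/Clang's
`__builtin_prefetch`). The prefetch address is itself the result of a
dependent load, which is exactly why the savings are small:

```c
#include <stddef.h>

struct node { struct node *next; long payload; };

/* Sum a linked list, prefetching two nodes ahead.  The address
   n->next->next is only known once the load of n->next completes, so
   the chain of dependent loads stays serialized; the prefetch can at
   best shave a few cycles off each step, not the full RAM latency. */
long sum_list(struct node *n)
{
    long sum = 0;
    while (n != NULL) {
        if (n->next != NULL)
            __builtin_prefetch(n->next->next, 0, 3); /* read, keep in cache */
        sum += n->payload;
        n = n->next;
    }
    return sum;
}
```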

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

It might be possible to do that effectively in a single-core,
single-thread, single-task system that isn't taking many (if any)
interrupts. In a multi-core system, running a complex operating system,
several multi-threaded applications, and taking frequent interrupts and
context switches, it is _not possible_. There is no knowledge of any of
the interrupts, context switches or other applications at compile time,
so the compiler has no idea what is in cache and what isn't.

    I don't think that's a big issue. Sure, a context switch in one core
    results in the need to refill most of the L1 and L2 caches, and maybe
    even a part of the L3 cache, but that's the cost of the context
    switches, and the prefetching (whether by the compiler or the
    programmer) is usually not designed to mitigate that. That's because
context switches are rare; e.g., when I run an 8-thread DGEMM (using
libopenmp) for 16000x16000 matrices on my 8-core desktop system (with
other stuff working (but consuming little CPU) at the same time), I see:

    # perf stat -e sched:sched_switch,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions openblas 16000

    Performance counter stats for 'openblas 16000':

         14811      sched:sched_switch    #   82.470 /sec
     179592.12 msec task-clock            #    7.444 CPUs utilized
         14811      context-switches      #   82.470 /sec
             5      cpu-migrations        #    0.028 /sec
          9789      page-faults           #   54.507 /sec
  548255317659      cycles                #    3.053 GHz
  936546607037      instructions          #    1.71  insn per cycle

    24.125989128 seconds time elapsed

    173.502827000 seconds user
    6.016053000 seconds sys

    I.e., one context switch every 37M cycles (or 63M instructions).
    Refilling the 1MB L2 (Zen4) from RAM at 64GB/s costs about 47k cycles,
    i.e., less than 0.2% of the time between context switches.
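As a sanity check on those figures, the arithmetic (using the round
numbers from the measurement above: 1MB L2, 64GB/s sequential
bandwidth, ~3.05GHz clock, 37M cycles per context switch):

```c
/* Back-of-the-envelope check of the L2-refill cost quoted above. */
double l2_refill_cycles(double l2_bytes, double bw_bytes_per_s,
                        double clock_hz)
{
    /* time to stream in the whole L2, expressed in core cycles */
    return l2_bytes / bw_bytes_per_s * clock_hz;
}

double refill_fraction(double refill_cycles, double cycles_per_switch)
{
    /* fraction of the inter-switch interval spent refilling */
    return refill_cycles / cycles_per_switch;
}
```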

    [You may wonder why the processor is only running at 3GHz; I have
    power-limited it to 25W; it tends to do the same task close to 5GHz in
    its default setting (88W power limit).]

    Speculative execution addresses that problem quite effectively.

    I have certainly read about interesting results for binary search (in
    an array) where the branching version outperforms the branchless
    version because the branching version speculatively executes several
    followup loads, and even in unpredictable cases the speculation should
    be right in 50% of the cases, resulting in an effective doubling of memory-level parallelism (but at a much higher cost in memory
    subsystem load). But other than that, I don't see that speculative
    execution helps with memory latency.
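To make the comparison concrete, here is roughly what the two variants
look like in C (a generic lower-bound search, not the code from any
particular study). In the branchy version the comparison is a
conditional branch, so an OoO core can speculate past it and issue the
loads of later probes early; in the branchless version the selection
compiles to a conditional move, so every probe address is
data-dependent on the previous load and the misses serialize:

```c
#include <stddef.h>

/* Branchy lower bound: first index i with a[i] >= key.  The if()
   becomes a conditional branch; speculating past it is what provides
   the extra memory-level parallelism discussed above. */
size_t lower_bound_branchy(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Branchless variant: the selection typically compiles to a
   conditional move, so each probe address depends on the previous
   load's data and the cache misses are serialized. */
size_t lower_bound_branchless(const int *a, size_t n, int key)
{
    if (n == 0)
        return 0;
    const int *base = a;
    while (n > 1) {
        size_t half = n / 2;
        base += (base[half - 1] < key) ? half : 0;
        n -= half;
    }
    return (size_t)(base - a) + (base[0] < key);
}
```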

    OoO helps in several ways: it will do some work in the shadow of the
    load (although the utilization will still be abysmal even with
    present-day schedulers and ROBs [1]); but more importantly, it can
    dispatch additional loads that may also miss the cache, resulting in
    more memory-level parallelism. To some extent that also happens with
    in-order uarchs where the in-order aspect only takes effect when using
    the result of an instruction, but OoO provides more cases (especially
    if you do not know which loads will miss, which often is the case).

    [1] With 50ns DRAM latency (somewhat optimistic) and 5GHz CPU with 6
    rename slots per cycle, 1500 instructions could be executed (at its
    most extreme) while a load is served from DRAM, but the scheduling
    (and non-scheduler) slots tend to be ~100, and the ROB entries tend to
    be ~500.

    But I don't see that multi-core, multi-threaded, multi-tasking makes a significant difference here.

    We don't
    have a better way, almost thirty years after Itanium design decisions
    were taken. They didn't want to do speculative execution

    They wanted to do it (and did it) in the compiler; the corresponding architectural feature is IIRC the advanced load.

and they chose
an instruction format and register set that made adding it later hard.

The instruction format makes no difference. Having so many registers
may have made it harder than otherwise, but SPARC also used many
registers, and we have seen OoO implementations (and discussed them a
few months ago). The issue is that speculative execution and OoO
make all the EPIC features of IA-64 unnecessary, so if they cannot do
    a fast in-order implementation of IA-64 (and they could not), they
    should just give up and switch to an architecture without these
    features, such as AMD64. And Intel did, after a few years of denying.
    The remainder of IA-64 was just to keep it alive for the customers who
    had bought into it.

The other problem was that they had three (or six, or twelve) in-order
pipelines running in parallel. That meant the compilers had to provide
enough ILP to keep those pipelines fed, or they'd just eat cache capacity
and memory bandwidth executing no-ops ...

    No, IA-64 has groups (sets of instructions that are allowed to start
    within the same cycle) down to one instruction. Many people think
    that the bundles (3 instructions encoded in 128 bits) are the same as
    groups, but that's not the case. There can be stops (boundaries
    between groups) within a bundle. The only thing the bundle format
    limits is that branch targets may only start at a 128-bit boundary.

    The IA-64 instruction format is still not particularly compact, but
    it's not as bad as you indicate.

    They didn't have a general way to extract enough ILP.

    For a few cases, they have. But the problem is that these cases are
    also vectorizable.

    Their major problem was that they did not get enough ILP from
    general-purpose code.

    And even where they had enough ILP, OoO CPUs with SIMD extensions ate
    their lunch.

    I guess it is more of an open question of what would have happened,
    say, if Intel had gone for an ISA design more like ARM64 or RISC-V
    or something.

ARM64 seems to me to be the product of a lot more experience with
speculatively-executing processors than was available in 1998.

    OoO processors with speculative execution have been widely available
    since 1995 (Pentium Pro, PA 8000; actually PPC 603 preceded that in
    1994, but made less of a splash than the Pentium Pro, and the ES/9000
    H2 was available in 1991, but was confined to the IBM world; did they
    ever submit SPEC results for that?). And the experience is that EPIC architectural features (beyond SIMD) are unnecessary, and the result
    of that is indeed ARM64, but also AMD64 and RISC-V.

RISC-V has not demonstrated really high performance yet, and it's been
around long enough that I'm starting to doubt it ever will.

    In a world where we see convergence on fewer and fewer architecture
    styles and on fewer and fewer architectures, you only see the
    investment necessary for high-performance implementations of a new
    architecture if there is a very good reason not to use one of the
    established architectures (for ARM T32 and ARM A64 the smartphone
    market was that reason). It may be that politics will provide that
    reason for another architecture, but even then it's hard. But RISC-V
    seems to have the most mindshare among the alternatives, so if any
    architecture will catch up, it looks like the best bet.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Feb 21 20:28:00 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    jgd@cix.co.uk (John Dallman) writes:
    [some unattributed source writes:]
    ----------------
    Actually, other architectures also added prefetching instructions for
    dealing with that problem. All I have read about that was that there
    were many disappointments when using these instructions. I don't know
    if there were any successes, and how frequent they were compared to
    the disappointments. So I don't see that IA-64 was any different from
    other architectures in that respect.

    If there had been any significant successes, you would have heard about them.

    -------------------------------
    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    Array and Matrix scientific codes with datasets bigger than cache.

    ----------------------------
    I have certainly read about interesting results for binary search (in
    an array) where the branching version outperforms the branchless
    version because the branching version speculatively executes several
    followup loads, and even in unpredictable cases the speculation should
    be right in 50% of the cases, resulting in an effective doubling of memory-level parallelism (but at a much higher cost in memory
    subsystem load). But other than that, I don't see that speculative
    execution helps with memory latency.

At the cost of opening the core up to Spectre-like attacks.
------------------------
The instruction format makes no difference. Having so many registers
may have made it harder than otherwise, but SPARC also used many
registers, and we have seen OoO implementations (and discussed them a
few months ago). The issue is that speculative execution and OoO
make all the EPIC features of IA-64 unnecessary, so if they cannot do
    a fast in-order implementation of IA-64 (and they could not), they
    should just give up and switch to an architecture without these
    features, such as AMD64. And Intel did, after a few years of denying.
    The remainder of IA-64 was just to keep it alive for the customers who
    had bought into it.

    IA-64 was to prevent AMD (and others) from clones--that is all. Intel/HP
    would have had a patent wall 10 feet tall.

    It did not fail by being insufficiently protected, it failed because it
    did not perform.
    -----------------
    For a few cases, they have. But the problem is that these cases are
    also vectorizable.

    Their major problem was that they did not get enough ILP from
    general-purpose code.

    And even where they had enough ILP, OoO CPUs with SIMD extensions ate
    their lunch.

    And that should have been the end of the story. ... Sorry Ivan.


    - anton
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Feb 22 03:00:59 2026
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
At the time of conception, there were many arguments that {sooner or
later} compilers COULD figure stuff like this out.

I can't remember seeing such arguments coming from compiler people, tho.

    Actually, the IA-64 people could point to the work on VLIW (in
    particular, Multiflow (trace scheduling) and Cydrome (software
    pipelining)), which in turn is based on the work on compilers for
    microcode.

    I knew the Multiflow people pretty well when I was at Yale. Trace
    scheduling was inspired by the FPS AP-120B, which had wide
    instructions issuing multiple operations and was extremely hard to
    program efficiently.

    Multiflow's compiler worked pretty well and did a good job of static
    scheduling memory operations when the access patterns weren't too data dependent. It was good enough that Intel and HP both licensed it and
    used it in their VLIW projects. It was a good match for the hardware
    in the 1980s.

    But as computer hardware got faster and denser, it became possible to
    do the scheduling on the fly in hardware, so you could get comparable performance with conventional instruction sets in a microprocessor.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Feb 22 09:16:00 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
But as computer hardware got faster and denser, it became possible to
do the scheduling on the fly in hardware, so you could get comparable
performance with conventional instruction sets in a microprocessor.

    Actually, OoO microprocessors appeared before IA-64 implementations
    were originally planned to be released, and were implemented in larger processes, i.e., they consumed fewer hardware resources.

    The Pentium Pro was implemented with 5.5M transistors in a 0.35um
    process (with 8+8KB L1 cache) and a die area of 306mm^2 (probably
    including the separate L2 cache chip). Later Intel released the
    Klamath Pentium II also in 0.35um, but with 16+16KB L1, with 7.5M
    transistors and a die size of 203mm^2 (the die should be larger than
    the CPU die of the Pentium Pro, that's why I think that the Pentium
    Pro number includes the L2 cache die); die size numbers from https://pc.watch.impress.co.jp/docs/2008/1027/kaigai_5.pdf

    The PA-8000 is a 4-wide OoO CPU implemented with 3.8M transistors in a
    0.5um process in 337.69mm^2. It has all caches off-chip.

    The Merced Itanium and McKinley Itanium II were 6-wide and implemented
    in 180nm, the same feature size as the Willamette Pentium 4 and
    Thunderbird Athlon. The Merced is reported as having 25.4M
    transistors (with 16+16KB L1 and 96KB of L2 cache plus 295M
    transistors for 4MB external L3 cache). The McKinley is reported as
    having a die size of 421mm^2 and a transistor count of 221M (with
    16+16KB L1, 256KB L2 and 3MB L3).

    Looking at <https://en.wikipedia.org/wiki/Itanium#Design_and_delays:_1994%E2%80%932001>,
    I read:

|When Merced was floorplanned for the first time in mid-1996, it turned
|out to be far too large [...]. The designers had to reduce the
|complexity (and thus performance) of subsystems, including the x86
|unit and cutting the L2 cache to 96 KB.[d] Eventually it was agreed
|that the size target could only be reached by using the 180 nm process
|instead of the intended 250 nm.

    For comparison, in the same 0.18um process the Willamette included
    8+8KB L1 and 256KB L2 cache in 217mm^2, and in the same-sized process
    the AMD Thunderbird and Palomino included 64+64KB L1 and 256KB L2
    cache.

Unfortunately, the caches dominate the transistor counts, so one
cannot tell how many transistors were needed for implementing the data
path and control stuff.

    We do have in-order CPUs such as the 4-wide 21164: 9.3M transistors
    (with 8+8KB L1 and 96KB L2), 299mm^2 in a 0.5um process.

    So the OoOness of the PA-8000 may have cost around as much area as the
    caches of the 21164 (and the higher clock rate of the 21164 compared
    to the PA-8000 and the Pentium Pro supported the theory that OoO is
    inherently slower).

    Comparing the 21164 to the Merced, the L2 cache sizes are the same and
    the L1 size of Merced is twice that of the 21164, yet the Merced takes
    2.7x the number of transistors of the 21164, and probably a lot of the additional transistors are not for the additional L1 caches. It seems
    that the architectural features and/or maybe the 6-wide implementation
    of the Merced cost a lot of transistors and thus die area, whereas a
    sales pitch for EPIC was that thanks to the explicit grouping of
    instructions, the supposedly quadratic cost of checking for register-to-register dependences would be eliminated, resulting in
    more area for additional functional units.

    Bottom line: If EPIC is easier to fit on a microprocessor, there is no
    evidence for that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Feb 22 13:17:30 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [1) stride-based. 2) pointer-chasing]
    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    Array and Matrix scientific codes with datasets bigger than cache.

    The dense cases are covered by stride-based hardware predictors, so
    they are not "otherwise". I am not familiar enough with sparse
    scientific codes to comment on whether they are 1), 2), or
    "otherwise".

    I have certainly read about interesting results for binary search (in
    an array) where the branching version outperforms the branchless
    version because the branching version speculatively executes several
    followup loads, and even in unpredictable cases the speculation should
    be right in 50% of the cases, resulting in an effective doubling of
    memory-level parallelism (but at a much higher cost in memory
    subsystem load). But other than that, I don't see that speculative
    execution helps with memory latency.

At the cost of opening the core up to Spectre-like attacks.

    There may be a way to avoid the side channel while still supporting
    this scenario. But I think that there are better ways to speed up
    such a binary search:

    Here software prefetching can really help: prefetch one level ahead (2 prefetches), or two levels ahead (4 prefetches), three levels (8
    prefetches), or four levels (16 prefetches), etc., whatever gives the
    best performance (which may be hardware-dependent). The result is a
    speedup of the binary search by (in the limit) levels+1.

By contrast, the branch prediction "prefetching" provides a factor 1.5
at twice the number of loads when one branch is predicted, 1.75 at 3x
the number of loads when two branches are predicted, etc. up to a
speedup factor of 2 with an infinite number of loads and predicted branches;
    that's for completely unpredictable lookups, with some predictability
    the branch prediction approach performs better, and with good
    predictability it should outdo the software-prefetching approach for
    the same number of additional memory accesses.
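A sketch of the one-level-ahead variant (2 prefetches per step) in C,
again assuming GCC/Clang's `__builtin_prefetch`; deeper lookahead
follows the same pattern with 4, 8, ... prefetches per step:

```c
#include <stddef.h>

/* Lower-bound binary search that prefetches both probes the *next*
   iteration can touch, so one of the two prefetched lines is always
   the one that will be needed (one level ahead, 2 prefetches/step). */
size_t lower_bound_pf(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        size_t left  = lo + (mid - lo) / 2;           /* next probe if going left */
        size_t right = mid + 1 + (hi - mid - 1) / 2;  /* next probe if going right */
        __builtin_prefetch(&a[left], 0, 1);
        if (right < n)
            __builtin_prefetch(&a[right], 0, 1);
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

The prefetches do not change the result, only when the cache lines
start moving; whether one or more levels of lookahead pays off is, as
said above, hardware-dependent.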

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Feb 22 17:05:23 2026
    From Newsgroup: comp.arch

    On Sun, 22 Feb 2026 13:17:30 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [1) stride-based. 2) pointer-chasing]
    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come
    to mind, but are they really common, apart from programming
    classes?

    Array and Matrix scientific codes with datasets bigger than cache.

    The dense cases are covered by stride-based hardware predictors, so
    they are not "otherwise". I am not familiar enough with sparse
    scientific codes to comment on whether they are 1), 2), or
    "otherwise".

BLAS Level 3 is not particularly external/LLC bandwidth intensive even
without hardware predictors. The overwhelming majority of data is
served from the L2 cache.
    That's with classic SIMD. It's possible that with AMX units it's no
    longer true.
    I have certainly read about interesting results for binary search
    (in an array) where the branching version outperforms the
    branchless version because the branching version speculatively
    executes several followup loads, and even in unpredictable cases
    the speculation should be right in 50% of the cases, resulting in
    an effective doubling of memory-level parallelism (but at a much
    higher cost in memory subsystem load). But other than that, I
    don't see that speculative execution helps with memory latency.

At the cost of opening the core up to Spectre-like attacks.

    There may be a way to avoid the side channel while still supporting
    this scenario. But I think that there are better ways to speed up
    such a binary search:

    Here software prefetching can really help: prefetch one level ahead (2 prefetches), or two levels ahead (4 prefetches), three levels (8
    prefetches), or four levels (16 prefetches), etc., whatever gives the
    best performance (which may be hardware-dependent). The result is a
    speedup of the binary search by (in the limit) levels+1.

By contrast, the branch prediction "prefetching" provides a factor 1.5
at twice the number of loads when one branch is predicted, 1.75 at 3x
the number of loads when two branches are predicted, etc. up to a
speedup factor of 2 with an infinite number of loads and predicted branches;
    that's for completely unpredictable lookups, with some predictability
    the branch prediction approach performs better, and with good
    predictability it should outdo the software-prefetching approach for
    the same number of additional memory accesses.

    - anton
The recent comparisons of branchy vs branchless binary search that we
carried out on the RWT forum seem to suggest that on modern CPUs the
branchless variant is faster even when the table does not fit in LLC.
The branchy variant manages to pull ahead only when TLB misses can't
be served from L2$.
At least that's how I interpreted it.
Here is a result on a very modern CPU:
https://www.realworldtech.com/forum/?threadid=223776&curpostid=223974
And here is older gear:
https://www.realworldtech.com/forum/?threadid=223776&curpostid=223895
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Feb 22 19:20:51 2026
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
At the time of conception, there were many arguments that {sooner or
later} compilers COULD figure stuff like this out.

I can't remember seeing such arguments coming from compiler people, tho.

Actually, the IA-64 people could point to the work on VLIW (in
particular, Multiflow (trace scheduling) and Cydrome (software
pipelining)), which in turn is based on the work on compilers for
microcode.

    I knew the Multiflow people pretty well when I was at Yale. Trace
    scheduling was inspired by the FPS AP-120B, which had wide
    instructions issuing multiple operations and was extremely hard to
    program efficiently.

    Multiflow's compiler worked pretty well and did a good job of static scheduling memory operations when the access patterns weren't too data dependent. It was good enough that Intel and HP both licensed it and
    used it in their VLIW projects. It was a good match for the hardware
    in the 1980s.

    But as computer hardware got faster and denser, it became possible to
    do the scheduling on the fly in hardware, so you could get comparable performance with conventional instruction sets in a microprocessor.

    Not comparable; superior.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Feb 22 20:14:10 2026
    From Newsgroup: comp.arch

    According to Stefan Monnier <monnier@iro.umontreal.ca>:
    particular, Multiflow (trace scheduling) and Cydrome (software
    pipelining)), which in turn is based on the work on compilers for
    microcode.

Of course, compiler people have worked on such problems and solved some
cases. But what I wrote above is that "I can't remember seeing
    ... compiler people" claiming that "{sooner or later} compilers COULD
    figure stuff like this out".

    I recall Multiflow people telling me that trace scheduling did a great job
    of scheduling memory accesses when the patterns were predictable and that it didn't when they weren't, i.e., data dependent.

    Apropos another thread I can believe that IA-64 was obsolete before it was shipped
    for that reason, static scheduling will never keep up with dynamic except in applications where the access patterns are predictable.

    Are there enough applications like that to make VLIWs worth it? Some kinds of DSP?
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Feb 22 23:08:05 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
Apropos another thread I can believe that IA-64 was obsolete before it
was shipped for that reason, static scheduling will never keep up with
dynamic except in applications where the access patterns are
predictable.

Concerning the scheduling, hardware scheduling looked pretty dumb at
the time (always schedule the oldest ready instruction(s)), and
compilers could pick the instructions on the critical path in their
scheduling, but given the scheduling barriers in compilers (e.g.,
calls and returns), and the window sizes in current hardware, even
dumb dynamic scheduling is superior to smart static scheduling.

    Another aspect where hardware is far superior is branch prediction.

    Are there enough applications like that to make VLIWs worth it? Some kinds of DSP?

There have certainly been DSPs from TI (C60 series IIRC) and Philips
(TriMedia) that have VLIW architectures, so at least for a while, VLIW
was competitive there.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Mon Feb 23 01:32:52 2026
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    Apropos another thread I can believe that IA-64 was obsolete before it was shipped
for that reason, static scheduling will never keep up with dynamic except in applications where the access patterns are predictable.

    Concerning the scheduling, hardware scheduling looked pretty dumb at
    the time (always schedule the oldest ready instruction(s)), ...

    I was thinking of memory scheduling. You have multiple banks of memory
    each of which can only do one fetch or store at a time, and the goal
    was to keep all of the banks as busy as possible. If you're accessing
    an array in predictable order, trace scheduling works well, but if
    you're fetching a[b[i]] where b varies at runtime, it doesn't.

    Another aspect where hardware is far superior is branch prediction.

    I gather speculative execution of both branch paths worked OK if the
    branch tree wasn't too bushy. There were certainly ugly details, e.g.,
    if there's a trap on a path that turns out not to be taken.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Feb 23 06:55:08 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    Apropos another thread I can believe that IA-64 was obsolete before it was shipped
for that reason, static scheduling will never keep up with dynamic except in applications where the access patterns are predictable.

    Concerning the scheduling, hardware scheduling looked pretty dumb at
    the time (always schedule the oldest ready instruction(s)), ...

    I was thinking of memory scheduling. You have multiple banks of memory
    each of which can only do one fetch or store at a time, and the goal
    was to keep all of the banks as busy as possible. If you're accessing
    an array in predictable order, trace scheduling works well, but if
    you're fetching a[b[i]] where b varies at runtime, it doesn't.

    More advanced in-order uarchs have dealt with this by submitting
    requests to the load-store unit in-order, performing the requests as
    the resources and the memory model allow, and only letting uses of the
    loads wait for the results (Mitch Alsup has been calling such things
    OoO, too, but that's far from the modern OoO with speculative
    execution and in-order completion; in particular, there is no
    speculative execution).

    Plus, there is the option of inserting prefetch instructions, which
    give the same memory-level parallelism on in-order CPUs as on OoO
    CPUs; for a[b[i]] patterns they might be useful, unless the hardware prefetchers also know how to prefetch that.
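A software-prefetch version of the a[b[i]] pattern could be sketched like this in C, using GCC/Clang's __builtin_prefetch; the prefetch distance of 16 iterations is an arbitrary assumption that would need tuning per machine:

```c
#include <stddef.h>

/* Sketch: software prefetching for the a[b[i]] gather pattern.
   PF_DIST (how far ahead to prefetch) is a tuning assumption. */
#define PF_DIST 16

double gather_sum(const double *a, const int *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            /* arguments: address, 0 = read, 0 = low temporal locality */
            __builtin_prefetch(&a[b[i + PF_DIST]], 0, 0);
        sum += a[b[i]];
    }
    return sum;
}
```

Note that b[i + PF_DIST] itself must already be cheap to access for the prefetch to pay off; with b traversed sequentially, that is exactly what a stride-based hardware prefetcher provides.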

    Moreover, AVX-512 has a gathering load instruction (and a scattering
    store instruction) that were designed for an optimized in-order
    implementation in Knights Ferry.

    Another aspect where hardware is far superior is branch prediction.

    I gather speculative execution of both branch paths worked OK if the
    branch tree wasn't too bushy.

    If one goes that way, if-conversion (conversion to predicated
    execution) looked like the way to go, but it turns a control
    dependency (which vanishes if the branch is predicted correctly) into
    a data dependency. Even without that, pulling instructions from all
    ways up across branches soon runs into resource limitations. Static
    branch prediction is not great (~20% mispredicts when using
    heuristics, ~10% when using profile feedback), but it is still far
    better than assuming that all branches go both ways equally likely.
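In C terms, if-conversion can be sketched with a minimal fragment (an illustrative sketch only; IA-64's predication applies predicate registers to whole instructions rather than relying on conditional moves):

```c
/* Branchy form: a control dependency. It costs nothing on the
   critical path when the branch is predicted correctly. */
int clamp_branchy(int x, int limit)
{
    if (x > limit)
        x = limit;
    return x;
}

/* If-converted form: compilers typically emit a conditional move
   (cmp+cmov on AMD64). There is no branch to mispredict, but the
   result now carries a data dependency on both x and limit. */
int clamp_converted(int x, int limit)
{
    return x > limit ? limit : x;
}
```

Both compute the same value; the trade-off is exactly the one described above: the converted form removes the misprediction risk at the cost of an unconditional data dependency.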

Augustus K. Uht suggested keeping track of the likelihood of
statically not-predicted branches; after following a few predictions,
the likelihood of the path to the currently predicted branch would be
smaller than the likelihood of one of the earlier not-predicted
branches. He suggested that the compiler should use these likelihoods
to guide speculation decisions.

    But in the early 1990s dynamic (hardware) branch prediction started
    producing significantly lower misprediction rates than static and even semi-static branch prediction, and branch predictors have improved
    since then; e.g., for our LaTeX benchmark Zen 4 produces the following
    results:

    1_325_396_218 cycles 4.961 GHz ( +- 0.12% )
    3_565_310_588 instructions 2.69 insn per cycle
    656_903_470 branches 2.459 G/sec ( +- 0.01% )
    8_417_229 branch-misses 1.28% of all branches ( +- 0.21% )

That's not just 1.28% branch mispredictions, but also 2.36 branch mispredictions per 1000 instructions (MPKI), or 423 instructions
    between branch mispredictions on average. This means that the
    hardware scheduler can schedule across hundreds of instructions before
    hitting a scheduling barrier. It also means that all the speculation
    within those 423 instructions will actually be useful, whereas
    speculating on both sides of a branch (or if-conversion) guarantees
    that most of the speculated instructions will be useless.
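The arithmetic behind those figures can be re-derived directly from the counters quoted above (a sketch; the constants are copied from the perf output):

```c
/* Re-deriving the misprediction figures from the Zen 4 counters:
   8_417_229 branch-misses, 3_565_310_588 instructions. */

double mpki(double mispredicts, double instructions)
{
    return mispredicts * 1000.0 / instructions;  /* ~2.36 MPKI */
}

double insns_per_mispredict(double mispredicts, double instructions)
{
    return instructions / mispredicts;           /* ~423 */
}
```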

    But yes, for instructions from behind an if that do not depend on the
    values computed in the if, one can schedule them above the if without duplication along both paths, and this looked like one of the
    smartness advantages of compilers over OoO hardware. But given the
    very high branch prediction accuracies that hardware exhibits, this
advantage is small and, as IA-64 demonstrated, falls far short of outweighing the disadvantages of EPIC.

    There were certainly ugly details, e.g.,
    if there's a trap on a path that turns out not to be taken.

    IA-64 (and EPIC in general) has the advanced load for that; IA-64 also
    does not implement division in hardware (i.e., division by zero is
    checked by software). This means that any trapping can happen once
    the branch is guaranteed to be taken, but the speculative execution
    happens earlier. If static branch prediction worked as well as
    dynamic branch prediction, that would have been one of the important
    parts in making IA-64 have as much ILP as OoO.

    I have actually written a paper on the topic:

    @InProceedings{ertl&krall94,
    author = "M. Anton Ertl and Andreas Krall",
    title = "Delayed Exceptions --- Speculative Execution of
    Trapping Instructions",
    booktitle = "Compiler Construction (CC '94)",
    year = "1994",
    publisher = "Springer LNCS~786",
    address = "Edinburgh",
    month = "April",
    pages = "158--171",
    url = "https://www.complang.tuwien.ac.at/papers/ertl-krall94cc.ps.gz",
    abstract = "Superscalar processors, which execute basic blocks
    sequentially, cannot use much instruction level
    parallelism. Speculative execution has been proposed
    to execute basic blocks in parallel. A pure software
    approach suffers from low performance, because
    exception-generating instructions cannot be executed
    speculatively. We propose delayed exceptions, a
    combination of hardware and compiler extensions that
    can provide high performance and correct exception
    handling in compiler-based speculative execution.
    Delayed exceptions exploit the fact that exceptions
    are rare. The compiler assumes the typical case (no
    exceptions), schedules the code accordingly, and
    inserts run-time checks and fix-up code that ensure
    correct execution when exceptions do happen."
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Feb 23 08:06:20 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 22 Feb 2026 13:17:30 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [1) stride-based. 2) pointer-chasing]
    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come
    to mind, but are they really common, apart from programming
classes?

Array and Matrix scientific codes with datasets bigger than cache.
    The dense cases are covered by stride-based hardware predictors, so
    they are not "otherwise". I am not familiar enough with sparse
    scientific codes to comment on whether they are 1), 2), or
    "otherwise".

BLAS Level 3 is not particularly external/LLC bandwidth intensive even without hardware predictors.

    There are HPC applications that are bandwidth limited; that's why they
    have the roofline performance model (e.g., <https://docs.nersc.gov/tools/performance/roofline/>). Of course for
    the memory-bound applications the uarch style is not particularly
important, as long as it allows exploiting the memory-level
    parallelism in some way; prefetching (by hardware or software) is a
    way that can be performed by any kind of uarch to exploit MLP.

    IA-64 implementations have performed relatively well for SPECfp. I
    don't know how memory-bound these applications were, though.

    For those applications that benefit from caches (e.g., matrix
    multiplication), memory-level parallelism is less important.

    Overwhelming majority of data served from
    L2 cache.
    That's with classic SIMD. It's possible that with AMX units it's no
    longer true.

    I very much doubt that. There is little point in adding an
    instruction that slows down execution by turning it from compute-bound
    to memory-bound.

The recent comparisons of branchy vs branchless binary search that we
carried out on the RWT forum seem to suggest that on modern CPUs the
branchless variant is faster even when the table does not fit in the LLC.

    Only two explanations come to my mind:

    1) The M3 has a hardware prefetcher that recognizes the pattern of a
    binary array search and prefetches accordingly. The cache misses from
    page table accesses might confuse the prefetcher, leading to worse
    performance eventually.

    2) (doubtful) The compiler recognizes the algorithm and inserts
    software prefetch instructions.

    Branchy variant manages to pull ahead only when TLB misses can't be
    served from L2$.
At least that's how I interpreted it.

Here is a result on a very modern CPU: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223974

And here is older gear: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223895
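For context, the two binary-search variants being compared can be sketched as follows (a guess at the general shape of such code; the actual uut2.c under test may differ in detail):

```c
#include <stddef.h>

/* Branchy: one conditional branch per step. Each decision depends on
   random data, so expect roughly one mispredict per step, but the
   misspeculated path keeps the memory system busy. */
size_t search_branchy(const int *v, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (v[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;  /* first index with v[i] >= key */
}

/* Branchless: the comparison feeds a data dependency (a cmov/select),
   so there is nothing to mispredict, but each load must complete
   before the next address is known. Assumes n >= 1. */
size_t search_branchless(const int *v, size_t n, int key)
{
    const int *base = v;
    while (n > 1) {
        size_t half = n / 2;
        base = (base[half - 1] < key) ? base + half : base;  /* cmov */
        n -= half;
    }
    return (size_t)(base - v) + (base[0] < key);
}
```

Both return the lower-bound index; the performance difference discussed in this thread comes entirely from how the hardware handles the branch vs. the data dependency.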

    I tried to run your code <https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
    on Zen4, but clang-14 converts uut2.c into branchy code. I could not
    get gcc-12 to produce branchless code from slightly adapted source
    code, either. My own attempt of using extended asm did not pass your
    sanity checks, so eventually I used the assembly code produced by
    clang-19 through godbolt. Here I see that branchy is faster for the
    100M array size on Zen4 (i.e., where on the M3 branchless is faster):

    [~/binary-search:165562] perf stat branchless 100000000 10000000 10
    4041.093082 msec. 0.404109 usec/point
    Performance counter stats for 'branchless 100000000 10000000 10':

    45436.88 msec task-clock 1.000 CPUs utilized
    276 context-switches 6.074 /sec
    0 cpu-migrations 0.000 /sec
1307 page-faults 28.765 /sec
226091936865 cycles 4.976 GHz
    1171349822 stalled-cycles-frontend 0.52% frontend cycles idle
    34666912723 instructions 0.15 insn per cycle
    0.03 stalled cycles per insn
    5829461905 branches 128.298 M/sec
    45963810 branch-misses 0.79% of all branches

    45.439541729 seconds time elapsed

    45.397207000 seconds user
    0.040008000 seconds sys


    [~/binary-search:165563] perf stat branchy 100000000 10000000 10
    3051.269998 msec. 0.305127 usec/point
    Performance counter stats for 'branchy 100000000 10000000 10':

    34308.48 msec task-clock 1.000 CPUs utilized
    229 context-switches 6.675 /sec
    0 cpu-migrations 0.000 /sec
1307 page-faults 38.096 /sec
172472673652 cycles 5.027 GHz
    49176146462 stalled-cycles-frontend 28.51% frontend cycles idle
    33346311766 instructions 0.19 insn per cycle
    1.47 stalled cycles per insn
    10322173655 branches 300.864 M/sec
    1583842955 branch-misses 15.34% of all branches

    34.311432848 seconds time elapsed

    34.264437000 seconds user
    0.044005000 seconds sys

    If you have recommendations what to use for the other parameters, I
    can run other sizes as well.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Feb 23 13:03:26 2026
    From Newsgroup: comp.arch

    On Mon, 23 Feb 2026 08:06:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 22 Feb 2026 13:17:30 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [1) stride-based. 2) pointer-chasing]
    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays
    come to mind, but are they really common, apart from programming
classes?

    Array and Matrix scientific codes with datasets bigger than
cache.
    The dense cases are covered by stride-based hardware predictors, so
    they are not "otherwise". I am not familiar enough with sparse
    scientific codes to comment on whether they are 1), 2), or
    "otherwise".

    BLAS Level 3 is not particularly external/LLC bandwidth intensive
    even without hardware predictors.

    There are HPC applications that are bandwidth limited; that's why they
    have the roofline performance model (e.g., <https://docs.nersc.gov/tools/performance/roofline/>).

    Sure. But that's not what I would call "dense".
    In my vocabulary "dense" starts at matmul(200x200,x200) or at LU
decomposition of a matrix of similar dimensions.
    I don't consider anything in BLAS Level 2 as "dense".
Maybe my definitions of the terms are unusual.

    Of course for
    the memory-bound applications the uarch style is not particularly
important, as long as it allows exploiting the memory-level
    parallelism in some way; prefetching (by hardware or software) is a
    way that can be performed by any kind of uarch to exploit MLP.

    IA-64 implementations have performed relatively well for SPECfp. I
    don't know how memory-bound these applications were, though.


    SPECfp2006 was mostly not memory-bound on CPUs of that era.
    OTOH, SPECfp_rate2006 for n>4 was memory-bound rather heavily.

    For those applications that benefit from caches (e.g., matrix multiplication), memory-level parallelism is less important.

    Overwhelming majority of data served from
    L2 cache.
    That's with classic SIMD. It's possible that with AMX units it's no
    longer true.

    I very much doubt that. There is little point in adding an
    instruction that slows down execution by turning it from compute-bound
    to memory-bound.


It does not slow down the execution. On the contrary, it speeds it up
so much that the speed of handling L2 misses begins to matter.
Note that this is just my speculation, which may be wrong.
IIRC, right now binary64-capable AMX is available only on Apple Silicon
(via SME) and maybe on IBM Z. I didn't play with either.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Feb 23 13:44:05 2026
    From Newsgroup: comp.arch

    On Mon, 23 Feb 2026 08:06:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


The recent comparisons of branchy vs branchless binary search that we
carried out on the RWT forum seem to suggest that on modern CPUs the
branchless variant is faster even when the table does not fit in the LLC.

    Only two explanations come to my mind:

    1) The M3 has a hardware prefetcher that recognizes the pattern of a
    binary array search and prefetches accordingly. The cache misses from
    page table accesses might confuse the prefetcher, leading to worse performance eventually.


    Coffee Lake certainly has no such prefetcher and nevertheless exhibits
    similar behavior.

    2) (doubtful) The compiler recognizes the algorithm and inserts
    software prefetch instructions.

    Branchy variant manages to pull ahead only when TLB misses can't be
    served from L2$.
At least that's how I interpreted it.

Here is a result on a very modern CPU: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223974

And here is older gear: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223895


    I tried to run your code <https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
    on Zen4, but clang-14 converts uut2.c into branchy code. I could not
    get gcc-12 to produce branchless code from slightly adapted source
    code, either. My own attempt of using extended asm did not pass your
    sanity checks, so eventually I used the assembly code produced by
    clang-19 through godbolt.

    I suppose that you have good reason for avoiding installation of clang17
    or later on one of your computers.

    Here I see that branchy is faster for the
    100M array size on Zen4 (i.e., where on the M3 branchless is faster):

[~/binary-search:165562] perf stat branchless 100000000 10000000 10
4041.093082 msec. 0.404109 usec/point
    Performance counter stats for 'branchless 100000000 10000000 10':

    45436.88 msec task-clock 1.000 CPUs utilized
    276 context-switches 6.074 /sec
    0 cpu-migrations 0.000 /sec
1307 page-faults 28.765 /sec
226091936865 cycles 4.976 GHz
1171349822 stalled-cycles-frontend 0.52% frontend cycles idle
34666912723 instructions 0.15 insn per cycle
    0.03 stalled cycles per insn
    5829461905 branches 128.298 M/sec
    45963810 branch-misses 0.79% of all branches


    45.439541729 seconds time elapsed

    45.397207000 seconds user
    0.040008000 seconds sys


    [~/binary-search:165563] perf stat branchy 100000000 10000000 10
    3051.269998 msec. 0.305127 usec/point
    Performance counter stats for 'branchy 100000000 10000000 10':

    34308.48 msec task-clock 1.000 CPUs utilized
    229 context-switches 6.675 /sec
    0 cpu-migrations 0.000 /sec
1307 page-faults 38.096 /sec
172472673652 cycles 5.027 GHz
49176146462 stalled-cycles-frontend 28.51% frontend cycles idle
33346311766 instructions 0.19 insn per cycle
    1.47 stalled cycles per insn
    10322173655 branches 300.864 M/sec
    1583842955 branch-misses 15.34% of all branches


    34.311432848 seconds time elapsed

    34.264437000 seconds user
    0.044005000 seconds sys

    If you have recommendations what to use for the other parameters, I
    can run other sizes as well.

    - anton


Run for every size from 100K to 2G in increments of x sqrt(2).
BTW, I prefer an odd number of iterations.
The point at which the majority of look-ups miss L3$ at least once
will hopefully be seen as a change of slope on the log(N) vs. duration
graph for the branchless variant.
I would not bother with performance counters. At least for me they
bring more confusion than insight.

    According to my understanding, on Zen4 the ratio of main DRAM
    latency to L3 latency is much higher than on either Coffee Lake or M3,
    both of which have unified LLC instead of split L3$.
    So, if on Zen2 branchy starts to win at ~2x L3 size, I will not be
    shocked. But I will be somewhat surprised.

I actually have access to a Zen3-based EPYC, where the above-mentioned
ratio is supposedly much bigger than on any competent client CPU
(Intel's Lunar/Arrow Lake do not belong to this category), but this
server is currently powered down and it's a bit of a hassle to turn it on.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Feb 23 10:14:00 2026
    From Newsgroup: comp.arch

    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and bandwidth bound. An example of each -

Many latency-bound cases occur in commercial (i.e. non-scientific)
programs, such as database systems. For example, imagine a company
employee file (table), with a (say 300 byte) record for each of its many
thousands of employees, each containing typical employee stuff. Now
suppose someone wants to know "What is the total salary of all the
employees in the "Sales" department?" With no index on "department", but
with it at a fixed displacement within each record, the code looks at
each record, does a trivial test on it, perhaps adds to a register, then
goes to the next record. This is almost certainly memory latency bound.
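That scan can be sketched in C (the record layout, field names, and the roughly 300-byte size are illustrative assumptions, not a real schema):

```c
#include <stddef.h>

/* Illustrative ~300-byte employee record; only two fields are touched
   per record, so each record visited costs fresh cache lines. */
struct employee {
    char other[288];    /* name, address, etc. */
    int dept;           /* at a fixed displacement in each record */
    double salary;
};

/* Full scan: no index on dept, so every record is examined. */
double dept_salary_total(const struct employee *emp, size_t n, int dept)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        if (emp[i].dept == dept)
            total += emp[i].salary;
    return total;
}
```

The per-record work (a compare and perhaps an add) is trivial, so the loop's speed is set by how fast the memory system can deliver the records.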

    For a memory bandwidth bound example, consider comparing two large text documents for equality.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Mon Feb 23 21:22:00 2026
    From Newsgroup: comp.arch

    In article <10ngao4$d5o$1@gal.iecc.com>, johnl@taugh.com (John Levine)
    wrote:

    I gather speculative execution of both branch paths worked OK if the
    branch tree wasn't too bushy. There were certainly ugly details,
    e.g., if there's a trap on a path that turns out not to be taken.

    Found a good CPU bug like that on an old AMD chip, the K6-II.

    It happened with a floating point divide by zero in the x87 registers,
    guarded by a test for division by zero, with floating-point traps enabled.
    The divide got speculatively executed, the trap was stored, the test
    revealed the divide would be by zero, the CPU tried to clean up, hit its
    bug, and just stopped. Power switch time.

    This only happened with the reverse divide instruction, which took the
    operands off the x87 stack in the opposite order from the usual FDIV. It
    was rarely used, so the bug didn't become widely known. But Microsoft's compiler used it occasionally.

    John
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Feb 23 21:33:51 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

IA-64 was to prevent AMD (and others) from making clones--that is all. Intel/HP would have had a patent wall 10 feet tall.

    It did not fail by being insufficiently protected, it failed because it
    did not perform.

    What did they (try to) patent?

    -----------------
    For a few cases, they have. But the problem is that these cases are
    also vectorizable.

    Their major problem was that they did not get enough ILP from
    general-purpose code.

    And even where they had enough ILP, OoO CPUs with SIMD extensions ate
    their lunch.

    And that should have been the end of the story. ... Sorry Ivan.

    Things have been very quiet there. The last post in their
    forum talks about the Mill trying to get to the same level as
    Intel/AMD with Coremark, and they say they are going to go to
    other microbenchmarks, where the Mill is supposed to be better.

    I'm not sure...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Feb 24 10:41:58 2026
    From Newsgroup: comp.arch

    John Dallman wrote:
    In article <10ngao4$d5o$1@gal.iecc.com>, johnl@taugh.com (John Levine)
    wrote:

    I gather speculative execution of both branch paths worked OK if the
    branch tree wasn't too bushy. There were certainly ugly details,
    e.g., if there's a trap on a path that turns out not to be taken.

    Found a good CPU bug like that on an old AMD chip, the K6-II.

    It happened with a floating point divide by zero in the x87 registers, guarded by a test for division by zero, with floating-point traps enabled. The divide got speculatively executed, the trap was stored, the test
    revealed the divide would be by zero, the CPU tried to clean up, hit its
    bug, and just stopped. Power switch time.

    This only happened with the reverse divide instruction, which took the operands off the x87 stack in the opposite order from the usual FDIV. It
    was rarely used, so the bug didn't become widely known. But Microsoft's compiler used it occasionally.

    Still an impressive find!

    I can see that since it would be (almost?) 100% reproducible, you could
bisect the executable (in a debugger) to home in on where it froze?

    Trying to single-step up to the crash would negate the required
    speculation, right?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Feb 24 09:50:43 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


The recent comparisons of branchy vs branchless binary search that we
carried out on the RWT forum seem to suggest that on modern CPUs the
branchless variant is faster even when the table does not fit in the LLC.

    Only two explanations come to my mind:

    1) The M3 has a hardware prefetcher that recognizes the pattern of a
    binary array search and prefetches accordingly. The cache misses from
    page table accesses might confuse the prefetcher, leading to worse
    performance eventually.


Coffee Lake certainly has no such prefetcher and nevertheless exhibits similar behavior.

    I have now looked more closely, and the npoints parameter has a
    significant influence. If it is small enough, branchless fits into
    caches while branchy does not (see below), which might be one
    explanation for the results. Given that I have not seen npoints
    specified in any of the postings that mention branchless winning for
    veclen sizes exceeding the L3 cache size, it could be this effect. Or
    maybe some other effect. Without proper parameters one cannot
    reproduce the experiment to investigate it.

    I tried to run your code
    <https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
    on Zen4, but clang-14 converts uut2.c into branchy code. I could not
    get gcc-12 to produce branchless code from slightly adapted source
    code, either. My own attempt of using extended asm did not pass your
    sanity checks, so eventually I used the assembly code produced by
    clang-19 through godbolt.

    I suppose that you have good reason for avoiding installation of clang17
    or later on one of your computers.

    Sure. It costs time.

    If you have recommendations what to use for the other parameters, I
    can run other sizes as well.

    - anton


    Run for every size from 100K to 2G in increments of x sqrt(2).

    There is no integer n where 2G=100k*sqrt(2)^n. So I used the numbers
    shown below. You did not give any indication on npoints, so I
    investigated myself, and found that branchless will miss the L3 cache
    with npoints=100000, so I used that and used reps=200.

    BTW, I prefer odd number of iterations.

    You mean odd reps? Why?

    Anyway, here are the usec/point numbers:

    Zen4 8700G Tiger Lake 1135G7
    veclen branchless branchy branchless branchy
    100000 0.030945 0.063620 0.038220 0.080542
    140000 0.031035 0.068244 0.034315 0.084896
    200000 0.038302 0.073819 0.045602 0.089972
    280000 0.037056 0.079651 0.042108 0.096271
    400000 0.046685 0.081895 0.055457 0.104561
    560000 0.043028 0.088356 0.055095 0.113646
    800000 0.051180 0.092201 0.074570 0.123403
    1120000 0.048806 0.096621 0.088121 0.142758
    1600000 0.060206 0.101069 0.131099 0.172171
    2240000 0.073547 0.115428 0.167353 0.205602
    3200000 0.094561 0.139996 0.208903 0.234939
    4500000 0.121049 0.162757 0.244457 0.268286
    6400000 0.152417 0.178611 0.292024 0.295204
    9000000 0.189134 0.192100 0.320426 0.327127
    12800000 0.219408 0.208083 0.372084 0.353530
    18000000 0.237684 0.222140 0.418645 0.389785
    25000000 0.270798 0.236786 0.462689 0.415937
    35000000 0.296994 0.254001 0.526235 0.451467
    50000000 0.330582 0.268768 0.599331 0.478667
    70000000 0.356788 0.288526 0.622659 0.522092
    100000000 0.388326 0.305980 0.698470 0.562841
    140000000 0.407774 0.321496 0.737884 0.609814
    200000000 0.442434 0.336242 0.848403 0.654830
    280000000 0.455125 0.356382 0.902886 0.729970
    400000000 0.496894 0.372735 1.120986 0.777920
    560000000 0.520664 0.393827 1.173606 0.855461
    800000000 0.544343 0.412087 1.759271 0.901011
    1100000000 0.584389 0.431854 1.862866 0.965724
    1600000000 0.614764 0.455844 2.046371 1.027111
    2000000000 0.622513 0.467445 2.149251 1.089775

    So branchy surpasses branchless at veclen=12.8M on both machines, for npoints=100k.

    Concerning the influence of npoints, I have worked with veclen=20M in
    the following.

    1) If <npoints> is small, for branchy the branch predictor will
    learn the pattern on the first repetition, and predict correctly in
    the following repetitions; on Zen4 and Tiger Lake I see the
following percentages of branch mispredictions (for
    <npoints>*<rep>=20_000_000, <veclen>=20_000_000):

    npoints Zen4 Tiger Lake
    250 0.03% 0.51%
    500 0.03% 10.40%
    1000 0.03% 13.51%
    2000 0.07% 13.84%
    4000 13.45% 12.61%
    8000 15.26% 12.31%
    16000 15.56% 12.26%
    32000 15.60% 12.23%
    80000 15.59% 12.24%

    Tiger Lake counts slightly more branches than Zen4 (for the same
binary): 1746M vs. 1703M, but there is also a genuinely lower number of
mispredictions on Tiger Lake for high npoints; at npoints=80000:
    214M on Tiger Lake vs. 266M on Zen4. My guess is that in Zen4 the
    mispredicts of the deciding branch interfere with the prediction of
    the loop branches, and that the anti-interference measures on Tiger
Lake result in the branch predictor being less effective at
    npoints=500..2000.

    2) If <npoints> is small all the actually accessed array elements
    will fit into some cache. Also, even if npoints and veclen are
    large enough that they do not all fit, with a smaller <npoints> a
    larger part of the accesses will happen to a cache, and a larger
    part to a lower-level cache. With the same parameters as above,
    branchy sees on a Ryzen 8700G (Zen4 with 16MB L3 cache) the
    following numbers of ls_any_fills_from_sys.all_dram_io
    (LLC-load-misses and l3_lookup_state.l3_miss result in <not
    supported> on this machine), and on a Core i5-1135G7 (Tiger Lake with
    8MB L3) the following number of LLC-load-misses:

    branchy branchless
    8700G 1135G7 8700G 1135G7
    npoints fills LLC-load-misses fills LLC-load-misses
    250 1_156_206 227_274 1_133_672 39_212
    500 1_189_372 19_820_264 1_125_836 47_994
    1000 1_170_829 125_727_181 1_130_941 96_516
    2000 1_310_015 279_063_572 1_173_501 299_297
    4000 73_528_665 452_147_169 1_151_042 5_661_917
    8000 195_883_759 501_433_404 1_248_638 58_877_208
    16000 301_559_180 511_420_040 2_688_530 101_811_222
    32000 389_512_147 511_713_759 29_799_206 116_312_019
    80000 402_131_449 512_460_513 91_276_752 118_762_341

    The 16MB L3 cache of the 8700G has 262_144 cache lines, the 8MB L3
    of the 1135G7 has 131_072 cache lines. How come branchy has so many
    cache misses already at relatively low npoints? Each search
    performs ~25 architectural memory accesses; in addition, in branchy
    we see a number of misspeculated memory accesses for each
    architectural access, resulting in the additional memory accesses.

    For branchless the 8700G sees a ramp-up of memory accesses more than
    the factor of 2 later compared to the 1135G7 that the differences in
    cache sizes would suggest. The 1135G7 cache is divided into 4 2MB
    slices, and accesses are assigned by physical address (i.e., the L3
    cache does not function like an 8MB cache with a higher
    associativity), but given the random-access nature of this workload,
    I would expect the accesses to be distributed evenly across the
    slices, with little bad effects from this cache organization.

    Given the memory access numbers of branchy and branchless above, I
    expected to see a speedup of branchy in those cases where branchless
    has a lot of L3 misses, so I decided to use npoints=100k and rep=200
    in the experiments further up. And my expectations turned out to be
    right, at least on these two machines.

    - anton

    The point at which the majority of look-ups miss L3$ at least once
    will hopefully show up as a change of slope on the log(N) vs. duration
    graph for the branchless variant.
    I would not bother with performance counters. At least for me they
    bring more confusion than insight.

    According to my understanding, on Zen4 the ratio of main DRAM
    latency to L3 latency is much higher than on either Coffee Lake or M3,
    both of which have a unified LLC instead of a split L3$.
    So, if on Zen4 branchy starts to win at ~2x the L3 size, I will not be
    shocked. But I will be somewhat surprised.

    I actually have access to a Zen3-based EPYC, where the above-mentioned
    ratio is supposedly much bigger than on any competent client CPU
    (Intel's Lunar/Arrow Lake do not belong to this category), but this
    server is currently powered down and it's a bit of a hassle to turn it
    on.


    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Feb 24 10:46:32 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 22 Feb 2026 13:17:30 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Array and Matrix scientific codes with datasets bigger than
    cache.

    The dense cases are covered by stride-based hardware predictors, so
    they are not "otherwise". I am not familiar enough with sparse
    scientific codes to comment on whether they are 1), 2), or
    "otherwise".

    BLAS Level 3 is not particularly external/LLC bandwidth intensive
    even without hardware predictors.

    There are HPC applications that are bandwidth limited; that's why they
    have the roofline performance model (e.g.,
    <https://docs.nersc.gov/tools/performance/roofline/>).

    Sure. But that's not what I would call "dense".
    In my vocabulary "dense" starts at matmul(200x200, 200x200) or at LU
    decomposition of a matrix of similar dimensions.

    "Dense matrices" vs. "sparse matrices", not in terms of FLOPS/memory
    access.

    So adding two dense matrices tends to be memory bandwidth bound, but stride-based prefetchers help to avoid getting any extra latency
    beyond that coming from the bandwidth limits (if any).

    Likewise, John McCalpin's Stream benchmark uses dense vectors IIRC,
    but is memory bandwidth limited.

    Overwhelming majority of data served from
    L2 cache.
    That's with classic SIMD. It's possible that with AMX units it's no
    longer true.

    I very much doubt that. There is little point in adding an
    instruction that slows down execution by turning it from compute-bound
    to memory-bound.


    It does not slow down the execution. To the contrary, it speeds it up
    so much that speed of handling of L2 misses begins to matter.
    Pay attention that it's just my speculation that can be wrong.
    IIRC, right now binary64-capable AMX is available only on Apple Silicon
    (via SME) and may be on IBM Z. I didn't play with either.

    AMX is an Intel extension of AMD64 (and IA-32); ARM's SME also has
    "matrix" in its name, but is not AMX. Looking at <https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions>, it seems
    to me that the point of AMX is to deal with small matrices (16x64
    times 64x16 for Int8, 16x32 times 32x16 for 16-bit types) of small
    elements (INT8, BF16, FP16 and complex FP16 numbers) in a special
    unit. Apparently the AMX unit in Granite Rapids consumes 2048 bytes
    in 16 cycles, i.e., 128 bytes per cycle and produces 256 or 512 bytes
    in these 16 cycles. If each of these matrix multiplications
    happens on its own, the result will certainly be bandwidth-bound to L2
    and maybe already to L1. If, OTOH, these operations are part of a
    larger matrix multiplication, then cache blocking can probably lower
    the bandwidth to L2 enough, and reusing one of the operands in
    registers can lower the bandwidth to L1 enough.
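
    Cache blocking as referred to here can be sketched generically
    (ordinary scalar C, not AMX tile code; BS is an assumed tile size):

```c
/* C += A*B for n x n row-major matrices, blocked so that one
   BS x BS tile of each operand is reused while it is cache-resident,
   lowering the bandwidth demand on the next cache level. */
enum { BS = 64 };

void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];   /* reused across j */
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

    The same loop structure, with the inner three loops replaced by a
    hardware tile operation, is roughly what an AMX-based kernel does.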

    In any case, Intel will certainly not add hardware that exceeds the
    bandwidth boundaries in all common use cases.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Feb 24 11:25:08 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and
    bandwidth bound.

    If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can
    help.

    Many occur in commercial (i.e. non-scientific) programs, such as
    database systems. For example, imagine a company employee file (table),
    with a (say 300-byte) record for each of its many thousands of
    employees, each containing typical employee stuff. Now suppose someone
    wants to know "What is the total salary of all the employees in the
    Sales department?" With no index on "department", but with that field
    at a fixed displacement within each record, the code looks at each
    record, does a trivial test on it, perhaps adds to a register, then
    goes to the next record. This is almost certainly memory latency bound.

    If the records are stored sequentially, either because the programming
    language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

    If the records are stored randomly, but are pointed to by an array,
    one can prefetch the relevant fields easily, again turning the problem
    into a latency-bound problem. If, OTOH, the records are stored
    randomly and are in a linked list, this problem is a case of
    pointer-chasing and is indeed latency-bound.
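
    The array-of-pointers case can be sketched with explicit software
    prefetching. The record layout and the prefetch distance here are
    made up for illustration; __builtin_prefetch is the GCC/Clang
    builtin:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout, roughly the 300-byte record above. */
struct employee {
    char   dept[20];
    double salary;
    char   other[272];
};

/* Sum the salaries of one department over an array of pointers to
   randomly placed records.  Because recs[i+DIST] is known in advance,
   we can prefetch ahead and overlap the DRAM latencies instead of
   paying them one after the other. */
double dept_salary(struct employee *const recs[], size_t n,
                   const char *dept)
{
    enum { DIST = 8 };              /* prefetch distance; tune per machine */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(recs[i + DIST], 0 /* read */, 0);
        if (strcmp(recs[i]->dept, dept) == 0)   /* the trivial test */
            sum += recs[i]->salary;
    }
    return sum;
}
```

    With a linked list, by contrast, the address of the next record is
    not known until the current one has been loaded, so no such
    prefetch is possible.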

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Feb 24 12:30:56 2026
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So adding two dense matrices tends to be memory bandwidth bound, but stride-based prefetchers help to avoid getting any extra latency
    beyond that coming from the bandwidth limits (if any).

    Likewise, John McCalpin's Stream benchmark uses dense vectors IIRC,
    but is memory bandwidth limited.

    CFD codes are usually memory bandwidth limited; they use sparse
    matrices.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Feb 24 17:23:27 2026
    From Newsgroup: comp.arch

    On Tue, 24 Feb 2026 09:50:43 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    The recent comparisons of branchy vs branchless binary search
    that we carried on RWT forum seems to suggest that on modern CPUs
    branchless variant is faster even when the table does not fit in
    LLC.
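
    For concreteness, the two variants can be sketched as follows. This
    is a minimal reconstruction, not the actual RWT benchmark code;
    the names and the lower-bound convention are assumptions:

```c
#include <stddef.h>

/* branchy: classic binary search; each comparison is a conditional
   branch, so the predictor can guess ahead and issue the next loads
   speculatively.  Returns the index of the largest element <= key,
   or 0 if there is none. */
size_t search_branchy(const double *v, size_t n, double key)
{
    size_t lo = 0, hi = n;              /* search in [lo,hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (v[mid] <= key)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* branchless: the comparison feeds a conditional move, so the next
   load address depends on the loaded data, serializing the accesses
   but avoiding mispredictions. */
size_t search_branchless(const double *v, size_t n, double key)
{
    size_t lo = 0;
    while (n > 1) {
        size_t half = n / 2;
        lo = (v[lo + half] <= key) ? lo + half : lo;  /* hoped: cmov */
        n -= half;
    }
    return lo;
}
```

    Whether the second version actually compiles to a cmov varies by
    compiler and version; as noted further down, clang-14 turned a
    branchless source into branchy code.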

    Only two explanations come to my mind:

    1) The M3 has a hardware prefetcher that recognizes the pattern of
    a binary array search and prefetches accordingly. The cache
    misses from page table accesses might confuse the prefetcher,
    leading to worse performance eventually.


    Coffee Lake certainly has no such prefetcher and nevertheless
    exhibits similar behavior.

    I have now looked more closely, and the npoints parameter has a
    significant influence. If it is small enough, branchless fits into
    caches while branchy does not (see below), which might be one
    explanation for the results. Given that I have not seen npoints
    specified in any of the postings that mention branchless winning for
    veclen sizes exceeding the L3 cache size, it could be this effect.

    I think that it was said more than once throughout the thread that all measurements were taken with npoints=1M and rep=11.

    Or
    maybe some other effect. Without proper parameters one cannot
    reproduce the experiment to investigate it.

    I tried to run your code
    <https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
    on Zen4, but clang-14 converts uut2.c into branchy code. I could
    not get gcc-12 to produce branchless code from slightly adapted
    source code, either. My own attempt of using extended asm did not
    pass your sanity checks, so eventually I used the assembly code
    produced by clang-19 through godbolt.

    I suppose that you have good reason for avoiding installation of
    clang17 or later on one of your computers.

    Sure. It costs time.

    If you have recommendations what to use for the other parameters, I
    can run other sizes as well.

    - anton


    Run for every size from 100K to 2G in increments of x sqrt(2).

    There is no integer n where 2G=100k*sqrt(2)^n. So I used the numbers
    shown below. You did not give any indication on npoints, so I
    investigated myself, and found that branchless will miss the L3 cache
    with npoints=100000, so I used that and used reps=200.
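
    The sqrt(2) spacing and the mismatch at 2G can be checked with a
    throwaway sketch (the veclens in the measurements are these
    geometric points rounded to convenient values):

```c
#include <math.h>

/* Fill sizes[] with the geometric series 100000 * sqrt(2)^k that
   stays <= 2e9; returns the number of entries.  sqrt(2)^28 * 1e5 =
   1.6384e9, and the next point overshoots 2e9, so no integer power
   of sqrt(2) lands exactly on 2G. */
int veclen_series(double *sizes, int max)
{
    int n = 0;
    for (double v = 1e5; v <= 2e9 && n < max; v *= sqrt(2.0))
        sizes[n++] = v;
    return n;
}
```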

    BTW, I prefer odd number of iterations.

    You mean odd reps? Why?

    Because I like quick tests. Since I always use large npoints, for the
    test to finish quickly I have to use small rep. And for small rep the
    even-size median has non-negligible bias.



    Anyway, here are the usec/point numbers:

    Zen4 8700G Tiger Lake 1135G7
    veclen branchless branchy branchless branchy
    100000 0.030945 0.063620 0.038220 0.080542
    140000 0.031035 0.068244 0.034315 0.084896
    200000 0.038302 0.073819 0.045602 0.089972
    280000 0.037056 0.079651 0.042108 0.096271
    400000 0.046685 0.081895 0.055457 0.104561
    560000 0.043028 0.088356 0.055095 0.113646
    800000 0.051180 0.092201 0.074570 0.123403
    1120000 0.048806 0.096621 0.088121 0.142758
    1600000 0.060206 0.101069 0.131099 0.172171
    2240000 0.073547 0.115428 0.167353 0.205602
    3200000 0.094561 0.139996 0.208903 0.234939
    4500000 0.121049 0.162757 0.244457 0.268286
    6400000 0.152417 0.178611 0.292024 0.295204
    9000000 0.189134 0.192100 0.320426 0.327127
    12800000 0.219408 0.208083 0.372084 0.353530
    18000000 0.237684 0.222140 0.418645 0.389785
    25000000 0.270798 0.236786 0.462689 0.415937
    35000000 0.296994 0.254001 0.526235 0.451467
    50000000 0.330582 0.268768 0.599331 0.478667
    70000000 0.356788 0.288526 0.622659 0.522092
    100000000 0.388326 0.305980 0.698470 0.562841
    140000000 0.407774 0.321496 0.737884 0.609814
    200000000 0.442434 0.336242 0.848403 0.654830
    280000000 0.455125 0.356382 0.902886 0.729970
    400000000 0.496894 0.372735 1.120986 0.777920
    560000000 0.520664 0.393827 1.173606 0.855461
    800000000 0.544343 0.412087 1.759271 0.901011
    1100000000 0.584389 0.431854 1.862866 0.965724
    1600000000 0.614764 0.455844 2.046371 1.027111
    2000000000 0.622513 0.467445 2.149251 1.089775

    So branchy surpasses branchless at veclen=12.8M on both machines, for npoints=100k.

    Concerning the influence of npoints, I have worked with veclen=20M in
    the following.

    1) If <npoints> is small, for branchy the branch predictor will
    learn the pattern on the first repetition, and predict correctly in
    the following repetitions; on Zen4 and Tiger Lake I see the
    following percentage of branch mispredictions (for
    <npoints>*<rep>=20_000_000, <veclen>=20_000_000):

    npoints Zen4 Tiger Lake
    250 0.03% 0.51%
    500 0.03% 10.40%
    1000 0.03% 13.51%
    2000 0.07% 13.84%
    4000 13.45% 12.61%
    8000 15.26% 12.31%
    16000 15.56% 12.26%
    32000 15.60% 12.23%
    80000 15.59% 12.24%

    Tiger Lake counts slightly more branches than Zen4 (for the same
    binary): 1746M vs. 1703M, but there is also a genuinely lower number
    of mispredictions on Tiger Lake for high npoints; at npoints=80000:
    214M on Tiger Lake vs. 266M on Zen4. My guess is that on Zen4 the
    mispredicts of the deciding branch interfere with the prediction of
    the loop branches, and that the anti-interference measures on Tiger
    Lake result in the branch predictor being less effective at
    npoints=500..2000.

    2) If <npoints> is small all the actually accessed array elements
    will fit into some cache. Also, even if npoints and veclen are
    large enough that they do not all fit, with a smaller <npoints> a
    larger part of the accesses will happen to a cache, and a larger
    part to a lower-level cache. With the same parameters as above,
    branchy sees on a Ryzen 8700G (Zen4 with 16MB L3 cache) the
    following numbers of ls_any_fills_from_sys.all_dram_io
    (LLC-load-misses and l3_lookup_state.l3_miss result in <not
    supported> on this machine), and on a Core i5-1135G7 (Tiger Lake with
    8MB L3) the following number of LLC-load-misses:

    branchy branchless
    8700G 1135G7 8700G 1135G7
    npoints fills LLC-load-misses fills LLC-load-misses
    250 1_156_206 227_274 1_133_672 39_212
    500 1_189_372 19_820_264 1_125_836 47_994
    1000 1_170_829 125_727_181 1_130_941 96_516
    2000 1_310_015 279_063_572 1_173_501 299_297
    4000 73_528_665 452_147_169 1_151_042 5_661_917
    8000 195_883_759 501_433_404 1_248_638 58_877_208
    16000 301_559_180 511_420_040 2_688_530 101_811_222
    32000 389_512_147 511_713_759 29_799_206 116_312_019
    80000 402_131_449 512_460_513 91_276_752 118_762_341

    The 16MB L3 cache of the 8700G has 262_144 cache lines, the 8MB L3
    of the 1135G7 has 131_072 cache lines. How come branchy has so many
    cache misses already at relatively low npoints? Each search
    performs ~25 architectural memory accesses; in addition, in branchy
    we see a number of misspeculated memory accesses for each
    architectural access, resulting in the additional memory accesses.

    For branchless the 8700G sees a ramp-up of memory accesses more than
    the factor of 2 later compared to the 1135G7 that the differences in
    cache sizes would suggest. The 1135G7 cache is divided into 4 2MB
    slices, and accesses are assigned by physical address (i.e., the L3
    cache does not function like an 8MB cache with a higher
    associativity), but given the random-access nature of this workload,
    I would expect the accesses to be distributed evenly across the
    slices, with little bad effects from this cache organization.

    Given the memory access numbers of branchy and branchless above, I
    expected to see a speedup of branchy in those cases where branchless
    has a lot of L3 misses, so I decided to use npoints=100k and rep=200
    in the experiments further up. And my expectations turned out to be
    right, at least on these two machines.

    - anton


    Intuitively, I don't like measurements with so small an npoints
    parameter, but objectively they are unlikely to be much different from
    1M points. Looking at them, your results do not differ radically from
    those we got on the RWT forum.

    CPU   LLC(MB)  cross point(M)  sz/LLC
    TGL      8          12.8         12.8
    CFL     12          30.0         15.0
    Zen4    16          12.8          6.4
    M3      24         200.0         66.7
    RPT     33        1300.0        315.2

    The last line is Intel Raptor Lake (i7-14700). On this CPU I had to use
    1/3rd of installed memory in order to find a cross point.
    Here at vecsize=41M, which is ten times bigger than the LLC, branchless
    is still almost 1.5x faster than branchy (0.216 vs. 0.315 usec/point).

    In all cases ratio of vector size at cross-point to LLC size is much
    bigger than 1.
    On Tiger Lake ratio is smaller than on others, probably because of
    Intel's slow "mobile" memory controller, but even here it is
    significantly above 1.

    So far I don't see how your measurements disprove my theory of page
    table not fitting in L2 cache.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Feb 24 07:51:13 2026
    From Newsgroup: comp.arch

    On 2/24/2026 3:25 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and bandwidth
    bound.

    If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can
    help.

    Many occur in commercial (i.e. non scientific) programs, such as
    database systems. For example, imagine a company employee file (table),
    with a (say 300 byte) record for each of its many thousands of employees
    each containing typical employee stuff). Now suppose someone wants to
    know "What is the total salary of all the employees in the "Sales"
    department. With no index on "department", but it is at a fixed
    displacement within each record, the code looks at each record, does a
    trivial test on it, perhaps adds to a register, then goes to the next
    record. This it almost certainly memory latency bound.

    If the records are stored sequentially, either because the programming language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.


    If the records are stored randomly, but are pointed to by an array,
    one can prefetch the relevant fields easily, again turning the problem
    into a latency-bound problem. If, OTOH, the records are stored
    randomly and are in a linked list, this problem is a case of
    pointer-chasing and is indeed latency-bound.

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.

    Yes, I miscalculated. My intent was to force a DRAM access for each
    record, which would make the problem worse (DRAM access time versus L3
    access time). But I think the same issue would apply even if it fits
    in an L3 cache; if it doesn't, increase the record size or number of
    records so that it doesn't fit in L3. But this just changes the number
    of prefetches in process needed to prevent it from becoming latency
    bound.

    Thanks.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Feb 24 18:26:22 2026
    From Newsgroup: comp.arch

    On Tue, 24 Feb 2026 10:46:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 22 Feb 2026 13:17:30 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Array and Matrix scientific codes with datasets bigger than
    cache.

    The dense cases are covered by stride-based hardware
    predictors, so they are not "otherwise". I am not familiar
    enough with sparse scientific codes to comment on whether they
    are 1), 2), or "otherwise".

    BLAS Level 3 is not particularly external/LLC bandwidth intensive
    even without hardware predictors.

    There are HPC applications that are bandwidth limited; that's why
    they have the roofline performance model (e.g.,
    <https://docs.nersc.gov/tools/performance/roofline/>).

    Sure. But that's not what I would call "dense".
    In my vocabulary "dense" starts at matmul(200x200, 200x200) or at LU
    decomposition of a matrix of similar dimensions.

    "Dense matrices" vs. "sparse matrices", not in terms of FLOPS/memory
    access.

    So adding two dense matrices tends to be memory bandwidth bound, but stride-based prefetchers help to avoid getting any extra latency
    beyond that coming from the bandwidth limits (if any).

    Likewise, John McCalpin's Stream benchmark uses dense vectors IIRC,
    but is memory bandwidth limited.

    Overwhelming majority of data served from
    L2 cache.
    That's with classic SIMD. It's possible that with AMX units it's
    no longer true.

    I very much doubt that. There is little point in adding an
    instruction that slows down execution by turning it from
    compute-bound to memory-bound.


    It does not slow down the execution. To the contrary, it speeds it up
    so much that speed of handling of L2 misses begins to matter.
    Pay attention that it's just my speculation that can be wrong.
    IIRC, right now binary64-capable AMX is available only on Apple
    Silicon (via SME) and may be on IBM Z. I didn't play with either.

    AMX is an Intel extension of AMD64 (and IA-32); ARM's SME also has
    "matrix" in its name, but is not AMX.

    Apple's extension was called AMX back when it was accessible only
    through close-source library calls and had no instruction-level
    documentation. Later on Apple pushed it through Arm's standardization
    process and since then everybody calls it SME. But it is the same thing.

    Looking at
    <https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions>, it seems
    to me that the point of AMX is to deal with small matrices (16x64
    times 64x16 for Int8, 16x32 times 32x16 for 16-bit types) of small
    elements (INT8, BF16, FP16 and complex FP16 numbers) in a special
    unit. Apparently the AMX unit in Granite Rapids consumes 2048 bytes
    in 16 cycles, i.e., 128 bytes per cycle and produces 256 or 512 bytes
    in these 16 cycles. If each of these matrix multiplications
    happens on its own, the result will certainly be bandwidth-bound to L2
    and maybe already to L1. If, OTOH, these operations are part of a
    larger matrix multiplication, then cache blocking can probably lower
    the bandwidth to L2 enough, and reusing one of the operands in
    registers can lower the bandwidth to L1 enough.


    The question is not whether bandwidth from L2 to AMX is sufficient. It
    is.
    The interesting part is what bandwidth from LLC *into* L2 is needed in
    scenario of multiplication of big matrices.
    Supposedly 1/3rd of L2 is used for matrix A, which stays there for a
    very long time, another 1/3rd holds matrix C, which remains there for,
    say, 8 iterations, and the remaining 1/3rd is matrix B, which has to be
    reloaded on every iteration. AMX increases the frequency at which we
    have to reload B and C.
    Assuming L2 = 2MB and square matrices, N = sqrt(2MB/8B/3) = 295,
    rounded down to 288. Each step takes 288**3 = 24M FMA operations. If a
    future AMX does 128 FMAs per clock, then the iteration takes 188K
    clocks.
    288**2 * 8B / 188K = 3.5 B/clock. That's easy for one AMX/L2 combo, but
    very hard when you have 64 units like those competing for the same
    LLC.
    Yes, the suggestion of 128 FMA/clock was exaggerated, but 64 is
    realistic and 32 is an absolute minimum. Building a double-precision
    capable AMX is not worth the trouble if it cannot do at least 32 FMAs.
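
    The arithmetic above condenses to one formula: the reload traffic
    per clock is N*N*8B / (N**3/fma_per_clock) = 8*fma_per_clock/N
    bytes per clock. A sketch using the assumed numbers (2MB L2 split
    three ways, N=288, binary64 elements):

```c
/* Bytes per clock of reload traffic into L2 for one blocked step:
   one N x N binary64 tile (N*N*8 bytes) is reloaded per step, and
   each step performs N**3 FMAs at fma_per_clock FMAs per clock. */
double l2_bytes_per_clock(int n, double fma_per_clock)
{
    double fmas   = (double)n * n * n;       /* 288**3 ~= 24M       */
    double clocks = fmas / fma_per_clock;    /* ~188K at 128 FMA/clk */
    return (double)n * n * 8.0 / clocks;     /* ~3.5 B/clock        */
}
```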

    In any case, Intel will certainly not add hardware that exceeds the
    bandwidth boundaries in all common use cases.

    - anton

    I wouldn't be so sure.
    In relatively recent history Intel released Knights Landing, which was
    unbalanced to an extreme. Neither instruction decoders nor register
    ports nor caches were sufficient to feed its vector units in any
    workload that even remotely resembled a useful algorithm.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Feb 24 17:30:31 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 24 Feb 2026 09:50:43 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [...]
    I think that it was said more than once throughout the thread that all
    measurements were taken with npoints=1M and rep=11.

    I obviously missed it, that's why I asked for the parameters earlier.

    And for small rep
    even-size median has non-negligible bias.

    The even-size median is the arithmetic mean of the (n-1)/2-lowest and
    the (n+1)/2-lowest results. What bias do you have in mind?
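
    For reference, the even-n median as described here, as a minimal
    sketch:

```c
/* Median of an ascending array of n timing results: for odd n the
   middle element, for even n the mean of the two middle elements
   (0-indexed positions (n-1)/2 and (n+1)/2 = n/2-1 and n/2). */
double median_sorted(const double *a, int n)
{
    return (n % 2) ? a[n / 2]
                   : 0.5 * (a[n / 2 - 1] + a[n / 2]);
}
```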

    Here are the results with npoints=1M reps=11:

    Zen4 8700G Tiger Lake 1135G7
    veclen branchless branchy branchless branchy
    100000 0.030187 0.063023 0.038614 0.081332
    140000 0.029779 0.066112 0.034797 0.085259
    200000 0.037532 0.072898 0.045945 0.090583
    280000 0.036677 0.078795 0.042678 0.097386
    400000 0.045777 0.084639 0.055256 0.104249
    560000 0.043092 0.086424 0.055677 0.112252
    800000 0.050325 0.091714 0.078172 0.123927
    1120000 0.048855 0.095756 0.091508 0.141819
    1600000 0.061101 0.099660 0.133560 0.168403
    2240000 0.082970 0.113055 0.166363 0.205129
    3200000 0.120834 0.137753 0.211253 0.235155
    4500000 0.149028 0.160359 0.241716 0.268885
    6400000 0.178675 0.177070 0.294336 0.294931
    9000000 0.219919 0.195292 0.322921 0.326113
    12800000 0.240604 0.209397 0.380076 0.352707
    18000000 0.257400 0.224952 0.409813 0.388834
    25000000 0.293160 0.239217 0.472999 0.413559
    35000000 0.307939 0.257587 0.522223 0.450754
    50000000 0.346399 0.271420 0.595265 0.478713
    70000000 0.375525 0.286462 0.626350 0.516533
    100000000 0.401462 0.305444 0.699097 0.555928
    140000000 0.419590 0.322480 0.732551 0.611815
    200000000 0.451003 0.337880 0.854034 0.647795
    280000000 0.468193 0.359133 0.914159 0.729234
    400000000 0.506300 0.372219 1.150605 0.773266
    560000000 0.520346 0.390277 1.167567 0.851592
    800000000 0.556292 0.412035 1.799293 0.899768
    1100000000 0.572654 0.432598 1.861136 0.963649
    1600000000 0.618735 0.450526 2.062255 1.026560
    2000000000 0.629088 0.465219 2.132035 1.073728

    Intuitively, I don't like measurements with so small an npoints
    parameter, but objectively they are unlikely to be much different from
    1M points.

    It's pretty similar, but the first point where branchy is faster is
    veclen=6.4M for the 8700G and still 12.8M for the 1135G7; on the
    1135G7 the 6.4M and 9M veclens are pretty close for both npoints.

    My current theory why the crossover is so late is twofold:

    1) branchy performs a lot of misspeculated accesses, which reduce the
    cache hit rate (and increase the cache miss rate) already at
    relatively low npoints (as shown in <2026Feb24.105043@mips.complang.tuwien.ac.at>), and likely also for
    low veclen with high npoints. E.g., already npoints=4000 has >2M
    fills from memory on the 8700G, and npoints=500 has >2M
    LLC-load-misses at 1135G7, whereas the corresponding numbers for
    branchless are npoints=16000 and npoints=4000. This means that
    branchy does not just suffer from the misprediction penalty, but also
    from fewer cache hits for the middle stages of the search. How strong
    this effect is depends on the memory subsystem of the CPU core.

    2) In the last levels of the larger searches the better memory-level parallelism of branchy leads to branchy catching up and eventually
    overtaking branchless, but it has to first compensate for the slowness
    in the earlier levels before it can reach crossover. That's why we
    see the crossover point at significantly larger sizes than L3.
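The "branchy" and "branchless" variants being measured are not shown in the thread; as a sketch of the two idioms (hypothetical code, not Michael S's actual benchmark), here is a branchy lower-bound search next to the conditional-move formulation:

```c
#include <stddef.h>

/* Branchy: conditional branches; fast when the comparisons are
   predictable, and the misspeculated path touches memory, which (as
   discussed above) can change the cache contents. */
size_t search_branchy(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo; /* first index i with a[i] >= key */
}

/* Branchless: the comparison feeds a conditional move instead of a
   branch, so there are no mispredictions, but each level's load
   depends on the previous one (latency-bound). */
size_t search_branchless(const int *a, size_t n, int key)
{
    const int *base = a;
    if (n == 0)
        return 0;
    while (n > 1) {
        size_t half = n / 2;
        base = (base[half - 1] < key) ? base + half : base; /* CMOV */
        n -= half;
    }
    return (size_t)(base - a) + (base[0] < key);
}
```

Both return the lower-bound position; compilers typically turn the ternary in the second version into a conditional move, which is the property the whole crossover discussion hinges on.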

On Tiger Lake the ratio is smaller than on the others, probably because of
Intel's slow "mobile" memory controller

    This particular machine has 8GB DDR4 soldered in plus 32GB DDR4 on a
    separate DIMM, so the memory controller may be faultless.

    So far I don't see how your measurements disprove my theory of page
    table not fitting in L2 cache.

    I did not try to disprove that; if I did, I would try to use huge
    pages (the 1G ones if possible) and see how that changes the results.

    But somehow I fail to see why the page table walks should make much of
    a difference.

    If I find the time, I would like to see how branchless with software prefetching performs. And I would like to put all of that, including
    your code, online. Do I have your permission to do so?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Feb 24 22:22:01 2026
    From Newsgroup: comp.arch

    On Tue, 24 Feb 2026 17:30:31 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    If I find the time, I would like to see how branchless with software prefetching performs. And I would like to put all of that, including
    your code, online. Do I have your permission to do so?

    - anton

    No problems.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Feb 25 07:33:06 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Let me better explain what I was trying to set up, then you can tell me
where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level parallelism of the hardware. If the records are in RAM, the hardware prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I have not looked closely at hardware prefetchers, but I am sure that
    their designers have taken into account how far they have to prefetch
    such that the loads they prefetch for usually see a short latency.

    But prefetchers may not play a role anyway: A large number of (more
    important) demand loads will probably suppress the prefetcher. But
    that's no problem: modern uarchs tend to support enough outstanding
    loads to utilize the full bandwidth available to the core. E.g., Zen4
    has 64GB/s bandwidth from one core (actually one CCX) to the memory controllers, i.e., it can deliver 1 cache line/ns to the core from
    RAM. It also has 88 entries in the load-execution queue, which are
    good to cover 88*1ns=88ns of latency, which is roughly the latency one
sees from DDR5 DIMMs; with the higher-latency LPDDR5 a single core may
not be able to make use of all the available bandwidth.
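The 88-entries-cover-88ns argument is an instance of Little's law: the number of cache-line requests that must be in flight equals bandwidth times latency divided by line size. A sketch with the figures quoted above (the numbers are the post's, not independently measured):

```c
/* Little's law for memory-level parallelism: how many cache-line
   requests must be outstanding to sustain a given bandwidth at a
   given latency. */
double mlp_needed(double bw_bytes_per_ns, double latency_ns, double line_bytes)
{
    return bw_bytes_per_ns * latency_ns / line_bytes;
}
```

mlp_needed(64, 88, 64) gives 88 lines in flight, matching the 88 load-queue entries; at a hypothetical 120ns LPDDR5 latency the requirement rises to 120 lines, more than the queue can cover, which is the claim about a single core not saturating LPDDR5.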

    If there are enough instructions between the loads such that the ROB
    fills before the load-execution queue, demand loads may not be able to
    fully utilize the bandwidth, but that leaves slots open for the
    prefetcher, so the core will still make use of the available
    bandwidth.

    Of course, once you reach the bandwidth limit, loads that queue up
    behind other loads for attention from the memory subsystem will see
    more latency than in an uncongested memory system, but one does not
    consider that to be latency-bound.

    But that's all theoretical considerations; you could try to write a
    program with these characteristics, and see how it performs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Feb 25 08:17:53 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 24 Feb 2026 10:46:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    The question is not whether bandwidth from L2 to AMX is sufficient. It
    is.
The interesting part is what bandwidth from LLC *into* L2 is needed in
the scenario of multiplication of big matrices.
Supposedly 1/3rd of L2 is used for matrix A that stays there for a very
long time, another 1/3rd holds matrix C that remains there for, say, 8
iterations, and the remaining 1/3rd is matrix B that should be reloaded on
    B and C.
Assuming L2 = 2MB and square matrices, N = sqrt(2MB/8B/3) = 295,
rounded down to 288. Each step takes 288**3 = 24M FMA operations. If a
future AMX does 128 FMA per clock then the iteration takes 188K clocks.
288**2 * 8B / 188K = 3.5 B/clock. That's easy for 1 AMX/L2 combo, but
    very hard when you have 64 units like those competing on the same
    LLC.
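Michael S's arithmetic can be replayed directly (the 2MB L2, the 288x288 tiles of 8-byte elements, and the 128 FMA/clock are his assumptions, not published AMX figures):

```c
/* Back-of-envelope AMX bandwidth: with n x n tiles of 8-byte elements
   and a given FMA throughput, how many bytes of matrix B must stream
   from LLC into L2 per clock (B is reloaded on every iteration). */
double b_bytes_per_clock(int n, double fma_per_clock)
{
    double clocks = (double)n * n * n / fma_per_clock; /* FMAs per step */
    return (double)n * n * 8.0 / clocks;               /* B bytes per step */
}
```

b_bytes_per_clock(288, 128) comes out near 3.56 B/clock, the ~3.5 in the post. Note the expression reduces to 8*fma_per_clock/n, so doubling the FMA rate doubles the required LLC-to-L2 bandwidth unless the tile size grows with it.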

    The 64 units are from 64 cores? I have not looked at the kind of
    bandwidth that the 2D grid interconnect between the cores and their L3
    cache slices offers to the cores of the big Xeons, but assuming that
    the bandwidth is insufficient for that, one will probably buy a chip
    with fewer cores if all that the cores are intended to do is to
    multiply matrices.

    Given that these cores are also used for applications other than
    matrix multiplication, and for some of those it makes sense to have
    that many cores (or more) even with that bandwidth limit to L3, I
    cannot fault Intel for building such CPUs. And I also cannot fault
    them for designing cores in a way that some hardware cannot be fully
    utilized by some applications in some of the CPU configurations that
    use that core. As long as there is an application where the hardware
    can be fully utilized for some CPU configuration, and the additional
    hardware benefits this application enough, the hardware feature
    technically makes sense (for commercial sense additional
    considerations apply).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Feb 25 15:07:23 2026
    From Newsgroup: comp.arch

    On Tue, 24 Feb 2026 17:30:31 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 24 Feb 2026 09:50:43 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [...]
    I think that it was said more than once throughout the thread that
    all measurements were taken with npoints=1M and rep=11.

    I obviously missed it, that's why I asked for the parameters earlier.

    And for small rep
    even-size median has non-negligible bias.

    Even-size median is the arithmetic mean between the (n-1)/2-lowest and
    the (n+1)/2-lowest result. What bias do you have in mind?


    You had seen my code. It has no special handling for even nrep.
    It simply takes dt[nrep/2], like in odd case.
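The bias Michael S concedes here: for even nrep, taking dt[nrep/2] from the sorted timings picks the upper of the two middle elements rather than their mean. A sketch of the unbiased version:

```c
/* Median of a sorted array of timings. For even n, average the two
   middle elements; taking dt[n/2] alone biases the result upward. */
double median(const double *dt, int n)
{
    if (n % 2)
        return dt[n / 2];
    return (dt[n / 2 - 1] + dt[n / 2]) / 2.0;
}
```

With nrep=11 (odd) as used in the measurements above, the two definitions agree, which is why the bias does not affect the posted numbers.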


    Here are the results with npoints=1M reps=11:

    Zen4 8700G Tiger Lake 1135G7
    veclen branchless branchy branchless branchy
    100000 0.030187 0.063023 0.038614 0.081332
    140000 0.029779 0.066112 0.034797 0.085259
    200000 0.037532 0.072898 0.045945 0.090583
    280000 0.036677 0.078795 0.042678 0.097386
    400000 0.045777 0.084639 0.055256 0.104249
    560000 0.043092 0.086424 0.055677 0.112252
    800000 0.050325 0.091714 0.078172 0.123927
    1120000 0.048855 0.095756 0.091508 0.141819
    1600000 0.061101 0.099660 0.133560 0.168403
    2240000 0.082970 0.113055 0.166363 0.205129
    3200000 0.120834 0.137753 0.211253 0.235155
    4500000 0.149028 0.160359 0.241716 0.268885
    6400000 0.178675 0.177070 0.294336 0.294931
    9000000 0.219919 0.195292 0.322921 0.326113
    12800000 0.240604 0.209397 0.380076 0.352707
    18000000 0.257400 0.224952 0.409813 0.388834
    25000000 0.293160 0.239217 0.472999 0.413559
    35000000 0.307939 0.257587 0.522223 0.450754
    50000000 0.346399 0.271420 0.595265 0.478713
    70000000 0.375525 0.286462 0.626350 0.516533
    100000000 0.401462 0.305444 0.699097 0.555928
    140000000 0.419590 0.322480 0.732551 0.611815
    200000000 0.451003 0.337880 0.854034 0.647795
    280000000 0.468193 0.359133 0.914159 0.729234
    400000000 0.506300 0.372219 1.150605 0.773266
    560000000 0.520346 0.390277 1.167567 0.851592
    800000000 0.556292 0.412035 1.799293 0.899768
    1100000000 0.572654 0.432598 1.861136 0.963649
    1600000000 0.618735 0.450526 2.062255 1.026560
    2000000000 0.629088 0.465219 2.132035 1.073728

Intuitively, I don't like measurements with so small an npoints
parameter, but objectively they are unlikely to be much different
from 1M points.

    It's pretty similar, but the first point where branchy is faster is veclen=6.4M for the 8700G and still 12.8M for the 1135G7; on the
    1135G7 the 6.4M and 9M veclens are pretty close for both npoints.


    Here are my Raptor Lake results (Raptor Cove and Gracemont)
    i7-14700-P i7-14700-E
    veclen branchless branchy branchless branchy
    100000 0.025580 0.065588 0.055901 0.084580
    140000 0.022006 0.067718 0.055513 0.088897
    200000 0.029240 0.071393 0.066058 0.092738
    280000 0.026032 0.074760 0.065444 0.096932
    400000 0.034791 0.082914 0.076366 0.101067
    560000 0.033920 0.089877 0.078420 0.107085
    800000 0.046743 0.098518 0.086972 0.115900
    1120000 0.045401 0.108028 0.088823 0.125935
    1600000 0.059368 0.119298 0.112489 0.139820
    2240000 0.060052 0.127647 0.121007 0.154747
    3200000 0.081294 0.140325 0.151804 0.173241
    4500000 0.090567 0.158380 0.169017 0.198011
    6400000 0.134527 0.184986 0.221871 0.228213
    9000000 0.139166 0.216010 0.247091 0.265844
    12800000 0.158295 0.244137 0.314549 0.302832
    18000000 0.198464 0.275680 0.342068 0.336601
    25000000 0.232103 0.296431 0.425518 0.369543
    35000000 0.249135 0.318954 0.431165 0.404154
    50000000 0.291809 0.342103 0.537924 0.441044
    70000000 0.312596 0.366850 0.548441 0.473681
    100000000 0.366150 0.390468 0.672164 0.511968
    140000000 0.365590 0.418052 0.684060 0.547667
    200000000 0.414727 0.453255 0.814007 0.592588
    280000000 0.470445 0.476864 0.841394 0.642071
    400000000 0.497190 0.511605 0.975004 0.702385
    560000000 0.524168 0.545820 1.006532 0.771940
    800000000 0.555521 0.577105 1.178287 0.857243
    900000000 0.587890 0.595592 1.197315 0.881594
    1000000000 0.605113 0.605211 1.221586 0.901517
    1100000000 0.582061 0.632543 1.228431 0.939851
    1200000000 0.623949 0.641601 1.262782 0.953194
    1300000000 0.653287 0.649376 1.283567 0.971780
    1400000000 0.758616 0.651059 1.363853 0.990429
    1500000000 0.675992 0.656111 1.387015 1.009316
    1600000000 0.686154 0.680118 1.424667 1.025686
    1700000000 0.692339 0.693832 1.428070 1.036490
    1800000000 0.703114 0.694311 1.440372 1.054245
    1900000000 0.693015 0.690148 1.446149 1.063155
    2000000000 0.688864 0.714125 1.468993 1.088992

    And here is Coffee Lake
    Xeon E-2176G
    veclen branchless branchy
    100000 0.047010 0.091787
    140000 0.045510 0.097623
    200000 0.055983 0.104018
    280000 0.055245 0.109927
    400000 0.067691 0.116034
    560000 0.070107 0.121319
    800000 0.082146 0.127841
    1120000 0.086229 0.138181
    1600000 0.111000 0.153429
    2240000 0.120780 0.175006
    3200000 0.151256 0.199086
    4500000 0.164724 0.223560
    6400000 0.202989 0.244470
    9000000 0.215630 0.268854
    12800000 0.268318 0.290898
    18000000 0.281772 0.315263
    25000000 0.340269 0.336805
    35000000 0.350421 0.360242
    50000000 0.426758 0.385225
    70000000 0.432946 0.411350
    100000000 0.509785 0.441351
    140000000 0.528119 0.472827
    200000000 0.618676 0.510482
    280000000 0.643984 0.548418
    400000000 0.747189 0.590660
    560000000 0.782344 0.628088
    800000000 0.897024 0.671247
    900000000 0.911065 0.682892
    1000000000 0.930987 0.692015
    1100000000 0.920161 0.708865
    1200000000 0.940487 0.718227
    1300000000 0.966389 0.728277
    1400000000 0.984855 0.737065
    1500000000 1.025050 0.745294
    1600000000 1.048256 0.756237
    1700000000 1.064021 0.761571
    1800000000 1.063617 0.765608
    1900000000 1.068122 0.773536
    2000000000 1.074109 0.778483


    My current theory why the crossover is so late is twofold:

1) branchy performs a lot of misspeculated accesses, which reduce the
cache hit rate (and increase the cache miss rate) already at
relatively low npoints (as shown in <2026Feb24.105043@mips.complang.tuwien.ac.at>), and likely also for
low veclen with high npoints. E.g., already npoints=4000 has >2M
fills from memory on the 8700G, and npoints=500 has >2M
    LLC-load-misses at 1135G7, whereas the corresponding numbers for
    branchless are npoints=16000 and npoints=4000. This means that
    branchy does not just suffer from the misprediction penalty, but also
    from fewer cache hits for the middle stages of the search. How strong
    this effect is depends on the memory subsystem of the CPU core.

    2) In the last levels of the larger searches the better memory-level parallelism of branchy leads to branchy catching up and eventually
    overtaking branchless, but it has to first compensate for the slowness
    in the earlier levels before it can reach crossover. That's why we
    see the crossover point at significantly larger sizes than L3.

On Tiger Lake the ratio is smaller than on the others, probably because of
Intel's slow "mobile" memory controller

    This particular machine has 8GB DDR4 soldered in plus 32GB DDR4 on a
    separate DIMM, so the memory controller may be faultless.


Do you mean that dealing with two different types of memory is an
objectively hard job?
Or that the latency of the 1135G7 is not too bad?
I can agree with the former, but not with the latter.
"Branchless" results for veclen in the 2M-10M range, which should be a
good indication of the latency of main RAM, are just not good on
this gear. And the slope of the duration vs. log(veclen) graph taken in this
range, which is probably an even better indication of memory latency,
is also not good.
It all looks like memory latency on this TGL is ~90ns, much higher than
on the rest of them. That is, except for Gracemont, where the very high
measured latency is probably caused by some sort of power-saving policy
and not related to the memory controller itself.

    So far I don't see how your measurements disprove my theory of page
    table not fitting in L2 cache.

    I did not try to disprove that; if I did, I would try to use huge
    pages (the 1G ones if possible) and see how that changes the results.


    That sounds like an excellent idea.

    But somehow I fail to see why the page table walks should make much of
    a difference.

    If I find the time, I would like to see how branchless with software prefetching performs. And I would like to put all of that, including
    your code, online. Do I have your permission to do so?

    - anton

Doesn't SW prefetch turn itself into a NOP on a TLB miss?
I remember reading something of that sort, but don't remember what CPU
was discussed.
Unfortunately, documentation rarely provides this sort of detail.
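The "branchless with software prefetching" Anton proposes above might look like this hypothetical sketch (not Anton's actual code; __builtin_prefetch is the GCC/Clang intrinsic): at each level it prefetches both possible midpoints of the next level, so the next load hits regardless of which way the comparison goes.

```c
#include <stddef.h>

/* Branchless lower-bound search with software prefetch of both
   candidate midpoints for the next level. Whether the prefetch is
   dropped on a TLB miss is, as noted above, rarely documented. */
size_t search_prefetch(const int *a, size_t n, int key)
{
    const int *base = a;
    if (n == 0)
        return 0;
    while (n > 1) {
        size_t half = n / 2;
        size_t nh = (n - half) / 2;      /* next level's half */
        if (nh) {                        /* both possible next probes */
            __builtin_prefetch(&base[nh - 1]);
            __builtin_prefetch(&base[half + nh - 1]);
        }
        base = (base[half - 1] < key) ? base + half : base;
        n -= half;
    }
    return (size_t)(base - a) + (base[0] < key);
}
```

The prefetches overlap the next level's memory latency with the current level's load, at the cost of doubling the number of lines touched, so whether it wins depends on the same bandwidth-vs-latency tradeoff discussed in this thread.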





    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Feb 26 09:08:29 2026
    From Newsgroup: comp.arch

    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level parallelism of the hardware. If the records are in RAM, the hardware prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited. Think of it this way. In the current situation, increasing the memory system bandwidth, say by hypothetically increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand,
    doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance. To me, that is the definition
    of being latency bound, not bandwidth bound.

    Perhaps this distinction is clearer to me due to my background in the
    (hard) disk business. You want lower latency? Make the arm move faster
    or spin the disk faster. You want higher bandwidth? Put more bits on a
    track or interleave the data across multiple disk heads. And in a
    system, the number of active prefetches is naturally limited by the
    number of disk arms you have.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Feb 26 14:54:53 2026
    From Newsgroup: comp.arch

    On 2/24/2026 5:25 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and bandwidth
    bound.

    If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can
    help.

Many occur in commercial (i.e. non-scientific) programs, such as
database systems. For example, imagine a company employee file (table),
with a (say 300-byte) record for each of its many thousands of employees,
each containing typical employee stuff. Now suppose someone wants to
know "What is the total salary of all the employees in the "Sales"
department?". With no index on "department", but with it at a fixed
displacement within each record, the code looks at each record, does a
trivial test on it, perhaps adds to a register, then goes to the next
record. This is almost certainly memory latency bound.

    If the records are stored sequentially, either because the programming language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

    If the records are stored randomly, but are pointed to by an array,
    one can prefetch the relevant fields easily, again turning the problem
    into a latency-bound problem. If, OTOH, the records are stored
    randomly and are in a linked list, this problem is a case of
    pointer-chasing and is indeed latency-bound.
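The "records stored randomly but pointed to by an array" case can be prefetched by running a fixed distance ahead in the pointer array; a sketch of the salary scan from the example above (record layout, field names, and the prefetch distance are made up for illustration; __builtin_prefetch is the GCC/Clang intrinsic):

```c
#include <stddef.h>

#define PF_DIST 16  /* how far ahead to prefetch; tune per machine */

struct employee {        /* stand-in for the ~300-byte record */
    char pad[256];
    int dept;
    double salary;
};

/* Sum salaries of one department; rec[] is an array of pointers to
   randomly placed records, so demand loads alone would be latency
   bound, but prefetching rec[i + PF_DIST] ahead of time restores
   memory-level parallelism. */
double total_sales_salary(struct employee **rec, size_t n, int sales)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&rec[i + PF_DIST]->dept);
        if (rec[i]->dept == sales)
            sum += rec[i]->salary;
    }
    return sum;
}
```

The key point is that the prefetch address comes from the pointer array, not from a previously loaded record, so it does not sit on the dependence chain the way a linked-list traversal would.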

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.


    FWIW:

    IME, code with fairly random access patterns to memory, and lots of
    cache misses, is inherently slow; even on big/fancy OoO chips. Seemingly
    about the only real hope the CPU has is to have a large cache and just
    hope that the data happens to be in the cache (and has been accessed previously or sufficiently recently) else it is just kinda SOL.

If there is some way that CPUs can guess what memory they need in
advance and fetch it beforehand, I have not seen much evidence of this personally.

    Rather, as can be noted, memory access patterns can often make a fairly
    large impact on the performance of some algorithms.


    Like, for example, decoding a PNG like format vs a JPG like format:
    PNG decoding typically processes the image as several major phases:
    Decompress the Deflate-compressed buffer into memory;
    Walk over the image, running scanline filters,
    copying scanlines into a new (output) buffer.

    Even if the parts, taken in isolation, should be fast:
    The image buffers are frequently too large to fit in cache;
    Cache misses tend to make PNG decoding painfully slow,
    even when using faster filters.
    If using the Paeth filter though, this adds extra slowness,
    due to branch-predictor misses.
    On targets like x86,
    the filter is frequently implemented using branches;
    The branch miss rate is very high.
    So, a naive branching version, performs like dog crap.

    So, net result: Despite its conceptual simplicity, PNG's decode-time performance typically sucks.

    Contrast, a decoder for a JPEG like format can be made to process one
    block at a time and go all the way to final output. So, JPEG is often
    faster despite the more complex process (with transform stages and a colorspace transform).


    The Paeth filter slowness does seem a little odd though:
    Theoretically, a CPU could turn a short forward branch into predication;
    But, this doesn't tend to be the case.

    It then is faster to turn the filter into some convoluted mess of
    arithmetic and masking in an attempt to reduce the branch mispredict costs.
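The "convoluted mess of arithmetic and masking" can be fairly compact for the Paeth predictor; a sketch of one branchless formulation (not taken from any particular decoder):

```c
#include <stdlib.h>

/* Paeth predictor (PNG filter type 4), branchless: compute the three
   distances and select with arithmetic masks instead of the usual
   if/else chain that mispredicts so badly. */
unsigned char paeth(unsigned char a, unsigned char b, unsigned char c)
{
    int p = a + b - c;           /* initial estimate */
    int pa = abs(p - a);
    int pb = abs(p - b);
    int pc = abs(p - c);
    /* masks are all-ones when the predictor picks that input */
    int use_a = -((pa <= pb) & (pa <= pc));
    int use_b = ~use_a & -(pb <= pc);
    return (unsigned char)((a & use_a) | (b & use_b)
                           | (c & ~(use_a | use_b)));
}
```

The comparisons compile to flag-setting instructions rather than branches, trading a few extra ALU ops for the removal of two hard-to-predict branches per byte.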

    ...

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Feb 27 09:52:46 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
    prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited.

    I mentioned several limits. Which one do you have in mind?

Think of it this way. In the current
situation, increasing the memory system bandwidth, say by hypothetically
increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand,
doing things to reduce the memory latency (say hypothetically a faster
ram cell), would improve the performance.

    If the CPU is designed to provide enough memory-level parallelism to
    make use of the bandwidth (and that is likely, otherwise why provide
    that much bandwidth), then once the designers spend money on
    increasing the bandwidth, they will also spend the money necessary to
    increase the MLP. Concerning a reduction in latency, that would not
    increase performance, because this application is already working at
    the bandwidth limit.

    I feel the urge to write up a mock variant of your use case and
    measure whether reality confirms my expectations, but currently have
    no time for that.

    But let's take a slightly simpler case for which I already have a
    program: Walk through a 32MB (not MiB in this case) array on a Zen4
    with 16MB L3 linearly with different strides, repeatedly with 100M
    total accesses; the loads in this case are dependent (i.e., pointer
    chasing), but at least for some of the strides that may not matter,
    because the bandwidth limits the performance:

    for i in 128 64 56 48 40 32; do LC_NUMERIC=prog perf stat -e cycles:u ./memory1 linear $[32000000/$i] $i; done

    The resulting cycles for 100M memory accesses are:

    stride cycles:u s (user) bandwidth
    128 3_338_916_538 0.666553000 9.6B/ns
    64 480_517_794 0.100506000 63.7B/ns
    56 457_623_521 0.094127000 59.5B/ns
    48 443_545_135 0.086447000 55.5B/ns
    40 436_435_599 0.086920000 46.0B/ns
    32 427_891_293 0.089242000 35.9B/ns

The bandwidth is computed based on 64B cache lines, so with stride up
to 64, stride*100M bytes are accessed, while with stride=128, 64*100M
bytes are accessed.
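The memory1 program itself is not shown; a hypothetical reconstruction of a "linear" dependent-load walk of this kind (not Anton's actual code) builds a cyclic pointer chain with the given stride and chases it:

```c
#include <stddef.h>
#include <stdlib.h>

/* Build a cyclic chain of pointers with the given stride (in bytes,
   a multiple of sizeof(void *)) through a buffer of `bytes` bytes,
   then perform `accesses` dependent loads by chasing it. Returns the
   final position (in pointer slots) so the loads cannot be
   optimized away. */
size_t chase(size_t bytes, size_t stride, long accesses)
{
    size_t n = bytes / stride;               /* elements in the cycle */
    size_t step = stride / sizeof(void *);   /* stride in pointer slots */
    void **buf = malloc(n * stride);
    for (size_t i = 0; i < n; i++)           /* element i points to i+1 */
        buf[i * step] = &buf[((i + 1) % n) * step];
    void **p = buf;
    for (long i = 0; i < accesses; i++)      /* each load depends on the last */
        p = (void **)*p;
    size_t pos = (size_t)(p - buf);
    free(buf);
    return pos;
}
```

Because each load's address comes from the previous load, the chain is latency bound unless the hardware prefetcher recognizes the linear stride, which is exactly the regime the cycle counts above probe.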

    The limits for this hardware are 64B/ns bandwidth (from the
    interconnect between the CCX and the rest of the system), and 4 cycles
    minimum load latency (resulting in at least 400M cycles for the 100M
    dependent memory accesses). Ideally one would see something like (at
    5GHz):

    stride cycles:u s (user) bandwidth
    128 500_000_000 0.1 64B/ns
    64 500_000_000 0.1 64B/ns
    56 437_500_000 0.0875 64B/ns
    48 400_000_000 0.08 60B/ns
    40 400_000_000 0.08 50B/ns
    32 400_000_000 0.08 40B/ns

    In practice we see values close to this only for stride=64. For
    smaller strides, the cycles and seconds are a little higher, and the
    bandwidth a little lower.

    For stride=128, the results are much worse. Apparently the memory
    controllers are very good when the accessed lines are successive, but
    much worse when there are holes between the accesses.

    For your scenario with 300B records and 1-2 accesses inside them, we
    will probably see similar slowdowns as for the stride=128 case. I
    have no time for trying this benchmark with only one DIMM (which would
    reduce the maximum bandwidth to about 41GB/s), but given the actual
    observed bandwidth, I doubt that this would change much for the
    stride=128 case.

    Concerning reducing the latency, that obviously would not help in the
    stride=64 case (it is at the bandwidth limit), but for stride=128 it
    might help. Working with a 12MB working set gives:

    stride cycles:u s (user) bandwidth
    192 710_625_989 0.146530000 43.7B/ns
    128 733_132_291 0.154726446 41.4B/ns
    64 425_149_972 0.094751987 67.5B/ns
    56 417_703_739 0.093068039 60.2B/ns
    48 415_704_043 0.092887846 51.7B/ns
    40 415_855_110 0.091790911 43.6B/ns
    32 406_409_135 0.091095578 35.1B/ns

So even with only L3 involved, stride>64 sees a slowdown, although a
much smaller one. The stride <= 64 cases are relatively close to
    the limit of 400M cycles, but the clock speed is relatively low (maybe
    due to the power limit of 25W).

    Using independent instead of dependent accesses would make this
    benchmark much closer to your use case, and would make it possible to
    use MLP for demand loads rather than just for hardware prefetchers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Feb 27 18:55:23 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level parallelism of the hardware. If the records are in RAM, the hardware prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited. Think of it this way. In the current situation, increasing the memory system bandwidth, say by hypothetically increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand, doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance. To me, that is the definition
    of being latency bound, not bandwidth bound.

    I agree:

Many times, increasing memory BW causes an O(ln(BW)) increase in memory latency. For example, when a supercomputer multiplies the number of memory banks, it adds a clock of latency between the CPUs and memory, and adds
a second clock of delay on the way back. 16x the memory banks would add another 2 clocks.

If adding banks causes any degradation in performance you are, by
definition, latency bound on the application losing performance.

    Perhaps this distinction is clearer to me due to my background in the
    (hard) disk business. You want lower latency? Make the arm move faster
    or spin the disk faster. You want higher bandwidth? Put more bits on a track or interleave the data across multiple disk heads. And in a
    system, the number of active prefetches is naturally limited by the
    number of disk arms you have.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Feb 27 19:04:42 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 2/24/2026 5:25 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and bandwidth bound.

    If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can help.

    Many occur in commercial (i.e. non scientific) programs, such as
    database systems.  For example, imagine a company employee file (table), with a (say 300 byte) record for each of its many thousands of employees, each containing typical employee stuff.  Now suppose someone wants to
    know "What is the total salary of all the employees in the "Sales"
    department. With no index on "department", but it is at a fixed
    displacement within each record, the code looks at each record, does a
    trivial test on it, perhaps adds to a register, then goes to the next
    record. This is almost certainly memory latency bound.

    If the records are stored sequentially, either because the programming language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

    If the records are stored randomly, but are pointed to by an array,
    one can prefetch the relevant fields easily, again turning the problem
    into a bandwidth-bound problem. If, OTOH, the records are stored
    randomly and are in a linked list, this problem is a case of pointer-chasing and is indeed latency-bound.
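    The array-of-pointers case can be sketched as follows (a minimal illustration, assuming GCC/Clang's __builtin_prefetch; the record layout, field names, and prefetch distance are made up for the example):

```c
#include <stddef.h>

/* Hypothetical ~300-byte record with fields at fixed offsets. */
struct employee { int department; int salary; char other[292]; };

enum { PF_DIST = 16 };   /* illustrative prefetch distance, needs tuning */

/* Sum salaries of one department over records reached through a pointer
   array.  Because the addresses are known in advance, each iteration can
   issue a prefetch PF_DIST records ahead, overlapping the cache misses
   instead of serializing on them. */
long sum_salaries(struct employee *const *recs, size_t n, int dept)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&recs[i + PF_DIST]->salary, 0, 0);
        if (recs[i]->department == dept)
            total += recs[i]->salary;
    }
    return total;
}
```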

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.


    FWIW:

    IME, code with fairly random access patterns to memory, and lots of
    cache misses, is inherently slow; even on big/fancy OoO chips.

    There is no ILP when you are sitting waiting on memory.

    Seemingly about the only real hope the CPU has is to have a large cache and just
    hope that the data happens to be in the cache (and has been accessed previously or sufficiently recently) else it is just kinda SOL.

    If there is some way that CPU's can guess what memory they need in
    advance and fetch it beforehand, I have not seen much evidence of this personally.

    I built such a memory system (circa 2003) and it worked wonderfully
    on first 1B cycles when the application was building its data set for
    the first time. Later, as the DAG was manipulated, there were no
    predictors that helped <much>.

    Rather, as can be noted, memory access patterns can often make a fairly large impact on the performance of some algorithms.

    Which is why serious numerics code is written for specific transpose
    orders. See DGEMM as an example.

    Like, for example, decoding a PNG like format vs a JPG like format:
    PNG decoding typically processes the image as several major phases:
    Decompress the Deflate-compressed buffer into memory;
    Walk over the image, running scanline filters,
    copying scanlines into a new (output) buffer.

    FFT has the property that, sooner or later, every next fetch of the
    matrix entry takes a cache miss. Sometimes this is at the beginning
    (Decimation in time) sometimes at the end (Decimation in frequency)
    sometimes in the middle (access pattern is congruent to cache size).

    With matrixes of just the right size, one can achieve a TLB miss on
    every 8th access.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Feb 27 19:31:41 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    With matrixes of just the right size, one can achieve a TLB miss on
    every 8th access.

    Which is why people copy parts of the matrices they multiply
    into separate blocks. If the sizes fit the cache hierarchy,
    it is an excellent tradeoff even though the number operations
    nominally increases.
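    A minimal sketch of the copy-into-blocks idea (the block size is illustrative, and the sketch assumes the matrix dimension is a multiple of the block size):

```c
#include <stddef.h>
#include <string.h>

enum { B = 4 };  /* illustrative block size; real code tunes this to the caches */

/* C[n][n] += A[n][n] * Bm[n][n], copying the current A and B blocks into
   small contiguous buffers first.  The copies add O(n^2) work per block
   pass, but the inner loops then walk cache-resident memory.
   Assumes n is a multiple of B and c starts zeroed (or holds the value
   to accumulate into). */
void matmul_blocked(size_t n, const double *a, const double *bm, double *c)
{
    double ablk[B][B], bblk[B][B];
    for (size_t i0 = 0; i0 < n; i0 += B)
        for (size_t k0 = 0; k0 < n; k0 += B) {
            for (size_t i = 0; i < B; i++)      /* copy A block */
                memcpy(ablk[i], &a[(i0+i)*n + k0], B * sizeof(double));
            for (size_t j0 = 0; j0 < n; j0 += B) {
                for (size_t k = 0; k < B; k++)  /* copy B block */
                    memcpy(bblk[k], &bm[(k0+k)*n + j0], B * sizeof(double));
                for (size_t i = 0; i < B; i++)
                    for (size_t j = 0; j < B; j++) {
                        double s = c[(i0+i)*n + (j0+j)];
                        for (size_t k = 0; k < B; k++)
                            s += ablk[i][k] * bblk[k][j];
                        c[(i0+i)*n + (j0+j)] = s;
                    }
            }
        }
}
```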
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Feb 28 16:48:39 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 2/24/2026 5:25 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated?  Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and bandwidth bound.

    If a problem is bandwidth-bound, then differences between conventional
    architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can help.

    Many occur in commercial (i.e. non scientific) programs, such as
    database systems.  For example, imagine a company employee file (table), with a (say 300 byte) record for each of its many thousands of employees, each containing typical employee stuff.  Now suppose someone wants to
    know "What is the total salary of all the employees in the "Sales"
    department.  With no index on "department", but it is at a fixed
    displacement within each record, the code looks at each record, does a
    trivial test on it, perhaps adds to a register, then goes to the next
    record.  This is almost certainly memory latency bound.

    If the records are stored sequentially, either because the programming
    language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

    If the records are stored randomly, but are pointed to by an array,
    one can prefetch the relevant fields easily, again turning the problem
    into a bandwidth-bound problem.  If, OTOH, the records are stored
    randomly and are in a linked list, this problem is a case of
    pointer-chasing and is indeed latency-bound.

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.


    FWIW:

    IME, code with fairly random access patterns to memory, and lots of
    cache misses, is inherently slow; even on big/fancy OoO chips.
    Seemingly about the only real hope the CPU has is to have a large cache
    and just hope that the data happens to be in the cache (and has been
    accessed previously or sufficiently recently) else it is just kinda SOL.

    If there is some way that CPU's can guess what memory they need in
    advance and fetch it beforehand, I have not seen much evidence of this
    personally.

    Rather, as can be noted, memory access patterns can often make a fairly large impact on the performance of some algorithms.


    Like, for example, decoding a PNG like format vs a JPG like format:
      PNG decoding typically processes the image as several major phases:
        Decompress the Deflate-compressed buffer into memory;
        Walk over the image, running scanline filters,
          copying scanlines into a new (output) buffer.
    Could you have a secondary thread that started as soon as one (or a
    small number of) scanline(s) were available, taking advantage of any
    shared $L3 cache to grab the data before it is blown away?

    Even if the parts, taken in isolation, should be fast:
      The image buffers are frequently too large to fit in cache;
      Cache misses tend to make PNG decoding painfully slow,
        even when using faster filters.
        If using the Paeth filter though, this adds extra slowness,
          due to branch-predictor misses.
          On targets like x86,
            the filter is frequently implemented using branches;
            The branch miss rate is very high.
            So, a naive branching version performs like dog crap.
    This reminds me of CABAC decoding in h264, where the output of the
    arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.

    So, net result: Despite its conceptual simplicity, PNG's decode-time performance typically sucks.

    Contrast, a decoder for a JPEG like format can be made to process one
    block at a time and go all the way to final output. So, JPEG is often
    faster despite the more complex process (with transform stages and a colorspace transform).


    The Paeth filter slowness does seem a little odd though:
    Theoretically, a CPU could turn a short forward branch into predication;
    But, this doesn't tend to be the case.

    It then is faster to turn the filter into some convoluted mess of
    arithmetic and masking in an attempt to reduce the branch mispredict costs.
    I would look for a way to handle multiple pixels at once, with SIMD
    code: There the masking/combining is typically the easiest way to
    implement short branches.
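    For reference, the scalar form of that mask-and-combine trick, next to the naive branching Paeth predictor (a sketch; no claim is made about which is faster on any given CPU):

```c
#include <stdlib.h>

/* Reference PNG Paeth predictor in the usual three-branch shape
   (this is the form that mispredicts badly on noisy image data). */
int paeth_ref(int a, int b, int c)
{
    int p  = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Branch-free variant: each comparison becomes an all-ones/all-zeros
   mask (0 or -1 in two's complement) and the result is selected with
   AND/OR.  This is the scalar analogue of the per-lane SIMD masking. */
int paeth_nobranch(int a, int b, int c)
{
    int p  = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    int use_a = -((pa <= pb) & (pa <= pc));  /* 0 or all-ones */
    int use_b = -(pb <= pc) & ~use_a;
    int use_c = ~(use_a | use_b);
    return (a & use_a) | (b & use_b) | (c & use_c);
}
```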
    (I might take a look at png decoding at some point)
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Feb 28 10:08:07 2026
    From Newsgroup: comp.arch

    On 2/27/2026 1:52 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong.  I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
    prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited.

    I mentioned several limits. Which one do you have in mind?

    The one you mentioned in your last paragraph, specifically,
    the limit of memory-level parallelism of the hardware.


    Think of it this way. In the current
    situation, increasing the memory system bandwidth, say by hypothetically
    increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand,
    doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance.

    If the CPU is designed to provide enough memory-level parallelism to
    make use of the bandwidth (and that is likely, otherwise why provide
    that much bandwidth), then once the designers spend money on
    increasing the bandwidth, they will also spend the money necessary to increase the MLP.

    No. The memory system throughput depends upon the access pattern. It
    is easier/lower cost to increase the throughput for sequential accesses
    than random (think wider interfaces, cache blocks larger than the amount accessed, etc.) But optimization for sequential workloads can actually
    hurt performance for random workloads, e.g. larger block sizes reduce
    the number of accesses for sequential workloads, but each access takes
    longer, thus hurting random workloads. So designers aim to maximize the throughput, subject to cost and technology constraints, for some mix of sequential (bandwidth) versus random (latency) access.



    Concerning a reduction in latency, that would not
    increase performance, because this application is already working at
    the bandwidth limit.

    Again, perhaps terminology. To me, it has not maxed out the bandwidth.
    It has maxed out the throughput for this application. But throughput
    has components of latency and bandwidth. They are different, but both
    are important and it is useful for a designer to think of them separately.



    I feel the urge to write up a mock variant of your use case and
    measure whether reality confirms my expectations, but currently have
    no time for that.

    OK.

    But let's take a slightly simpler case for which I already have a
    program:

    I will spend some time necessary to analyze this in more detail, but my impression is that you are using the term "bandwidth" for what I would
    call "throughput". Throughput has two components, bandwidth and
    latency. It is useful to think of them separately.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Feb 25 18:32:25 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 24 Feb 2026 17:30:31 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    This particular machine has 8GB DDR4 soldered in plus 32GB DDR4 on a
    separate DIMM, so the memory controller may be faultless.


    Do you mean that dealing with two different types of memory is
    an objectively hard job?

    I think that this setup is likely to provide less memory-level
    parallelism, because probably 80% of the accesses go to only one DRAM
    channel (on the 32GB DIMM), whereas on the 8700G the accesses
    distribute evenly across 4 channels.

    Or that latency of 1135G7 is not too bad?

    I have not measured that before, let's see: Using bplat (pointer
    chasing in a randomized cycle of pointers), I see for the 8700G
    machine with DDR5-5200 and this 1135G7 machine the following
    latencies (in ns):

    size (B) 8700G 1135G7
    1024 0.8 1.2
    2048 0.8 1.2
    4096 0.8 1.2
    8192 0.8 1.2
    16384 0.8 1.2
    32768 0.8 1.2
    65536 2.8 3.3
    131072 2.8 3.4
    262144 2.8 3.4
    524288 3.6 4.2
    1048576 4.5 5.0
    2097152 10.0 11.9
    4194304 10.4 14.6
    8388608 10.3 34.9
    16777216 25.1 72.9
    33554432 77.4 90.4
    67108864 86.8 93.6

    So the main memory latency does not look much worse for the 1135G7
    than for the 8700G. However, in <https://www.complang.tuwien.ac.at/anton/bplat/Results> there are a
    number of machines with significantly better main memory latencies; interestingly, the best result comes from a Rocket Lake machine, a
    close relative of the Tiger Lake in the 1135G7. So you may be right
    about Intel's mobile memory controller configuration.

    It all looks lake memory latency on this TGL is ~90ns

    Right on target:-)

    If I find the time, I would like to see how branchless with software
    prefetching performs. And I would like to put all of that, including
    your code, online. Do I have your permission to do so?

    - anton

    Does not SW prefetch turn itself into NOP on TLB miss?

    I don't know. If the prefetch is for something that will actually be
    fetched, a TLB miss would make the prefetch even more urgent.

    What I expect is that a prefetch to a virtual address that cannot be
    read becomes a noop. Of course, if you are a hardware engineer under
    time pressure, you may take the shortcut of considering the absence
    from the TLB to be an indication of inaccessibility.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Feb 28 21:49:35 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.
    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
    prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited. Think of it this way. In the current situation, increasing the memory system bandwidth, say by hypothetically increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand, doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance. To me, that is the definition
    of being latency bound, not bandwidth bound.

    I agree with your definition, but my prediction is somewhat different.
    First, consider silly program that goes sequentially over larger
    array accessing all lines. AFAICS you should see a tiny effect when
    program uses only one byte from each line compared to using whole
    line. Now consider variant that accesses every fifth line.
    There are differences, one that prefetcher needs to realize that
    there is no need to prefetch intermediate lines. Second difference
    is that one can fetch lines quickly only when they are on a single
    page. Having "step 5" on lines means 5 times as many page crossings.
    I do not know how big are pages in modern DRAM, but at step large
    enough you will see significant delay due to page crossing. I
    would tend to call this delay "latency", but it is somewhat
    murky. Namely, with enough prefetch and enough memory banks
    you can still saturate a single channel to the core (assuming
    that there are many cores, many channels from memory controller
    to memory banks, but only a single channel between memory controller
    and each core). Of course, modern systems tend to have a limited
    number of memory banks, so the argument above is purely theoretical.

    Somewhat different case is when there are independent loads from
    random locations, something like

    for(i = 0; i < N; i++) {
    s += m[f(i)];
    }

    where 'f' is very cheap to compute, but hard to predict by the
    hardware. In the case above the reorder buffer and multiple banks help,
    but even with unlimited CPU resources the maximal number of accesses
    is the number of memory banks divided by the access time of a single bank
    (that is essentially the latency of the memory array).

    Then there is pointer chasing case, like

    for(i = 0; i < N; i++) {
    j = m[j];
    }

    when 'm' is filled with semi-random cyclic pattern this behaves
    quite badly, basically you can start next access only when you
    have result of the previous access. In practice, large 'm'
    seem to produce large number of cache misses for TLB entries.

    Perhaps this distinction is clearer to me due to my background in the
    (hard) disk business. You want lower latency? Make the arm move faster
    or spin the disk faster. You want higher bandwidth? Put more bits on a track or interleave the data across multiple disk heads. And in a
    system, the number of active prefetches is naturally limited by the
    number of disk arms you have.

    That disk analogy is flawed. AFAIK there is no penalty for choosing
    "far away" pages compared to "near" ones (if any the opposite: Row
    Hammer shows that accesses to "near" pages mean that given page may
    need more frequent refresh). In case of memory, time spent in
    memory controller is non-negligible, at least for accesses within
    single page. AFAIK random access to lines within a single page
    costs no more than sequential access, for disk you want sequential
    access to a single track.

    Consumer systems usually had only a single head assembly, while
    even low end systems have multiple prefetch buffers.
    --
    Waldek Hebisch
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Mar 1 05:39:13 2026
    From Newsgroup: comp.arch

    On 2/28/2026 9:48 AM, Terje Mathisen wrote:
    BGB wrote:
    On 2/24/2026 5:25 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated?  Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and
    bandwidth
    bound.

    If a problem is bandwidth-bound, then differences between conventional
    architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can
    help.

    Many occur in commercial (i.e. non scientific) programs, such as
    database systems.  For example, imagine a company employee file
    (table),
    with a (say 300 byte) record for each of its many thousands of
    employees
    each containing typical employee stuff.  Now suppose someone wants to
    know "What is the total salary of all the employees in the "Sales"
    department.  With no index on "department", but it is at a fixed
    displacement within each record, the code looks at each record, does a
    trivial test on it, perhaps adds to a register, then goes to the next
    record.  This is almost certainly memory latency bound.

    If the records are stored sequentially, either because the programming
    language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

    If the records are stored randomly, but are pointed to by an array,
    one can prefetch the relevant fields easily, again turning the problem
    into a bandwidth-bound problem.  If, OTOH, the records are stored
    randomly and are in a linked list, this problem is a case of
    pointer-chasing and is indeed latency-bound.

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.


    FWIW:

    IME, code with fairly random access patterns to memory, and lots of
    cache misses, is inherently slow; even on big/fancy OoO chips.
    Seemingly about the only real hope the CPU has is to have a large
    cache and just hope that the data happens to be in the cache (and has
    been accessed previously or sufficiently recently) else it is just
    kinda SOL.

    If there is some way that CPU's can guess what memory they need in
    advance and fetch it beforehand, I have not seen much evidence of this
    personally.

    Rather, as can be noted, memory access patterns can often make a
    fairly large impact on the performance of some algorithms.


    Like, for example, decoding a PNG like format vs a JPG like format:
      PNG decoding typically processes the image as several major phases:
        Decompress the Deflate-compressed buffer into memory;
        Walk over the image, running scanline filters,
          copying scanlines into a new (output) buffer.

    Could you have a secondary thread that started as soon as one (or a
    small number of) scanline(s) were available, taking advantage of any
    shared $L3 cache to grab the data before it is blown away?


    It is possible to make it piecewise and incremental, but making the
    inflater incremental (and fast) is more complexity and difficulty than
    making a JPEG-like format that is both fast and lossless.

    Though, zlib supports incremental decoding of the type that would be
    useful here, but using zlib this way isn't particularly fast.


    Ironically, if not for cache misses, PNG (and JPEG 2000) would likely be
    some of the faster image codecs around. But, both PNG and JP2K suffer
    from some similar issues here.


    Though, if you stick a Haar of CDF5/3 wavelet in a small fixed-size
    block, it works well (and is faster than if applied over larger
    raster-ordered planes).

    Though, if designing a new codec to optimize for this, could make sense
    to increase the block size from 8x8 to 16x16; with the effective
    macroblock size increased to 32x32.

    Mostly because this is still small enough to probably fit in the L1
    cache on most CPUs (assuming one isn't wasting too much memory on the
    entropy coder or similar).

    Though, likely 32x32 blocks would be too big, and would push the decoder outside of "mostly fits in L1 cache" territory.


    My UPIC format stayed with 8x8 blocks though:
    8x8 blocks, 16x16 macroblocks (still 16x16 for 4:4:4).
    When there are four 8x8 blocks, they are encoded in Hilbert order.
    Used:
    STF+AdRice
    Z3V5 VLN's
    Encoded similar to Deflate distances,
    zigzag sign-folding for values.
    Block Transforms (as layered 1D transforms):
    BHT: Block Haar.
    CDF 5/3
    WHT (does poorly on average)
    DCT (lossy only)
    Colorspaces:
    RCT
    YCoCg-R
    YCbCr (Approx, Lossy Only)


    At lossless and high quality:
    BHT and CDF5/3 dominate;
    RCT and YCoCg dominate.

    Where, DCT and YCbCr mostly pull ahead at medium to low quality levels
    and for photo-like images.

    For lossy compression of cartoon-like graphics, CDF 5/3 often wins.


    Have noted:
    Is competitive with T.81 JPEG for compression ratio;
    Is slightly faster for decompression;
    For many lossless images,
    tends to beat PNG for both compression and speed.
    Though for many "very artificial" images:
    Such as for UI graphics,
    PNG still wins for compression.


    Compared with PNG, it has a relative weakness in that it isn't as
    effective with repeating patterns and large flat-colored areas. Both of
    these can benefit more from LZ compression. Though, partly QOI shares
    the same issue.



    Even if the parts, taken in isolation, should be fast:
      The image buffers are frequently too large to fit in cache;
      Cache misses tend to make PNG decoding painfully slow,
        even when using faster filters.
        If using the Paeth filter though, this adds extra slowness,
          due to branch-predictor misses.
          On targets like x86,
            the filter is frequently implemented using branches;
            The branch miss rate is very high.
            So, a naive branching version performs like dog crap.

    This reminds me of CABAC decoding in h264, where the output of the arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.


    Yeah.

    Making arithmetic and range coders fast is also hard.

    I don't often use them as much because I am not aware of a good way to
    make them fast.



    This is part of why I had often ended up going for STF+AdRice or
    similar, which, while not the best in terms of compression, can be one
    of the faster options in many cases.

    Theoretically, table-driven Huffman could be faster, but likewise often suffers from cache misses (cycles lost to cache misses can outweigh the
    cost of the more complex logic of an AdRice coder).

    Huffman speed can be improved by reducing maximum symbol length and
    table size, but then this can lose much of its compression advantage.

    Say, max symbol length:
    10/11: Too short, limits effectiveness.
    12: OK, leans faster;
    13: OK, leans better compression;
    14: Intermediate
    15: Slower still (Deflate is here)
    16: Slower (T.81 JPEG is here)


    Where, for 12/13 bits, the fastest strategy is typically to use a single
    big lookup table for the entropy decoder.

    For 15 or 16 bits, it is often faster to have a separate fast-path and
    slow path. Say, fast path matches on the first 9 or 10 bits, and then
    the slow path falls back to a linear search (over the longer symbols).

    In this case, the relative slowness of falling back to a linear search
    being less than that of the cost of the L1 misses from a bigger lookup
    table.

    The relative loss of Huffman coding efficiency between a 13 bit limit
    and 15 bit limit is fairly modest.
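    A one-table decoder along these lines might look like the following (a sketch for a 12-bit length limit with MSB-first canonical codes; the table layout and bit-reservoir convention are illustrative):

```c
#include <stdint.h>

/* Single-table Huffman decode for a 12-bit length limit: index a
   4096-entry table with the next 12 bits of the stream; each entry holds
   the decoded symbol and its true code length.  The build step fills
   every 12-bit pattern that starts with a given canonical code. */
enum { MAXLEN = 12 };

struct hentry { uint16_t sym; uint8_t len; };

void huff_build(struct hentry tab[1 << MAXLEN],
                const uint8_t *lens, int nsyms)
{
    uint32_t code = 0;
    for (int l = 1; l <= MAXLEN; l++) {
        for (int s = 0; s < nsyms; s++) {
            if (lens[s] != l) continue;
            uint32_t base = code << (MAXLEN - l);   /* left-justify code */
            for (uint32_t i = 0; i < (1u << (MAXLEN - l)); i++) {
                tab[base + i].sym = (uint16_t)s;
                tab[base + i].len = (uint8_t)l;
            }
            code++;
        }
        code <<= 1;   /* canonical: next length starts at doubled code */
    }
}

/* Decode one symbol from a bit reservoir that keeps >= MAXLEN valid bits
   at the top of a 32-bit word; consumes the code's length in bits. */
int huff_decode(const struct hentry *tab, uint32_t *bits, int *nbits)
{
    const struct hentry *e = &tab[*bits >> (32 - MAXLEN)];
    *bits <<= e->len;
    *nbits -= e->len;
    return e->sym;
}
```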




    Where, say:
    AdRice:
    + Can be made reasonably fast and cheap.
    + Low memory footprint;
    + Cheap setup cost;
    - Often weaker than Huffman in terms of compression.
    - Pure AdRice only deals with certain distributions
    (Requires STF or similar to mimic Huffman's generality).
    Static Huffman:
    + More optimal in terms of coding efficiency;
    - Speed requires initializing bulky lookup tables;
    - High overhead and ineffective for small payloads.
    Range Coding
    + Good for maximizing compression
    + Deals well with small payloads
    - Slow.


    For small data in some use cases, though, the relative gains of entropy
    coding relative to raw bytes can become small.

    Huffman falls first:
    Constant overheads of the Huffman tables themselves become the dominant
    part of the overhead.

    AdRice falls second:
    At a certain limit, it can become ineffective.

    Range falls third:
    Usually the best, but initial adaptation time becomes the bottleneck
    (needs a minimum number of symbols to actually adapt to anything).


    Had noted recently that AdRice's effectiveness can be improved slightly
    at a fairly modest speed cost by slowing the adaptation speed (say, by adjusting by a fraction of a bit each time).

    Say, typical:
    Q=0: Decrement K (if K>0)
    Q=1: Leave K as-is
    Q>=2: Increment K

    Where K is the number of fixed bits following the unary-coded prefix (Q).

    The tweak being to instead adapt a scaled K, say S, then define K=S>>4
    or similar (so, it effectively needs multiple symbols before the value
    of K updates, but increases how often symbols are encoded at the optimal
    value of K).
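    A minimal sketch of that tweak (the -2/+3 step sizes here are invented;
    the post only fixes K = S >> 4):

```c
/* Adaptive Rice parameter with fractional-bit adaptation:
 * adapt a scaled accumulator S; the coding parameter is K = S >> 4,
 * so K only moves after several symbols push S across a boundary. */
typedef struct { int s; } AdRiceCtx;

static int adrice_k(const AdRiceCtx *ctx) { return ctx->s >> 4; }

static void adrice_update(AdRiceCtx *ctx, int q)   /* q = unary prefix */
{
    if (q == 0)      { if (ctx->s > 0) ctx->s -= 2; }  /* lean K down */
    else if (q >= 2) { ctx->s += 3; }                  /* lean K up   */
    /* q == 1: leave K as-is */
}
```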


    As for STF (swap towards front):
    Usual strategy still to swap symbols with whatever is at (15*I/16) or
    similar.

    There are other schemes, but this one has most often ended up winning
    out IME.
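    The swap itself is tiny; a sketch with invented names (the rank array
    maps index to symbol, and the emitted index is what then goes through
    the Rice coder):

```c
/* Swap-towards-front: after coding the symbol at rank i, swap it with
 * the entry at (15*i)/16, so frequent symbols drift toward rank 0. */
static int stf_encode_index(unsigned char *ranks, unsigned char sym)
{
    int i = 0;
    while (ranks[i] != sym) i++;   /* or keep a reverse map for O(1) */
    int j = (15 * i) / 16;
    unsigned char t = ranks[i]; ranks[i] = ranks[j]; ranks[j] = t;
    return i;                      /* index to entropy-code */
}
```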



    So, net result: Despite its conceptual simplicity, PNG's decode-time
    performance typically sucks.

    Contrast, a decoder for a JPEG like format can be made to process one
    block at a time and go all the way to final output. So, JPEG is often
    faster despite the more complex process (with transform stages and a
    colorspace transform).


    The Paeth filter slowness does seem a little odd though:
    Theoretically, a CPU could turn a short forward branch into predication;
    But, this doesn't tend to be the case.

    It then is faster to turn the filter into some convoluted mess of
    arithmetic and masking in an attempt to reduce the branch mispredict
    costs.

    I would look for a way to handle multiple pixels at once, with SIMD
    code: There the masking/combining is typically the easiest way to
    implement short branches.

    (I might take a look at PNG decoding at some point)



    #if 0 //naive version, pays a lot for branch penalties
    int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
    {
        int p, pa, pb, pc;

        p=a+b-c;
        pa=(p>a)?(p-a):(a-p);
        pb=(p>b)?(p-b):(b-p);
        pc=(p>c)?(p-c):(c-p);

        p=(pa<=pb)?((pa<=pc)?a:c):((pb<=pc)?b:c);
        return(p);
    }
    #endif

    #if 1 //avoid branch penalties
    int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
    {
        int p, pa, pb, pc;
        int ma, mb, mc;
        p=a+b-c;
        pa=p-a;     pb=p-b;     pc=p-c;
        ma=pa>>31;  mb=pb>>31;  mc=pc>>31;
        pa=pa^ma;   pb=pb^mb;   pc=pc^mc;
        ma=pb-pa;   mb=pc-pb;   mc=pc-pa;
        ma=ma>>31;  mb=mb>>31;  mc=mc>>31;
        p=(ma&((mb&c)|((~mb)&b))) | ((~ma)&((mc&c)|((~mc)&a)));
        return(p);
    }
    #endif

    Where, the Paeth filter is typically the most heavily used filter in PNG decoding (because it tends to be the most accurate), but also the slowest.

    Could in theory be SIMD'ed to maybe work on RGB or RGBA in parallel.
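    As a concrete sketch of that per-channel parallelism (names invented;
    a real SIMD version would do the same byte-wise arithmetic on 8 or 16
    bytes at a time with SSE2/NEON intrinsics):

```c
#include <stdint.h>

/* Scalar reference Paeth, as in the naive version above. */
static int paeth1(int a, int b, int c)
{
    int p  = a + b - c;
    int pa = p > a ? p - a : a - p;
    int pb = p > b ? p - b : b - p;
    int pc = p > c ? p - c : c - p;
    return (pa <= pb) ? ((pa <= pc) ? a : c) : ((pb <= pc) ? b : c);
}

/* Predict all four RGBA bytes of one pixel; the per-byte loop is the
 * shape a vectorizing compiler (or hand-written SIMD) would widen. */
static uint32_t paeth_rgba(uint32_t left, uint32_t up, uint32_t upleft)
{
    uint32_t out = 0;
    for (int i = 0; i < 32; i += 8) {
        int a = (left   >> i) & 255;
        int b = (up     >> i) & 255;
        int c = (upleft >> i) & 255;
        out |= (uint32_t)paeth1(a, b, c) << i;
    }
    return out;
}
```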


    If someone were designing a new PNG like format, one option could be, say:
    p=(3*a+3*b-2*c)>>2;
    //maybe: clamp p to 0..255 or similar
    Which is "similar, but cheaper".
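    With the clamp spelled out (function name invented):

```c
/* "Similar, but cheaper" alternative to Paeth: a fixed blend of the
 * left (a), up (b), and up-left (c) neighbors, clamped to byte range. */
static int cheap_pred(int a, int b, int c)
{
    int p = (3*a + 3*b - 2*c) >> 2;
    if (p < 0)   p = 0;
    if (p > 255) p = 255;
    return p;
}
```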


    Otherwise, could be possible to have a faster PNG like format if the
    format were structured to allow doing everything in a single pass (with
    no separate LZ stage).

    If I were designing it, might also be tempted to use AdRice rather than Huffman.


    ...




    Otherwise:

    Meanwhile, I am mostly starting to question if one might ironically need
    to add a restriction clause to MIT-0 to preserve freedom, say, something
    like:
    This code may not be used in jurisdictions where usage would violate the
    terms of the No Warranty clause or where use of the code would be in
    violation of the laws within that jurisdiction.


    As-is, the existing language doesn't offer sufficient protection
    from liability in cases where users use the code in violation of local
    laws, and where those laws hold the vendors or copyright holders liable
    (say, because the code does not actively prevent users from using it in
    a way which is illegal under those laws, with liability applied to
    parties outside of the jurisdiction in question).

    While MIT-0 allows for re-licensing, this would shift liability to the
    user of the code for using it in a way that violates local laws.

    Should offer protection except in cases where it could be argued that
    the developers had intended for users to use the code in ways which
    violated laws (which would be harder to prove except in cases where it
    could be argued that the sole intended purpose of the code was to be
    used in a way which violated a law; rather than otherwise benign code
    that was used in a way which violated the law).

    Well, at least in theory.


    Terje


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Mar 1 19:02:18 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 2/28/2026 9:48 AM, Terje Mathisen wrote:
    This reminds me of CABAC decoding in h264, where the output of the
    arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.


    Yeah.

    Making arithmetic and range coders fast is also hard.

    I don't often use them as much because I am not aware of a good way to
    make them fast.



    This is part of why I had often ended up going for STF+AdRice or
    similar, which, while not the best in terms of compression, can be one
    of the faster options in many cases.

    Theoretically, table-driven Huffman could be faster, but likewise often suffers from cache misses (cycles lost to cache misses can outweigh the
    cost of the more complex logic of an AdRice coder).

    Huffman speed can be improved by reducing maximum symbol length and
    table size, but then this can lose much of its compression advantage.

    Say, max symbol length:
    10/11: Too short, limits effectiveness.
    12: OK, leans faster;
    13: OK, leans better compression;
    14: Intermediate
    15: Slower still (Deflate is here)
    16: Slower (T.81 JPEG is here)


    Where, for 12/13 bits, the fastest strategy is typically to use a single
    big lookup table for the entropy decoder.

    For 15 or 16 bits, it is often faster to have a separate fast-path and
    slow path. Say, fast path matches on the first 9 or 10 bits, and then
    the slow path falls back to a linear search (over the longer symbols).
    I have looked at multi-level table lookups, where the symbol either is
    the one you want (short codes) or an index into a list of secondary
    tables to be used on the remaining bits.
    When you have many really short codes (think Morse!), then you can
    profitably decode multiple in a single iteration.

    In this case, the relative slowness of falling back to a linear search
    being less than that of the cost of the L1 misses from a bigger lookup
    table.

    The relative loss of Huffman coding efficiency between a 13 bit limit
    and 15 bit limit is fairly modest.
    Yeah.
    I would look for a way to handle multiple pixels at once, with SIMD
    code: There the masking/combining is typically the easiest way to
    implement short branches.

    (I might take a look a png decoding at some point)



    #if 0 //naive version, pays a lot for branch penalties
    int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
    {
        int p, pa, pb, pc;

        p=a+b-c;
        pa=(p>a)?(p-a):(a-p);
        pb=(p>b)?(p-b):(b-p);
        pc=(p>c)?(p-c):(c-p);

        p=(pa<=pb)?((pa<=pc)?a:c):((pb<=pc)?b:c);
        return(p);
    }
    #endif

    #if 1 //avoid branch penalties
    int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
    {
        int p, pa, pb, pc;
        int ma, mb, mc;
        p=a+b-c;
        pa=p-a;     pb=p-b;     pc=p-c;
        ma=pa>>31;  mb=pb>>31;  mc=pc>>31;
        pa=pa^ma;   pb=pb^mb;   pc=pc^mc;
        ma=pb-pa;   mb=pc-pb;   mc=pc-pa;
        ma=ma>>31;  mb=mb>>31;  mc=mc>>31;
        p=(ma&((mb&c)|((~mb)&b))) | ((~ma)&((mc&c)|((~mc)&a)));
        return(p);
    }
    #endif

    Where, the Paeth filter is typically the most heavily used filter in PNG decoding (because it tends to be the most accurate), but also the slowest.

    Could in theory be SIMD'ed to maybe work on RGB or RGBA in parallel.
    OK
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Mar 1 21:13:52 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 2/27/2026 1:52 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
    prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited.

    I mentioned several limits. Which one do you have in mind?

    The one you mentioned in your last paragraph, specifically,
    the limit of memory-level parallelism of the hardware.


    Think of it this way. In the current
    situation, increasing the memory system bandwidth, say by hypothetically
    increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand,
    doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance.

    If the CPU is designed to provide enough memory-level parallelism to
    make use of the bandwidth (and that is likely, otherwise why provide
    that much bandwidth), then once the designers spend money on
    increasing the bandwidth, they will also spend the money necessary to increase the MLP.

    No. The memory system throughput depends upon the access pattern. It
    is easier/lower cost to increase the throughput for sequential accesses
    than random (think wider interfaces, cache blocks larger than the amount accessed, etc.)

    It depends on the number of accesses and the ability to absorb the latency.

    For example, say we have a memory system with 256 banks, and a latency of 10µs, and each access is for a page of memory at 5GB/s.

    A page (4096B) at 5GB/s needs 800ns.
    So, we need 12.5 Banks to saturate the data channel
    And we have 16 busses each with 16 banks,
    we can sustain 80GB/s
    So, we need in excess of 200 outstanding requests
    Each able to absorb slightly more than 10µs.
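    Spelling out the arithmetic above (using the post's 800ns rounding of
    4096B / 5GB/s; names invented):

```c
/* 256 banks, 10us bank latency, each access moves a 4096B page at 5GB/s,
 * organized as 16 busses each with 16 banks. */
static const double page_ns = 800.0;     /* ~4096B / 5GB/s per page  */
static const double lat_ns  = 10000.0;   /* 10us access latency      */

static double banks_per_bus(void) { return lat_ns / page_ns; }     /* banks to saturate one bus */
static double total_gbps(void)    { return 16 * 5.0; }             /* 16 busses x 5GB/s         */
static double outstanding(void)   { return 16 * banks_per_bus(); } /* requests in flight        */
```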

    But this is more like a disk/flash system than main memory.

    Once each bank (of which there are 256) has more than a 3-deep queue
    the BW can be delivered as long as the requests have almost ANY order
    other than targeting 1 (or few) bank(s).

    But what you say is TRUE when one limits the interpretation to modern
    CPUs, but not when one limits themselves to applications running on
    modern CPUs requesting pages from long term storage. {how do you think
    databases work??}

    But optimization for sequential workloads can actually
    hurt performance for random workloads, e.g. larger block sizes reduce
    the number of accesses for sequential workloads, but each access takes longer, thus hurting random workloads. So

    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

    designers aim to maximize the throughput, subject to cost and technology constraints, for some mix of sequential (bandwidth) versus random (latency) access.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Mar 1 18:05:23 2026
    From Newsgroup: comp.arch

    On 3/1/2026 12:02 PM, Terje Mathisen wrote:
    BGB wrote:
    On 2/28/2026 9:48 AM, Terje Mathisen wrote:
    This reminds me of CABAC decoding in h264, where the output of the
    arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.


    Yeah.

    Making arithmetic and range coders fast is also hard.

    I don't often use them as much because I am not aware of a good way to
    make them fast.



    This is part of why I had often ended up going for STF+AdRice or
    similar, which, while not the best in terms of compression, can be one
    of the faster options in many cases.

    Theoretically, table-driven Huffman could be faster, but likewise
    often suffers from cache misses (cycles lost to cache misses can
    outweigh the cost of the more complex logic of an AdRice coder).

    Huffman speed can be improved by reducing maximum symbol length and
    table size, but then this can lose much of its compression advantage.

    Say, max symbol length:
    10/11: Too short, limits effectiveness.
    12: OK, leans faster;
    13: OK, leans better compression;
    14: Intermediate
    15: Slower still (Deflate is here)
    16: Slower (T.81 JPEG is here)


    Where, for 12/13 bits, the fastest strategy is typically to use a
    single big lookup table for the entropy decoder.

    For 15 or 16 bits, it is often faster to have a separate fast-path and
    slow path. Say, fast path matches on the first 9 or 10 bits, and then
    the slow path falls back to a linear search (over the longer symbols).

    I have looked at multi-level table lookups, where the symbol either is
    the one you want (short codes) or an index into a list of secondary
    tables to be used on the remaining bits.


    Can work OK if one assumes all of the longer codes are prefixed by a
    longish series of 1s, which is usually but not necessarily true.

    Probably more true of a Deflate style table though, where sending the
    table as an array of symbol lengths does limit the configurations it can
    take.


    When you have many really short codes (think Morse!), then you can
    profitably decode multiple in a single iteration.


    Typical in my case of both shorter length-limited Huffman, and of Rice
    coding.


    In the case of Rice, it can often make sense to use a lookup table to
    decode the Q prefix.

    Say (pseudocode):
    b=PeekBits();
    q=qtab[b&255];
    if(q>=8)
        { skb=16; v=(b>>8)&255; }
    else
        { skb=q+k+1; v=(q<<k)|((b>>(q+1))&((1<<k)-1)); }
    SkipBits(skb);
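    One way to build that qtab (bit order and prefix polarity are
    assumptions here: the sketch takes the unary prefix Q as a run of low
    1 bits terminated by a 0, with 8+ ones escaping to the q>=8 path):

```c
/* qtab[b] = length of the run of 1 bits at the bottom of b, capped at 8;
 * a value of 8 signals the escape case in the decode snippet above. */
static unsigned char qtab[256];

static void init_qtab(void)
{
    for (int b = 0; b < 256; b++) {
        int q = 0;
        while (q < 8 && ((b >> q) & 1)) q++;  /* count low 1 bits */
        qtab[b] = (unsigned char)q;
    }
}
```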




    In this case, the relative slowness of falling back to a linear search
    being less than that of the cost of the L1 misses from a bigger lookup
    table.

    The relative loss of Huffman coding efficiency between a 13 bit limit
    and 15 bit limit is fairly modest.

    Yeah.


    Some of my designs where I stuck with Huffman had ended up going over to
    a 13 bit limit partly for this reason, as in this case, the lookup table fitting in the L1 cache ends up as a win.


    Though, a factor is the number of tables in use; something more like
    JPEG where one has 4 of them (Y-DC, Y-AC, UV-DC, UV-AC); this still
    isn't going to fit in the L1 cache.


    So, errm, partial win here for STF+AdRice.

    Though, one could argue:
    Why use Rice for encoding the coefficient values as VLNs vs just
    encoding the values directly as Rice? Well, simple answer is mostly that
    this sucks.


    As noted, one might want to encode symbols that encode both a zero run
    and a value. JPEG had used Z4V4, with the value directly encoding the
    number of bits. The scheme used by JPEG was comparably space-inefficient though; and the general scheme used by Deflate was more space-efficient.

    So:
    JPEG scheme:
        Read v bits;
        Sign extend these bits for the full coefficient.
        val=ReadBits(v);
        sh=31-v;
        val=((s32)((val)<<sh))>>sh;
    Deflate inspired:
        if(v>=4)
            { h=(v>>1)-1; val=((2|(v&1))<<h)|ReadBits(h); }
        else
            { val=v; }
        val=(val>>1)^(((s32)(val<<31))>>31);

    where, value table (unsigned) looks sorta like:
    pfx   extra   value
    0/1   0        0 /  1
    2/3   0        2 /  3
    4/5   1        4..  7
    6/7   2        8.. 15
    8/9   3       16.. 31
    ...
    With the LSB then encoding the sign, so:
    0, -1, 1, -2, 2, ...
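    Those two pieces (the Deflate-inspired value map and the sign fold)
    can be written out directly; helper names are invented, and e stands
    for the already-read extra-bits value:

```c
/* Number of extra bits implied by prefix v. */
static int vln_extra_bits(int v) { return (v < 4) ? 0 : (v >> 1) - 1; }

/* Unsigned value from prefix v and extra-bits value e. */
static int vln_unsigned(int v, int e)
{
    if (v < 4) return v;
    return ((2 | (v & 1)) << vln_extra_bits(v)) | e;
}

/* Zigzag fold, LSB as sign: 0, -1, 1, -2, 2, ... */
static int zigzag_to_signed(int u) { return (u >> 1) ^ -(u & 1); }
```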


    Though, as can be noted, the loss of one bit for Z means that the
    maximum run of zeroes per symbol is shorter.

    The main difference is mostly that the JPEG scheme costs roughly 1 bit
    more per VLN (typical), or 2 bits more for small values (-1 and 1 need 2
    extra with the JPEG scheme).


    Though, despite the more efficient VLN scheme, UPIC does lose some
    entropic efficiency with its use of STF+AdRice here rather than Huffman.


    I would look for a way to handle multiple pixels at once, with SIMD
    code: There the masking/combining is typically the easiest way to
    implement short branches.

    (I might take a look a png decoding at some point)



    #if 0 //naive version, pays a lot for branch penalties
    int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
    {
        int p, pa, pb, pc;

        p=a+b-c;
        pa=(p>a)?(p-a):(a-p);
        pb=(p>b)?(p-b):(b-p);
        pc=(p>c)?(p-c):(c-p);

        p=(pa<=pb)?((pa<=pc)?a:c):((pb<=pc)?b:c);
        return(p);
    }
    #endif

    #if 1 //avoid branch penalties
    int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
    {
        int p, pa, pb, pc;
        int ma, mb, mc;
        p=a+b-c;
        pa=p-a;     pb=p-b;     pc=p-c;
        ma=pa>>31;  mb=pb>>31;  mc=pc>>31;
        pa=pa^ma;   pb=pb^mb;   pc=pc^mc;
        ma=pb-pa;   mb=pc-pb;   mc=pc-pa;
        ma=ma>>31;  mb=mb>>31;  mc=mc>>31;
        p=(ma&((mb&c)|((~mb)&b))) | ((~ma)&((mc&c)|((~mc)&a)));
        return(p);
    }
    #endif

    Where, the Paeth filter is typically the most heavily used filter in
    PNG decoding (because it tends to be the most accurate), but also the
    slowest.

    Could in theory be SIMD'ed to maybe work on RGB or RGBA in parallel.

    OK



    As an idle thought for a format "sorta PNG-like", but meant to be faster
    to decode (though, aiming to be closer to PNG than, say, QOI).


    Tag symbols encode commands, say:
    01..0F: 1..15 pixels, Delta per Pixel, useOffset=0;
    11..1F: 1..15 pixels, Delta per Pixel, useOffset=1;
    21..2F: 1..15 pixels, Delta is 0, useOffset=0;
    31..3F: 1..15 pixels, Delta is 0, useOffset=1;
    41..4F: 1..15 pixels, Single Delta, Applied Every Pixel
    51..5F: 1..15 pixels, Single Delta, Applied to One Pixel

    00/10/20/30/40/50: 16+ pixels
    Length follows, encoded as a VLN.
    Otherwise behaves the same as the corresponding 1-15 case.

    60..6F: -
    70..7F: -
    80..8F: -
    90..9F: -
    A0..AF: Single Delta of a given Type, useOffset=0;
    B0..BF: Single Delta of a given Type, useOffset=1;
    C0..CF: Set Predictor Function, useOffset=0;
    D0..DF: -
    E0..FF: Set Predictor Offset, useOffset=1;

    So, in this case:
    useOffset==0:
    Predict pixel values in a similar way to PNG.
    Adjacent pixels are used with predictor.
    useOffset==1:
    The offset points at a previous location in the image;
    Deltas relative to the pixels at this offset.
    Offset is in raster space, must point within image.
    Encoded in a similar way to a Deflate distance.
    Always relative to the current position in raster space.
    The offset pixel is used as the prediction.

    So, setting an offset and then doing a run of delta==0 pixels
    effectively just copies the prior pixels. Otherwise, deltas are applied relative to those pixels.


    As in PNG, the deltas would likely be mod-256.
    Each delta point would be encoded as a symbol.


    Predictor functions:
    0: Special, restore last predictor
    1: Last Value
    2: Left
    3: Up
    4: Average of Left and Up
    6: (3*A+3*B-2*C)/4
    7: Paeth (Possible, but slower option)

    Delta Types:
    0: Special, last delta type
    1: dY, dR=dG=dB=dY, dA=0
    2: dRGB, dA=0
    3: dRGBA
    4: dYUV, dA=0 (encodes dRGB as RCT)
    5: dYUVA


    Would likely have STF+AdRice contexts for:
    Tag/Command Bytes
    Delta Bytes
    Length VLN prefix values.


    Downsides:
    This design as-is is not particularly elegant;
    Would exist awkwardly in a part of the design space between PNG and QOI;
    Would have a comparably more complex encoder.

    Would be considered as failing if:
    Compression is significantly worse than PNG;
    Fails to beat PNG at decode speed.

    The more complex stream representation is partly to compensate for not
    having an LZ stage.

    ...


    Would likely make sense to have the various configurations as function pointers, but this is not ideal (both due to code bulk and
    function-pointer overheads). But, function pointers are likely to be
    faster than decision trees in this case.


    Then again, might just be better to figure out some effective way to
    have an incremental stream-decoded inflater...


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Mar 2 02:03:10 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 3/1/2026 12:02 PM, Terje Mathisen wrote:
    BGB wrote:
    On 2/28/2026 9:48 AM, Terje Mathisen wrote:
    This reminds me of CABAC decoding in h264, where the output of the
    arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.


    Yeah.

    Making arithmetic and range coders fast is also hard.

    I don't often use them as much because I am not aware of a good way to
    make them fast.



    This is part of why I had often ended up going for STF+AdRice or
    similar, which, while not the best in terms of compression, can be one
    of the faster options in many cases.

    Theoretically, table-driven Huffman could be faster, but likewise
    often suffers from cache misses (cycles lost to cache misses can
    outweigh the cost of the more complex logic of an AdRice coder).

    Huffman speed can be improved by reducing maximum symbol length and
    table size, but then this can lose much of its compression advantage.

    Say, max symbol length:
    10/11: Too short, limits effectiveness.
    12: OK, leans faster;
    13: OK, leans better compression;
    14: Intermediate
    15: Slower still (Deflate is here)
    16: Slower (T.81 JPEG is here)


    Where, for 12/13 bits, the fastest strategy is typically to use a
    single big lookup table for the entropy decoder.

    For 15 or 16 bits, it is often faster to have a separate fast-path and
    slow path. Say, fast path matches on the first 9 or 10 bits, and then
    the slow path falls back to a linear search (over the longer symbols).

    I have looked at multi-level table lookups, where the symbol either is
    the one you want (short codes) or an index into a list of secondary
    tables to be used on the remaining bits.


    Can work OK if one assumes all of the longer codes are prefixed by a
    longish series of 1s, which is usually but not necessarily true.

    In HW any pattern can be used. In SW only patterns that are almost
    satisfied by the current ISA can be considered. Big difference.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Mar 2 17:12:53 2026
    From Newsgroup: comp.arch

    On 2/28/2026 1:49 PM, Waldek Hebisch wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
    prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited. Think of it this way. In the current
    situation, increasing the memory system bandwidth, say by hypothetically
    increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand,
    doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance. To me, that is the definition
    of being latency bound, not bandwidth bound.

    I agree with your definition, but my prediction is somewhat different.
    First, consider silly program that goes sequentially over larger
    array accessing all lines. AFAICS you should see a tiny effect when
    program uses only one byte from each line compared to using whole
    line. Now consider variant that accesses every fifth line.
    There are differences, one that prefetcher needs to realize that
    there is no need to prefetch intermediate lines. Second difference
    is that one can fetch lines quickly only when they are on a single
    page. Having "step 5" on lines means 5 times as many page crossings.
    I do not know how big are pages in modern DRAM, but at step large
    enough you will see significant delay due to page crossing. I
    would tend to call this delay "latency", but it is somewhat
    murky. Namely, with enough prefetch and enough memory banks
    you can still saturate a single channel to the core (assuming
    that there are many cores, many channels from memory controller
    to memory banks but only a single channel between memory controller
    and each core). Of course, modern systems tend to have limited
    number of memory banks, so the argument above is purely theoretical.

    Somewhat different case is when there are independent loads from
    random locations, something like

    for(i = 0; i < N; i++) {
    s += m[f(i)];
    }

    where 'f' is very cheap to compute, but hard to predict by the
    hardware. In case above reorder buffer and multiple banks helps,
    but even with unlimited CPU resources maximal number of accesses
    is number of memory banks divided by access time of single bank
    (that is essentially latency of memory array).

    Then there is pointer chasing case, like

    for(i = 0; i < N; i++) {
    j = m[j];
    }

    when 'm' is filled with semi-random cyclic pattern this behaves
    quite badly, basically you can start next access only when you
    have result of the previous access. In practice, large 'm'
    seem to produce large number of cache misses for TLB entries.

    Well, the examples I gave got confusing (my fault) because, as Anton
    pointed out, the table I used would fit into L3 cache on many modern
    systems. So this all got tied up in the difference between in cache
    results and in DRAM results. I don't disagree with your points, but
    they are tangential to the point I was trying to make.

    Let me try again. Suppose you had a (totally silly) program with a 2 GB array, and you used a random number generator to pick an address within it, then
    added the value at that addressed byte to an accumulator. Repeat say
    10,000 times. I would call this program latency bound, but I suspect
    Anton would call it bandwidth bound. If that is true, then that
    explains the original differences Anton and I had.


    Perhaps this distinction is clearer to me due to my background in the
    (hard) disk business. You want lower latency? Make the arm move faster
    or spin the disk faster. You want higher bandwidth? Put more bits on a
    track or interleave the data across multiple disk heads. And in a
    system, the number of active prefetches is naturally limited by the
    number of disk arms you have.

    That disk analogy is flawed. AFAIK there is no penalty for choosing
    "far away" pages compared to "near" ones (if any the opposite: Row
    Hammer shows that accesses to "near" pages mean that given page may
    need more frequent refresh). In case of memory, time spent in
    memory controller is non-negligible, at least for accesses within
    single page. AFAIK random access to lines within single page
    costs no more than sequential access, for disk you want sequential
    access to a single track.

    Yes, but these are second order compared to the difference between
    latency and bandwidth on a disk.

    This whole thing has spiraled into something far beyond what I expected,
    and what, I think, is useful. So unless you want me to, I probably
    won't respond further.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Mar 3 02:34:28 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 2/28/2026 1:49 PM, Waldek Hebisch wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    ----------------------
    Somewhat different case is when there are independent loads from
    random locations, something like

    for(i = 0; i < N; i++) {
    s += m[f(i)];
    }

    where 'f' is very cheap to compute, but hard to predict by the
    hardware. In case above reorder buffer and multiple banks helps,
    but even with unlimited CPU resources maximal number of accesses
    is number of memory banks divided by access time of single bank
    (that is essentially latency of memory array).

    Then there is pointer chasing case, like

    for(i = 0; i < N; i++) {
    j = m[j];
    }

    when 'm' is filled with semi-random cyclic pattern this behaves
    quite badly, basically you can start next access only when you
    have result of the previous access. In practice, large 'm'
    seem to produce large number of cache misses for TLB entries.

    Well, the examples I gave got confusing (my fault) because, as Anton
    pointed out, the table I used would fit into the L3 cache on many
    modern systems. So this all got tied up in the difference between
    in-cache results and in-DRAM results. I don't disagree with your
    points, but they are tangential to the point I was trying to make.

    Let me try again. Suppose you had a (totally silly) program with a
    2 GB array, and you used a random number generator to generate an
    address within it, then added the value at that addressed byte to an
    accumulator. Repeat, say, 10,000 times. I would call this program
    latency bound, but I suspect Anton would call it bandwidth bound. If
    that is true, then that explains the original differences Anton and
    I had.

    This can be limited by the calculation latency of the RNG. A 1-cycle
    RNG would allow for as many memory references as the CPU allows to be
    in progress simultaneously. A K/cycle RNG would simply saturate the
    miss buffer sooner.

    Your typical 'Multiply and add with table indirection' RNG is about the
    latency of L2-hit or a bit longer; and way smaller than L3-hit.

    When the RNG latency is longer than the L3 latency--you will not be memory bound.
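
    The "1-cycle RNG" point can be made concrete: a multiply-add LCG has
    a latency of a few cycles, far below any cache level, so the index
    stream never gates the loads. A sketch (the constants are Knuth's
    MMIX generator; the function names are mine):

```c
#include <stdint.h>

/* A one-multiply-add LCG (constants from Knuth's MMIX generator).
   Its latency is a single IMUL+ADD, so random indices are produced
   far faster than DRAM can serve the resulting loads: the loop below
   is limited by how many misses the CPU keeps in flight, not by the
   RNG. */
static inline uint64_t lcg_next(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return *state;
}

/* The "silly program" from the post: sum of bytes at random offsets.
   'mask' keeps the index within the (power-of-two sized) array. */
uint64_t random_sum(const uint8_t *m, uint64_t mask, int iters)
{
    uint64_t state = 42, s = 0;
    for (int i = 0; i < iters; i++)
        s += m[lcg_next(&state) & mask];
    return s;
}
```

    A table-indirection RNG, by contrast, inserts an extra dependent
    load per index, which is where the "about L2-hit latency" figure
    comes from.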


    Perhaps this distinction is clearer to me due to my background in
    the (hard) disk business. You want lower latency? Make the arm move
    faster or spin the disk faster. You want higher bandwidth? Put more
    bits on a track or interleave the data across multiple disk heads.
    And in a system, the number of active prefetches is naturally
    limited by the number of disk arms you have.

    That disk analogy is flawed. AFAIK there is no penalty for choosing
    "far away" pages compared to "near" ones (if anything the opposite:
    Row Hammer shows that accesses to "near" pages mean that a given
    page may need more frequent refresh). In the case of memory, time
    spent in the memory controller is non-negligible, at least for
    accesses within a single page. AFAIK random access to lines within
    a single page costs no more than sequential access; for disk you
    want sequential access to a single track.

    Yes, but these are second order compared to the difference between
    latency and bandwidth on a disk.

    Or a RAID of disks--but that is only of importance when one has
    enough disks for the "several ms" to be amortized by the "data
    occupancy" on the bus. {Little's Law} And in practical systems one
    needs on the order of several hundred disks for that to be manifest.

    This whole thing has spiraled into something far beyond what I expected,
    and what, I think, is useful. So unless you want me to, I probably
    won't respond further.




    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Mar 3 04:24:14 2026
    From Newsgroup: comp.arch

    On 3/1/2026 8:03 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 3/1/2026 12:02 PM, Terje Mathisen wrote:
    BGB wrote:
    On 2/28/2026 9:48 AM, Terje Mathisen wrote:
    This reminds me of CABAC decoding in h264, where the output of the
    arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.


    Yeah.

    Making arithmetic and range coders fast is also hard.

    I don't often use them because I am not aware of a good way to
    make them fast.



    This is part of why I had often ended up going for STF+AdRice or
    similar, which, while not the best in terms of compression, can be
    one of the faster options in many cases.

    Theoretically, table-driven Huffman could be faster, but likewise
    often suffers from cache misses (cycles lost to cache misses can
    outweigh the cost of the more complex logic of an AdRice coder).

    Huffman speed can be improved by reducing the maximum symbol length
    and table size, but then this can lose much of its compression
    advantage.

    Say, max symbol length:
      10/11: Too short, limits effectiveness;
      12: OK, leans faster;
      13: OK, leans better compression;
      14: Intermediate;
      15: Slower still (Deflate is here);
      16: Slower (T.81 JPEG is here).


    Where, for 12/13 bits, the fastest strategy is typically to use a
    single big lookup table for the entropy decoder.

    For 15 or 16 bits, it is often faster to have a separate fast path
    and slow path. Say, the fast path matches on the first 9 or 10
    bits, and the slow path falls back to a linear search (over the
    longer symbols).
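
    The single-big-table strategy can be sketched like this: for a max
    code length of 12 bits, peek 12 bits and one lookup yields both the
    symbol and its true length, because every 12-bit pattern sharing a
    code's prefix maps to that code's entry. (Names and layout here are
    illustrative, not taken from any of the codecs discussed.)

```c
#include <stdint.h>

#define MAXLEN 12                       /* max code length in bits */

typedef struct { uint16_t sym; uint8_t len; } Entry;

/* Fill every table slot whose top 'len' bits equal 'code'; the table
   has 2^MAXLEN entries. */
void huff_table_add(Entry *tab, unsigned code, unsigned len, unsigned sym)
{
    unsigned lo = code << (MAXLEN - len);
    unsigned n  = 1u << (MAXLEN - len);
    for (unsigned i = 0; i < n; i++)
        tab[lo + i] = (Entry){ (uint16_t)sym, (uint8_t)len };
}

/* Decode one symbol from an MSB-first 64-bit buffer: peek MAXLEN bits,
   one lookup, then consume entry.len bits. Assumes pos + MAXLEN <= 64. */
unsigned huff_decode(const Entry *tab, uint64_t bits, unsigned *pos)
{
    unsigned peek = (unsigned)(bits >> (64 - MAXLEN - *pos))
                    & ((1u << MAXLEN) - 1);
    *pos += tab[peek].len;
    return tab[peek].sym;
}
```

    The cost is the 2^MAXLEN-entry table, which is exactly the cache
    pressure the post complains about at 15-16 bits.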

    I have looked at multi-level table lookups, where the entry either
    is the symbol you want (short codes) or an index into a list of
    secondary tables to be used on the remaining bits.


    Can work OK if one assumes all of the longer codes are prefixed by a
    longish series of 1s, which is usually but not necessarily true.

    In HW any pattern can be used. In SW only patterns that are almost
    satisfied by the current ISA can be considered. Big difference.

    I guess it is possible someone could define hardware logic to support
    Huffman coding, but then again, it would be even easier to define
    hardware support for Rice coding.

    Though, this could range from more generic, like a CTNZ instruction
    (Count Trailing Non-Zero) to maybe more specialized instructions.

    The big downside for Huffman in HW is that it would almost
    invariably require big lookup tables, whereas Rice coding could
    mostly be done with fixed logic (and/or more generic instructions).


    Like, often the main lookup table used for Rice-decoding is just to do
    the equivalent of a CTNZ operation.
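
    For illustration, a minimal Rice decode in that style: counting the
    leading 1 bits of the unary quotient is exactly the single-step job
    a CTNZ-like instruction would do, and the GCC/Clang builtin
    __builtin_clzll on the complemented word stands in for it here.
    (Buffer handling is simplified; this assumes 1 <= k <= 32 and that
    a terminating 0 bit exists within the word.)

```c
#include <stdint.h>

/* Rice decode of one value from an MSB-first 64-bit buffer: the
   quotient is a run of 1 bits terminated by a 0, followed by k literal
   remainder bits. */
unsigned rice_decode(uint64_t bits, unsigned k, unsigned *pos)
{
    uint64_t w = bits << *pos;                   /* align next bit to MSB */
    unsigned q = (unsigned)__builtin_clzll(~w);  /* leading 1s = quotient */
    *pos += q + 1;                               /* skip run + the 0 bit  */
    unsigned r = (unsigned)((bits << *pos) >> (64 - k)); /* k remainder bits */
    *pos += k;
    return (q << k) | r;
}
```

    With a real CTNZ (or CLZ-of-complement) instruction, the decode is
    a handful of fixed ops and needs no table at all, which is the
    contrast with Huffman being drawn above.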

    Could integrate things more, but this would likely get into the
    territory of needing instructions with multiple destination
    registers or similar (and/or some sort of state-containing
    architectural feature).

    ...



    Meanwhile, did a quick mock-up of the Rice-coded vaguely PNG-like
    format mentioned previously (calling it UNG for now), but thus far
    it is kinda looking like a turd.


    Comparing a few formats (with a photo-like image, 1024x688, lossless or
    max quality; speeds on my desktop PC):
    UPIC: 488K, 41.5 Mpix/s
    UNG : 1060K, 21.6 Mpix/s
    JPG : 410K, 25.3 Mpix/s (Q=100, inexact)
    PNG : 795K, 12.7 Mpix/s
    QOI : 1036K, 76.2 Mpix/s

    So, UPIC gives nearly JPEG-like compression while being lossless and
    faster than either JPEG or PNG.


    QOI wins for speed, but not compression (it is a byte-oriented format).
    It is fast, but in my own testing its compression still often loses to
    PNG (despite claims that it beats PNG).


    And, my UNG test was kind of a fail thus far, having a QOI-like
    file-size with closer to PNG-like speeds.

    Well, the use of entropy coding is not necessarily a win, though, if
    the design still turns out to be kinda a turd...



    UNG could maybe be improved with more fiddling, but thus far this is not
    a good start.


    It looks like UPIC is still in the lead here. Initially, it was mostly
    just focused on speed (and ability to support a lossless mode) but
    turned out to also be pretty solid for compression as well.

    Also maybe ironic that the Block Haar Transform is both faster than DCT
    and also still fairly effective as a block transform (and lossless). Not
    sure why Block-Haar is not more popular.
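
    For reference, the lossless integer Haar step (the classic
    S-transform) really is just adds and shifts, which is why a block
    Haar is so much cheaper than a DCT. A 1-D sketch (the actual UPIC
    block layout is not shown; assumes arithmetic right shift of
    negatives, as on mainstream compilers):

```c
/* One level of the lossless integer Haar (S-)transform on a pair:
   difference plus rounded average, exactly invertible with no
   multiplies. */
void haar_fwd(int a, int b, int *lo, int *hi)
{
    *hi = a - b;                 /* difference */
    *lo = b + (*hi >> 1);        /* equals floor((a + b) / 2) */
}

void haar_inv(int lo, int hi, int *a, int *b)
{
    *b = lo - (hi >> 1);         /* undo the average */
    *a = *b + hi;                /* undo the difference */
}
```

    Applying this across rows and then columns of a block gives a 2-D
    block transform that round-trips bit-exactly, hence the lossless
    mode.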

    Note that both UPIC and JPG see a pretty big speed boost when using
    lossy compression here (and, among the formats tested, JPG is the
    closest direct analog to UPIC among the mainline formats).


    Then again, maybe I need to test with highly-compressible synthetic
    images (like UI graphics), as this is typically where PNG holds a strong
    lead. And, I didn't really come up with the UNG design with photos in
    mind (even if they make sense as an initial test case).

    Would likely be a worse option for texture-maps than UPIC.
    Likely has no real advantage over indexed-color BMP for the
    use-cases where indexed-color BMP makes sense (and thus far the
    best compression method for indexed-color graphics seems to be LZ
    compression over the indexed-color data).



    Still kinda annoying that stuff like this can be hit or miss; one
    doesn't always know in advance whether something will work well or
    turn out to just kinda suck (well, and sometimes things that seem
    to suck initially can pull ahead with some more polishing).

    ...


    --- Synchronet 3.21d-Linux NewsLink 1.2