In a way, it showed that they screwed up the design pretty hard
that x86-64 ended up being the faster and more efficient option...
They did. They really did.
I guess one question is if they had any other particular drawbacks
other than, say:
Their code density was one of the worst around;
128 registers is a little excessive;
128 predicate register bits is a bit WTF;
They
had two much bigger problems, though.
They'd correctly understood that the low speed of affordable dynamic RAM
as compared to CPUs running at hundreds of MHz was the biggest barrier to making code run fast.
Their solution was to have the compiler schedule loads
well in advance. They assumed, without evidence, that a compiler with
plenty of time to think could schedule loads better than hardware doing
it dynamically. It's an appealing idea, but it's wrong.
It might be possible to do that effectively in a single-core,
single-thread, single-task system that isn't taking many (if any)
interrupts. In a multi-core system, running a complex operating system, several multi-threaded applications, and taking frequent interrupts and context switches, it is _not possible_. There is no knowledge of any of
the interrupts, context switches or other applications at compile time,
so the compiler has no idea what is in cache and what isn't.
Speculative execution addresses that problem quite effectively.
We don't
have a better way, almost thirty years after Itanium design decisions
were taken. They didn't want to do speculative execution
and they chose
an instruction format and register set that made adding it later hard.
The other problem was that they had three (or six, or twelve) in-order pipelines running in parallel. That meant the compilers had to provide
enough ILP to keep those pipelines fed, or they'd just eat cache capacity
and memory bandwidth executing no-ops ...
They didn't have a general way to extract enough ILP.
I guess it is more of an open question of what would have happened,
say, if Intel had gone for an ISA design more like ARM64 or RISC-V
or something.
ARM64 seems to me to be the product of a lot more experience with speculatively-executing processors than was available in 1998.
RISC-V has
not demonstrated really high performance yet, and it's been around long enough that I'm starting to doubt it ever will.
jgd@cix.co.uk (John Dallman) writes:
----------------
[some unattributed source writes:]
Actually, other architectures also added prefetching instructions for
dealing with that problem. All I have read about that was that there
were many disappointments when using these instructions. I don't know
if there were any successes, and how frequent they were compared to
the disappointments. So I don't see that IA-64 was any different from
other architectures in that respect.
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
----------------------------
I have certainly read about interesting results for binary search (in
an array) where the branching version outperforms the branchless
version because the branching version speculatively executes several
followup loads, and even in unpredictable cases the speculation should
be right in 50% of the cases, resulting in an effective doubling of memory-level parallelism (but at a much higher cost in memory
subsystem load). But other than that, I don't see that speculative
execution helps with memory latency.
The instruction format makes no difference. Having so many registers
may have made it harder than otherwise, but SPARC also used many
registers, and we have seen OoO implementations (and discussed them a
few months ago). The issue is that speculative execution and OoO
make all the EPIC features of IA-64 unnecessary, so if they cannot do
a fast in-order implementation of IA-64 (and they could not), they
should just give up and switch to an architecture without these
features, such as AMD64. And Intel did, after a few years of denial.
The remainder of IA-64 was just to keep it alive for the customers who
had bought into it.
For a few cases, they have. But the problem is that these cases are
also vectorizable.
Their major problem was that they did not get enough ILP from
general-purpose code.
And even where they had enough ILP, OoO CPUs with SIMD extensions ate
their lunch.
- anton
Stefan Monnier <monnier@iro.umontreal.ca> writes:
At the time of conception, there were many arguments that {sooner or
later} compilers COULD figure stuff like this out.
I can't remember seeing such arguments coming from compiler people, tho.
Actually, the IA-64 people could point to the work on VLIW (in
particular, Multiflow (trace scheduling) and Cydrome (software
pipelining)), which in turn is based on the work on compilers for
microcode.
But as computer hardware got faster and denser, it became possible to
do the scheduling on the fly in hardware, so you could get comparable performance with conventional instruction sets in a microprocessor.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
[1) stride-based. 2) pointer-chasing]
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
Array and Matrix scientific codes with datasets bigger than cache.
I have certainly read about interesting results for binary search (in
an array) where the branching version outperforms the branchless
version because the branching version speculatively executes several
followup loads, and even in unpredictable cases the speculation should
be right in 50% of the cases, resulting in an effective doubling of
memory-level parallelism (but at a much higher cost in memory
subsystem load). But other than that, I don't see that speculative
execution helps with memory latency.
At the cost of opening the core up to Spectre-like attacks.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:[1) stride-based. 2) pointer-chasing]
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come
to mind, but are they really common, apart from programming
classes?
Array and Matrix scientific codes with datasets bigger than cache.
The dense cases are covered by stride-based hardware predictors, so
they are not "otherwise". I am not familiar enough with sparse
scientific codes to comment on whether they are 1), 2), or
"otherwise".
I have certainly read about interesting results for binary search
(in an array) where the branching version outperforms the
branchless version because the branching version speculatively
executes several followup loads, and even in unpredictable cases
the speculation should be right in 50% of the cases, resulting in
an effective doubling of memory-level parallelism (but at a much
higher cost in memory subsystem load). But other than that, I
don't see that speculative execution helps with memory latency.
At the cost of opening the core up to Spectre-like attacks.
There may be a way to avoid the side channel while still supporting
this scenario. But I think that there are better ways to speed up
such a binary search:
Here software prefetching can really help: prefetch one level ahead (2 prefetches), or two levels ahead (4 prefetches), three levels (8
prefetches), or four levels (16 prefetches), etc., whatever gives the
best performance (which may be hardware-dependent). The result is a
speedup of the binary search by (in the limit) levels+1.
By contrast, the branch prediction "prefetching" provides a factor 1.5
at twice the number of loads when one branch is predicted, 1.75 at 3x
the number of loads when two branches are predicted, etc. up to a
speedup factor of 2 with an infinite number of loads and predicted branches;
that's for completely unpredictable lookups, with some predictability
the branch prediction approach performs better, and with good
predictability it should outdo the software-prefetching approach for
the same number of additional memory accesses.
- anton
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
At the time of conception, there were many arguments that {sooner or
later} compilers COULD figure stuff like this out.
I can't remember seeing such arguments coming from compiler people, tho.
Actually, the IA-64 people could point to the work on VLIW (in
particular, Multiflow (trace scheduling) and Cydrome (software pipelining)), which in turn is based on the work on compilers for microcode.
I knew the Multiflow people pretty well when I was at Yale. Trace
scheduling was inspired by the FPS AP-120B, which had wide
instructions issuing multiple operations and was extremely hard to
program efficiently.
Multiflow's compiler worked pretty well and did a good job of static scheduling memory operations when the access patterns weren't too data dependent. It was good enough that Intel and HP both licensed it and
used it in their VLIW projects. It was a good match for the hardware
in the 1980s.
But as computer hardware got faster and denser, it became possible to
do the scheduling on the fly in hardware, so you could get comparable performance with conventional instruction sets in a microprocessor.
particular, Multiflow (trace scheduling) and Cydrome (software
pipelining)), which in turn is based on the work on compilers for
microcode.
Of course, compiler people have worked on such problems and solved some cases. But what I wrote above is that "I can't remember seeing
... compiler people" claiming that "{sooner or later} compilers COULD
figure stuff like this out".
Apropos another thread I can believe that IA-64 was obsolete before it was shipped
for that reason, static scheduling will never keep up with dynamic except in applications where the access patterns are predictable.
Are there enough applications like that to make VLIWs worth it? Some kinds of DSP?
John Levine <johnl@taugh.com> writes:
Apropos another thread I can believe that IA-64 was obsolete before it was shipped
for that reason, static scheduling will never keep up with dynamic except in applications where the access patterns are predictable.
Concerning the scheduling, hardware scheduling looked pretty dumb at
the time (always schedule the oldest ready instruction(s)), ...
Another aspect where hardware is far superior is branch prediction.
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
John Levine <johnl@taugh.com> writes:
Apropos another thread I can believe that IA-64 was obsolete before it was shipped
for that reason, static scheduling will never keep up with dynamic except in applications where the access patterns are predictable.
Concerning the scheduling, hardware scheduling looked pretty dumb at
the time (always schedule the oldest ready instruction(s)), ...
I was thinking of memory scheduling. You have multiple banks of memory
each of which can only do one fetch or store at a time, and the goal
was to keep all of the banks as busy as possible. If you're accessing
an array in predictable order, trace scheduling works well, but if
you're fetching a[b[i]] where b varies at runtime, it doesn't.
Another aspect where hardware is far superior is branch prediction.
I gather speculative execution of both branch paths worked OK if the
branch tree wasn't too bushy.
There were certainly ugly details, e.g.,
if there's a trap on a path that turns out not to be taken.
On Sun, 22 Feb 2026 13:17:30 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
[1) stride-based. 2) pointer-chasing]
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come
to mind, but are they really common, apart from programming
classes?
Array and Matrix scientific codes with datasets bigger than cache.
The dense cases are covered by stride-based hardware predictors, so
they are not "otherwise". I am not familiar enough with sparse
scientific codes to comment on whether they are 1), 2), or
"otherwise".
BLAS Level 3 is not particularly external/LLC bandwidth intensive even without hardware predictors.
Overwhelming majority of data served from
L2 cache.
That's with classic SIMD. It's possible that with AMX units it's no
longer true.
The recent comparisons of branchy vs branchless binary search that we
carried on RWT forum seems to suggest that on modern CPUs branchless
variant is faster even when the table does not fit in LLC.
Branchy variant manages to pull ahead only when TLB misses can't be
served from L2$.
At least that's how I interpreted it.
Here is result on very modern CPU:
https://www.realworldtech.com/forum/?threadid=223776&curpostid=223974
And here is older gear:
https://www.realworldtech.com/forum/?threadid=223776&curpostid=223895
Michael S <already5chosen@yahoo.com> writes:
On Sun, 22 Feb 2026 13:17:30 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
[1) stride-based. 2) pointer-chasing]
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays
come to mind, but are they really common, apart from programming
classes?
Array and Matrix scientific codes with datasets bigger than
cache.
The dense cases are covered by stride-based hardware predictors, so
they are not "otherwise". I am not familiar enough with sparse
scientific codes to comment on whether they are 1), 2), or
"otherwise".
BLAS Level 3 is not particularly external/LLC bandwidth intensive
even without hardware predictors.
There are HPC applications that are bandwidth limited; that's why they
have the roofline performance model (e.g., <https://docs.nersc.gov/tools/performance/roofline/>).
Of course for
the memory-bound applications the uarch style is not particularly
important, as long as it allows exploiting the memory-level
parallelism in some way; prefetching (by hardware or software) is a
way that can be performed by any kind of uarch to exploit MLP.
IA-64 implementations have performed relatively well for SPECfp. I
don't know how memory-bound these applications were, though.
For those applications that benefit from caches (e.g., matrix multiplication), memory-level parallelism is less important.
Overwhelming majority of data served from
L2 cache.
That's with classic SIMD. It's possible that with AMX units it's no
longer true.
I very much doubt that. There is little point in adding an
instruction that slows down execution by turning it from compute-bound
to memory-bound.
The recent comparisons of branchy vs branchless binary search that we
carried on RWT forum seems to suggest that on modern CPUs branchless
variant is faster even when the table does not fit in LLC.
Only two explanations come to my mind:
1) The M3 has a hardware prefetcher that recognizes the pattern of a
binary array search and prefetches accordingly. The cache misses from
page table accesses might confuse the prefetcher, leading to worse performance eventually.
2) (doubtful) The compiler recognizes the algorithm and inserts
software prefetch instructions.
Branchy variant manages to pull ahead only when TLB misses can't be
served from L2$.
At least that how I interpreted it.
Here is result on very modern CPU:
https://www.realworldtech.com/forum/?threadid=223776&curpostid=223974
And here is older gear:
https://www.realworldtech.com/forum/?threadid=223776&curpostid=223895
I tried to run your code <https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
on Zen4, but clang-14 converts uut2.c into branchy code. I could not
get gcc-12 to produce branchless code from slightly adapted source
code, either. My own attempt of using extended asm did not pass your
sanity checks, so eventually I used the assembly code produced by
clang-19 through godbolt.
Here I see that branchy is faster for the
100M array size on Zen4 (i.e., where on the M3 branchless is faster):
[~/binary-search:165562] perf stat branchless 100000000 10000000 10
4041.093082 msec. 0.404109 usec/point
Performance counter stats for 'branchless 100000000 10000000 10':
45436.88 msec task-clock 1.000 CPUs utilized
276 context-switches 6.074 /sec
0 cpu-migrations 0.000 /sec
1307 page-faults 28.765 /sec
226091936865 cycles 4.976 GHz
1171349822 stalled-cycles-frontend 0.52% frontend cycles idle
34666912723 instructions 0.15 insn per cycle
0.03 stalled cycles per insn
5829461905 branches 128.298 M/sec
45963810 branch-misses 0.79% of all branches
45.439541729 seconds time elapsed
45.397207000 seconds user
0.040008000 seconds sys
[~/binary-search:165563] perf stat branchy 100000000 10000000 10
3051.269998 msec. 0.305127 usec/point
Performance counter stats for 'branchy 100000000 10000000 10':
34308.48 msec task-clock 1.000 CPUs utilized
229 context-switches 6.675 /sec
0 cpu-migrations 0.000 /sec
1307 page-faults 38.096 /sec
172472673652 cycles 5.027 GHz
49176146462 stalled-cycles-frontend 28.51% frontend cycles idle
33346311766 instructions 0.19 insn per cycle
1.47 stalled cycles per insn
10322173655 branches 300.864 M/sec
1583842955 branch-misses 15.34% of all branches
34.311432848 seconds time elapsed
34.264437000 seconds user
0.044005000 seconds sys
If you have recommendations what to use for the other parameters, I
can run other sizes as well.
- anton
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
I gather speculative execution of both branch paths worked OK if the
branch tree wasn't too bushy. There were certainly ugly details,
e.g., if there's a trap on a path that turns out not to be taken.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
IA-64 was to prevent AMD (and others) from cloning--that is all. Intel/HP would have had a patent wall 10 feet tall.
It did not fail by being insufficiently protected, it failed because it
did not perform.
-----------------
For a few cases, they have. But the problem is that these cases are
also vectorizable.
Their major problem was that they did not get enough ILP from
general-purpose code.
And even where they had enough ILP, OoO CPUs with SIMD extensions ate
their lunch.
And that should have been the end of the story. ... Sorry Ivan.
In article <10ngao4$d5o$1@gal.iecc.com>, johnl@taugh.com (John Levine)
wrote:
I gather speculative execution of both branch paths worked OK if the
branch tree wasn't too bushy. There were certainly ugly details,
e.g., if there's a trap on a path that turns out not to be taken.
Found a good CPU bug like that on an old AMD chip, the K6-II.
It happened with a floating point divide by zero in the x87 registers, guarded by a test for division by zero, with floating-point traps enabled. The divide got speculatively executed, the trap was stored, the test
revealed the divide would be by zero, the CPU tried to clean up, hit its
bug, and just stopped. Power switch time.
This only happened with the reverse divide instruction, which took the operands off the x87 stack in the opposite order from the usual FDIV. It
was rarely used, so the bug didn't become widely known. But Microsoft's compiler used it occasionally.
On Mon, 23 Feb 2026 08:06:20 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
The recent comparisons of branchy vs branchless binary search that we
carried on RWT forum seems to suggest that on modern CPUs branchless
variant is faster even when the table does not fit in LLC.=20
Only two explanations come to my mind:
1) The M3 has a hardware prefetcher that recognizes the pattern of a
binary array search and prefetches accordingly. The cache misses from
page table accesses might confuse the prefetcher, leading to worse
performance eventually.
Coffee Lake certainly has no such prefetcher and nevertheless exhibits similar behavior.
I tried to run your code
<https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
on Zen4, but clang-14 converts uut2.c into branchy code. I could not
get gcc-12 to produce branchless code from slightly adapted source
code, either. My own attempt of using extended asm did not pass your
sanity checks, so eventually I used the assembly code produced by
clang-19 through godbolt.
I suppose that you have good reason for avoiding installation of clang17
or later on one of your computers.
If you have recommendations what to use for the other parameters, I
can run other sizes as well.
- anton
Run for every size from 100K to 2G in increments of x sqrt(2).
BTW, I prefer odd number of iterations.
The point at which the majority of look-ups miss L3$ at least once
will hopefully be seen as a change of slope on the log(N) vs duration
graph for the branchless variant.
I would not bother with performance counters. At least for me they
bring more confusion than insight.
According to my understanding, on Zen4 the ratio of main DRAM
latency to L3 latency is much higher than on either Coffee Lake or M3,
both of which have unified LLC instead of split L3$.
So, if on Zen4 branchy starts to win at ~2x L3 size, I will not be
shocked. But I will be somewhat surprised.
I actually have access to Zen3-based EPYC, where the above mentioned
ratio is supposedly much bigger than on any competent client CPU (Intel's
Lunar/Arrow Lake do not belong to this category) but this server is
currently powered down and it's a bit of a hassle to turn it on.
On Mon, 23 Feb 2026 08:06:20 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 22 Feb 2026 13:17:30 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Array and Matrix scientific codes with datasets bigger than
cache.
The dense cases are covered by stride-based hardware predictors, so
they are not "otherwise". I am not familiar enough with sparse
scientific codes to comment on whether they are 1), 2), or
"otherwise".
BLAS Level 3 is not particularly external/LLC bandwidth intensive
even without hardware predictors.
There are HPC applications that are bandwidth limited; that's why they
have the roofline performance model (e.g.,
<https://docs.nersc.gov/tools/performance/roofline/>).
Sure. But that's not what I would call "dense".
In my vocabulary "dense" starts at matmul(200x200,x200) or at LU decomposition of matrix of similar dimensions.
Overwhelming majority of data served from
L2 cache.
That's with classic SIMD. It's possible that with AMX units it's no
longer true.
I very much doubt that. There is little point in adding an
instruction that slows down execution by turning it from compute-bound
to memory-bound.
It does not slow down the execution. To the contrary, it speeds it up
so much that speed of handling of L2 misses begins to matter.
Pay attention that it's just my speculation that can be wrong.
IIRC, right now binary64-capable AMX is available only on Apple Silicon
(via SME) and may be on IBM Z. I didn't play with either.
On 2/21/2026 8:18 AM, Anton Ertl wrote:
big snip
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
It is probably useful to distinguish between latency bound and bandwidth bound.
Many occur in commercial (i.e. non scientific) programs, such as
database systems. For example, imagine a company employee file (table),
with a (say 300 byte) record for each of its many thousands of employees,
each containing typical employee stuff. Now suppose someone wants to
know "What is the total salary of all the employees in the Sales
department?" With no index on "department", but it is at a fixed
displacement within each record, the code looks at each record, does a
trivial test on it, perhaps adds to a register, then goes to the next
record. This is almost certainly memory latency bound.
So adding two dense matrices tends to be memory bandwidth bound, but stride-based prefetchers help to avoid getting any extra latency
beyond that coming from the bandwidth limits (if any).
Likewise, John McCalpin's Stream benchmark uses dense vectors IIRC,
but is memory bandwidth limited.
Michael S <already5chosen@yahoo.com> writes:
On Mon, 23 Feb 2026 08:06:20 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
The recent comparisons of branchy vs branchless binary search
that we carried on RWT forum seems to suggest that on modern CPUs
branchless variant is faster even when the table does not fit in
LLC.
Only two explanations come to my mind:
1) The M3 has a hardware prefetcher that recognizes the pattern of
a binary array search and prefetches accordingly. The cache
misses from page table accesses might confuse the prefetcher,
leading to worse performance eventually.
Coffee Lake certainly has no such prefetcher and nevertheless
exhibits similar behavior.
I have now looked more closely, and the npoints parameter has a
significant influence. If it is small enough, branchless fits into
caches while branchy does not (see below), which might be one
explanation for the results. Given that I have not seen npoints
specified in any of the postings that mention branchless winning for
veclen sizes exceeding the L3 cache size, it could be this effect.
Or
maybe some other effect. Without proper parameters one cannot
reproduce the experiment to investigate it.
I tried to run your code
<https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
on Zen4, but clang-14 converts uut2.c into branchy code. I could
not get gcc-12 to produce branchless code from slightly adapted
source code, either. My own attempt of using extended asm did not
pass your sanity checks, so eventually I used the assembly code
produced by clang-19 through godbolt.
I suppose that you have good reason for avoiding installation of
clang17 or later on one of your computers.
Sure. It costs time.
If you have recommendations what to use for the other parameters, I
can run other sizes as well.
- anton
Run for every size from 100K to 2G in increments of x sqrt(2).
There is no integer n where 2G=100k*sqrt(2)^n. So I used the numbers
shown below. You did not give any indication on npoints, so I
investigated myself, and found that branchless will miss the L3 cache
with npoints=100000, so I used that and used reps=200.
BTW, I prefer odd number of iterations.
You mean odd reps? Why?
Anyway, here are the usec/point numbers:
Zen4 8700G Tiger Lake 1135G7
veclen branchless branchy branchless branchy
100000 0.030945 0.063620 0.038220 0.080542
140000 0.031035 0.068244 0.034315 0.084896
200000 0.038302 0.073819 0.045602 0.089972
280000 0.037056 0.079651 0.042108 0.096271
400000 0.046685 0.081895 0.055457 0.104561
560000 0.043028 0.088356 0.055095 0.113646
800000 0.051180 0.092201 0.074570 0.123403
1120000 0.048806 0.096621 0.088121 0.142758
1600000 0.060206 0.101069 0.131099 0.172171
2240000 0.073547 0.115428 0.167353 0.205602
3200000 0.094561 0.139996 0.208903 0.234939
4500000 0.121049 0.162757 0.244457 0.268286
6400000 0.152417 0.178611 0.292024 0.295204
9000000 0.189134 0.192100 0.320426 0.327127
12800000 0.219408 0.208083 0.372084 0.353530
18000000 0.237684 0.222140 0.418645 0.389785
25000000 0.270798 0.236786 0.462689 0.415937
35000000 0.296994 0.254001 0.526235 0.451467
50000000 0.330582 0.268768 0.599331 0.478667
70000000 0.356788 0.288526 0.622659 0.522092
100000000 0.388326 0.305980 0.698470 0.562841
140000000 0.407774 0.321496 0.737884 0.609814
200000000 0.442434 0.336242 0.848403 0.654830
280000000 0.455125 0.356382 0.902886 0.729970
400000000 0.496894 0.372735 1.120986 0.777920
560000000 0.520664 0.393827 1.173606 0.855461
800000000 0.544343 0.412087 1.759271 0.901011
1100000000 0.584389 0.431854 1.862866 0.965724
1600000000 0.614764 0.455844 2.046371 1.027111
2000000000 0.622513 0.467445 2.149251 1.089775
So branchy surpasses branchless at veclen=12.8M on both machines, for npoints=100k.
Concerning the influence of npoints, I have worked with veclen=20M in
the following.
1) If <npoints> is small, for branchy the branch predictor will
learn the pattern on the first repetition, and predict correctly in
the following repetitions; on Zen4 and Tiger Lake I see the
following percentage of branch predictions (for
<npoints>*<rep>=20_000_000, <veclen>=20_000_000):
npoints Zen4 Tiger Lake
250 0.03% 0.51%
500 0.03% 10.40%
1000 0.03% 13.51%
2000 0.07% 13.84%
4000 13.45% 12.61%
8000 15.26% 12.31%
16000 15.56% 12.26%
32000 15.60% 12.23%
80000 15.59% 12.24%
Tiger Lake counts slightly more branches than Zen4 (for the same
binary): 1746M vs. 1703M, but there is also a real lower number of
mispredictions on Tiger Lake for high npoints; at npoints=80000:
214M on Tiger Lake vs. 266M on Zen4. My guess is that in Zen4 the
mispredicts of the deciding branch interfere with the prediction of
the loop branches, and that the anti-interference measures on Tiger
Lake result in the branch predictor being less effective at
npoints=500..2000.
2) If <npoints> is small all the actually accessed array elements
will fit into some cache. Also, even if npoints and veclen are
large enough that they do not all fit, with a smaller <npoints> a
larger part of the accesses will happen to a cache, and a larger
part to a lower-level cache. With the same parameters as above,
branchy sees on a Ryzen 8700G (Zen4 with 16MB L3 cache) the
following numbers of ls_any_fills_from_sys.all_dram_io
(LLC-load-misses and l3_lookup_state.l3_miss result in <not
supported> on this machine), and on a Core i5-1135G7 (Tiger Lake with
8MB L3) the following number of LLC-load-misses:
branchy branchless
8700G 1135G7 8700G 1135G7
npoints fills LLC-load-misses fills LLC-load-misses
250 1_156_206 227_274 1_133_672 39_212
500 1_189_372 19_820_264 1_125_836 47_994
1000 1_170_829 125_727_181 1_130_941 96_516
2000 1_310_015 279_063_572 1_173_501 299_297
4000 73_528_665 452_147_169 1_151_042 5_661_917
8000 195_883_759 501_433_404 1_248_638 58_877_208
16000 301_559_180 511_420_040 2_688_530 101_811_222
32000 389_512_147 511_713_759 29_799_206 116_312_019
80000 402_131_449 512_460_513 91_276_752 118_762_341
The 16MB L3 of the 8700G cache has 262_144 cache lines, the 8MB L3
of the 1135G7 has 131_072 cache lines. How come branchy has so many
cache misses already at relatively low npoints? Each search
performs ~25 architectural memory accesses; in addition, in branchy
we see a number of misspeculated memory accesses for each
architectural access, resulting in the additional memory accesses.
For branchless the 8700G sees a ramp-up of memory accesses more than
the factor of 2 later compared to the 1135G7 that the differences in
cache sizes would suggest. The 1135G7 cache is divided into 4 2MB
slices, and accesses are assigned by physical address (i.e., the L3
cache does not function like an 8MB cache with a higher
associativity), but given the random-access nature of this workload,
I would expect the accesses to be distributed evenly across the
slices, with little bad effects from this cache organization.
Given the memory access numbers of branchy and branchless above, I
expected to see a speedup of branchy in those cases where branchless
has a lot of L3 misses, so I decided to use npoints=100k and rep=200
in the experiments further up. And my expectations turned out to be
right, at least on these two machines.
- anton
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/21/2026 8:18 AM, Anton Ertl wrote:
big snip
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
It is probably useful to distinguish between latency bound and bandwidth
bound.
If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
differences in the core play no role, either; they all have to wait
for memory.
For latency various forms of prefetching (by hardware or software) can
help.
Many occur in commercial (i.e. non-scientific) programs, such as
database systems. For example, imagine a company employee file (table),
with a (say 300-byte) record for each of its many thousands of employees
(each containing typical employee stuff). Now suppose someone wants to
know "What is the total salary of all the employees in the Sales
department?". There is no index on "department", but it is at a fixed
displacement within each record, so the code looks at each record, does a
trivial test on it, perhaps adds to a register, then goes to the next
record. This is almost certainly memory latency bound.
If the records are stored sequentially, either because the programming language supports that arrangement and the programmer made use of
that, or because the allocation happened in a way that resulted in
such an arrangement, stride-based prefetching will prefetch the
accessed fields and reduce the latency to the one due to bandwidth
limits.
If the records are stored randomly, but are pointed to by an array,
one can prefetch the relevant fields easily, again turning the problem
into a latency-bound problem. If, OTOH, the records are stored
randomly and are in a linked list, this problem is a case of
pointer-chasing and is indeed latency-bound.
BTW, thousands of employee records, each with 300 bytes, fit in the L2
or L3 cache of modern processors.
Michael S <already5chosen@yahoo.com> writes:
On Mon, 23 Feb 2026 08:06:20 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 22 Feb 2026 13:17:30 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Array and Matrix scientific codes with datasets bigger than
cache.
The dense cases are covered by stride-based hardware
predictors, so they are not "otherwise". I am not familiar
enough with sparse scientific codes to comment on whether they
are 1), 2), or "otherwise".
BLAS Level 3 is not particularly external/LLC bandwidth intensive
even without hardware predictors.
There are HPC applications that are bandwidth limited; that's why
they have the roofline performance model (e.g.,
<https://docs.nersc.gov/tools/performance/roofline/>).
Sure. But that's not what I would call "dense".
In my vocabulary "dense" starts at matmul(200x200,x200) or at LU
decomposition of a matrix of similar dimensions.
"Dense matrices" vs. "sparse matrices", not in terms of FLOPS/memory
access.
So adding two dense matrices tends to be memory bandwidth bound, but stride-based prefetchers help to avoid getting any extra latency
beyond that coming from the bandwidth limits (if any).
Likewise, John McCalpin's Stream benchmark uses dense vectors IIRC,
but is memory bandwidth limited.
Overwhelming majority of data served from
L2 cache.
That's with classic SIMD. It's possible that with AMX units it's
no longer true.
I very much doubt that. There is little point in adding an
instruction that slows down execution by turning it from
compute-bound to memory-bound.
It does not slow down the execution. To the contrary, it speeds it up
so much that speed of handling of L2 misses begins to matter.
Pay attention that it's just my speculation that can be wrong.
IIRC, right now binary64-capable AMX is available only on Apple
Silicon (via SME) and may be on IBM Z. I didn't play with either.
AMX is an Intel extension of AMD64 (and IA-32); ARM's SME also has
"matrix" in its name, but is not AMX.
Looking at
<https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions>, it seems
to me that the point of AMX is to deal with small matrices (16x64
times 64x16 for Int8, 16x32 times 32x16 for 16-bit types) of small
elements (INT8, BF16, FP16 and complex FP16 numbers) in a special
unit. Apparently the AMX unit in Granite Rapids consumes 2048 bytes
in 16 cycles, i.e., 128 bytes per cycle and produces 256 or 512 bytes
in these 16 cycles. If each of these matrix multiplications happens
on its own, the result will certainly be bandwidth-bound to L2
and maybe already to L1. If, OTOH, these operations are part of a
larger matrix multiplication, then cache blocking can probably lower
the bandwidth to L2 enough, and reusing one of the operands in
registers can lower the bandwidth to L1 enough.
In any case, Intel will certainly not add hardware that exceeds the
bandwidth boundaries in all common use cases.
- anton
On Tue, 24 Feb 2026 09:50:43 GMT[...]
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 23 Feb 2026 08:06:20 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I think that it was said more than once throughout the thread that all
measurements were taken with npoints=1M and rep=11.
And for small rep
even-size median has non-negligible bias.
Intuitively, I don't like measurements with such a small npoints parameter,
but objectively they are unlikely to be much different from 1M points.
On Tiger Lake the ratio is smaller than on others, probably because of
Intel's slow "mobile" memory controller
So far I don't see how your measurements disprove my theory of page
table not fitting in L2 cache.
If I find the time, I would like to see how branchless with software prefetching performs. And I would like to put all of that, including
your code, online. Do I have your permission to do so?
- anton
Let me better explain what I was trying to set up, then you can tell me
where I went wrong. I did expect the records to be sequential, and
that they could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
On Tue, 24 Feb 2026 10:46:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
The question is not whether bandwidth from L2 to AMX is sufficient. It
is.
The interesting part is what bandwidth from LLC *into* L2 is needed in
the scenario of multiplication of big matrices.
Supposedly 1/3rd of L2 is used for matrix A, which stays there for a very
long time, another 1/3rd holds matrix C, which remains there for, say, 8
iterations, and the remaining 1/3rd is matrix B, which must be reloaded on
every iteration. AMX increases the frequency at which we have to reload
B and C.
Assuming L2 = 2MB and square matrices, N = sqrt(2MB/8B/3) = 295,
rounded down to 288. Each step takes 288**3 = ~24M FMA operations. If a
future AMX does 128 FMA per clock, then the iteration takes ~188K clocks.
288**2 * 8B / 188K = 3.5 B/clock. That's easy for 1 AMX/L2 combo, but
very hard when you have 64 units like those competing on the same
LLC.
Michael S <already5chosen@yahoo.com> writes:
On Tue, 24 Feb 2026 09:50:43 GMT[...]
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 23 Feb 2026 08:06:20 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I think that it was said more than once throughout the thread that
all measurements were taken with npoints=1M and rep=11.
I obviously missed it, that's why I asked for the parameters earlier.
And for small rep
even-size median has non-negligible bias.
Even-size median is the arithmetic mean between the (n-1)/2-lowest and
the (n+1)/2-lowest result. What bias do you have in mind?
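For concreteness, the even-size median just described can be computed like this (a sketch, not the benchmark's actual harness; for odd n the two middle indices coincide):

```c
#include <stdlib.h>

static int cmp_double(const void *pa, const void *pb) {
    double a = *(const double *)pa, b = *(const double *)pb;
    return (a > b) - (a < b);
}

/* Median of n timing samples: sort, then average the two middle
   elements.  For even n this is the mean of the two central values;
   for odd n both indices pick the same element. */
double median(double *x, size_t n) {
    qsort(x, n, sizeof *x, cmp_double);
    return (x[(n - 1) / 2] + x[n / 2]) / 2.0;
}
```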
Here are the results with npoints=1M reps=11:
Zen4 8700G Tiger Lake 1135G7
veclen branchless branchy branchless branchy
100000 0.030187 0.063023 0.038614 0.081332
140000 0.029779 0.066112 0.034797 0.085259
200000 0.037532 0.072898 0.045945 0.090583
280000 0.036677 0.078795 0.042678 0.097386
400000 0.045777 0.084639 0.055256 0.104249
560000 0.043092 0.086424 0.055677 0.112252
800000 0.050325 0.091714 0.078172 0.123927
1120000 0.048855 0.095756 0.091508 0.141819
1600000 0.061101 0.099660 0.133560 0.168403
2240000 0.082970 0.113055 0.166363 0.205129
3200000 0.120834 0.137753 0.211253 0.235155
4500000 0.149028 0.160359 0.241716 0.268885
6400000 0.178675 0.177070 0.294336 0.294931
9000000 0.219919 0.195292 0.322921 0.326113
12800000 0.240604 0.209397 0.380076 0.352707
18000000 0.257400 0.224952 0.409813 0.388834
25000000 0.293160 0.239217 0.472999 0.413559
35000000 0.307939 0.257587 0.522223 0.450754
50000000 0.346399 0.271420 0.595265 0.478713
70000000 0.375525 0.286462 0.626350 0.516533
100000000 0.401462 0.305444 0.699097 0.555928
140000000 0.419590 0.322480 0.732551 0.611815
200000000 0.451003 0.337880 0.854034 0.647795
280000000 0.468193 0.359133 0.914159 0.729234
400000000 0.506300 0.372219 1.150605 0.773266
560000000 0.520346 0.390277 1.167567 0.851592
800000000 0.556292 0.412035 1.799293 0.899768
1100000000 0.572654 0.432598 1.861136 0.963649
1600000000 0.618735 0.450526 2.062255 1.026560
2000000000 0.629088 0.465219 2.132035 1.073728
Intuitively, I don't like measurements with such a small npoints
parameter, but objectively they are unlikely to be much different
from 1M points.
It's pretty similar, but the first point where branchy is faster is veclen=6.4M for the 8700G and still 12.8M for the 1135G7; on the
1135G7 the 6.4M and 9M veclens are pretty close for both npoints.
My current theory why the crossover is so late is twofold:
1) branchy performs a lot of misspeculated accesses, which reduce the
cache hit rate (and increasing the cache miss rate) already at
relatively low npoints (as shown in <2026Feb24.105043@mips.complang.tuwien.ac.at>), and likely also for
low veclen with high npoints. E.g., already npoints=4000 has >2M
fills from memory on the 8700G, and npoints=500 has >2M
LLC-load-misses on the 1135G7, whereas the corresponding numbers for
branchless are npoints=16000 and npoints=4000. This means that
branchy does not just suffer from the misprediction penalty, but also
from fewer cache hits for the middle stages of the search. How strong
this effect is depends on the memory subsystem of the CPU core.
2) In the last levels of the larger searches the better memory-level parallelism of branchy leads to branchy catching up and eventually
overtaking branchless, but it has to first compensate for the slowness
in the earlier levels before it can reach crossover. That's why we
see the crossover point at significantly larger sizes than L3.
On Tiger Lake the ratio is smaller than on others, probably because of
Intel's slow "mobile" memory controller
This particular machine has 8GB DDR4 soldered in plus 32GB DDR4 on a
separate DIMM, so the memory controller may be faultless.
So far I don't see how your measurements disprove my theory of page
table not fitting in L2 cache.
I did not try to disprove that; if I did, I would try to use huge
pages (the 1G ones if possible) and see how that changes the results.
But somehow I fail to see why the page table walks should make much of
a difference.
If I find the time, I would like to see how branchless with software prefetching performs. And I would like to put all of that, including
your code, online. Do I have your permission to do so?
- anton
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Let me better explain what I was trying to set up, then you can tell me
where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level parallelism of the hardware. If the records are in RAM, the hardware prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
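The distinction between independent and dependent accesses can be sketched as follows (illustrative only):

```c
#include <stddef.h>

/* Independent accesses: every address is computable up front, so the
   core (and the hardware prefetcher) can keep many misses in flight;
   throughput is set by bandwidth, not by a single miss latency. */
double sum_array(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];               /* the load of a[i+1] need not wait for a[i] */
    return s;
}

/* Dependent accesses: the next address comes out of the current load,
   so misses serialize and the loop is latency-bound pointer chasing. */
struct node { struct node *next; double val; };

double sum_list(const struct node *p) {
    double s = 0.0;
    for (; p != NULL; p = p->next)
        s += p->val;             /* p->next must arrive before the next load */
    return s;
}
```

The record scan in the example is shaped like the first loop, which is why it hits the bandwidth limit rather than a latency limit.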
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Let me better explain what I was trying to set up, then you can tell me
where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
I think that it's bandwidth-bound, because none of the memory (or outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level parallelism of the hardware. If the records are in RAM, the hardware prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
I think our difference may be just terminology rather than substance.
To me, it is precisely the limit you mentioned that makes it latency
rather than bandwidth limited. Think of it this way. In the current situation, increasing the memory system bandwidth, say by hypothetically increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand, doing things to reduce the memory latency (say hypothetically a faster
ram cell), would improve the performance. To me, that is the definition
of being latency bound, not bandwidth bound.
Perhaps this distinction is clearer to me due to my background in the
(hard) disk business. You want lower latency? Make the arm move faster
or spin the disk faster. You want higher bandwidth? Put more bits on a track or interleave the data across multiple disk heads. And in a
system, the number of active prefetches is naturally limited by the
number of disk arms you have.
On 2/24/2026 5:25 AM, Anton Ertl wrote:
big snip
FWIW:
IME, code with fairly random access patterns to memory, and lots of
cache misses, is inherently slow; even on big/fancy OoO chips.
Seemingly about the only real hope the CPU has is to have a large cache and just
hope that the data happens to be in the cache (and has been accessed previously or sufficiently recently) else it is just kinda SOL.
If there is some way that CPUs can guess what memory they need in
advance and fetch it beforehand, I have not seen much evidence of this
personally.
Rather, as can be noted, memory access patterns can often make a fairly large impact on the performance of some algorithms.
Like, for example, decoding a PNG like format vs a JPG like format:
PNG decoding typically processes the image as several major phases:
Decompress the Deflate-compressed buffer into memory;
Walk over the image, running scanline filters,
copying scanlines into a new (output) buffer.
With matrices of just the right size, one can achieve a TLB miss on
every 8th access.
On 2/24/2026 5:25 AM, Anton Ertl wrote:
big snip
FWIW:
IME, code with fairly random access patterns to memory, and lots of
cache misses, is inherently slow; even on big/fancy OoO chips.
Seemingly about the only real hope the CPU has is to have a large
cache and just hope that the data happens to be in the cache (and has
been accessed previously or sufficiently recently) else it is just
kinda SOL.
If there is some way that CPUs can guess what memory they need in
advance and fetch it beforehand, I have not seen much evidence of this
personally.
Rather, as can be noted, memory access patterns can often make a
fairly large impact on the performance of some algorithms.
Like, for example, decoding a PNG like format vs a JPG like format:
  PNG decoding typically processes the image as several major phases:
    Decompress the Deflate-compressed buffer into memory;
    Walk over the image, running scanline filters,
      copying scanlines into a new (output) buffer.
Could you have a secondary thread that started as soon as one (or a
small number of) scanline(s) were available, taking advantage of any
shared $L3 cache to grab the data before it is blown away?
Even if the parts, taken in isolation, should be fast:
  The image buffers are frequently too large to fit in cache;
  Cache misses tend to make PNG decoding painfully slow,
    even when using faster filters.
    If using the Paeth filter though, this adds extra slowness,
      due to branch-predictor misses.
      On targets like x86,
        the filter is frequently implemented using branches;
        The branch miss rate is very high.
        So, a naive branching version, performs like dog crap.
This reminds me of CABAC decoding in h264, where the output of the
So, net result: despite its conceptual simplicity, PNG's decode-time
performance typically sucks.
In contrast, a decoder for a JPEG-like format can be made to process one
block at a time and go all the way to final output. So, JPEG is often
faster despite the more complex process (with transform stages and a
colorspace transform).
The Paeth filter slowness does seem a little odd though:
  Theoretically, a CPU could turn a short forward branch into predication;
  But, this doesn't tend to be the case.
  It then is faster to turn the filter into some convoluted mess of
  arithmetic and masking in an attempt to reduce the branch mispredict costs.
I would look for a way to handle multiple pixels at once, with SIMD
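The "convoluted mess of arithmetic and masking" can look like this for the Paeth predictor (a sketch of one possible branch-free formulation, not any particular library's code):

```c
#include <stdlib.h>

/* Straightforward (branchy) Paeth predictor as PNG specifies it:
   pick whichever of left (a), up (b), up-left (c) is closest to
   p = a + b - c, preferring a, then b, then c on ties. */
static unsigned char paeth(unsigned char a, unsigned char b, unsigned char c) {
    int p = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Branch-free variant: note p-a == b-c, p-b == a-c, p-c == a+b-2c,
   then select the winner with all-ones/all-zeros masks instead of
   branches, so there is nothing for the predictor to miss. */
static unsigned char paeth_nobranch(unsigned char a, unsigned char b,
                                    unsigned char c) {
    int pa = abs(b - c), pb = abs(a - c), pc = abs(a + b - 2 * c);
    int ma = -((pa <= pb) & (pa <= pc));   /* -1 if a wins, else 0 */
    int mb = -(pb <= pc) & ~ma;            /* -1 if b wins and a lost */
    return (unsigned char)((a & ma) | (b & mb) | (c & ~(ma | mb)));
}
```

SIMD versions of PNG filters apply the same mask-select idea across a whole vector of pixels at once.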
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Let me better explain what I was trying to set up, then you can tell me >>>> where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level
parallelism of the hardware. If the records are in RAM, the hardware
prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
I think our difference may be just terminology rather than substance.
To me, it is precisely the limit you mentioned that makes it latency
rather than bandwidth limited.
I mentioned several limits. Which one do you have in mind?
the limit of memory-level parallelism of the hardware.
Think of it this way. In the current
situation, increasing the memory system bandwidth, say by hypothetically
increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand,
doing things to reduce the memory latency (say hypothetically a faster
ram cell), would improve the performance.
If the CPU is designed to provide enough memory-level parallelism to
make use of the bandwidth (and that is likely, otherwise why provide
that much bandwidth), then once the designers spend money on
increasing the bandwidth, they will also spend the money necessary to increase the MLP.
Concerning a reduction in latency, that would not
increase performance, because this application is already working at
the bandwidth limit.
I feel the urge to write up a mock variant of your use case and
measure whether reality confirms my expectations, but currently have
no time for that.
But let's take a slightly simpler case for which I already have a
program:
On Tue, 24 Feb 2026 17:30:31 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
This particular machine has 8GB DDR4 soldered in plus 32GB DDR4 on a
separate DIMM, so the memory controller may be faultless.
Do you mean that dealing with two different types of memory is an
objectively hard job?
Or that the latency of the 1135G7 is not too bad?
It all looks like memory latency on this TGL is ~90ns.
If I find the time, I would like to see how branchless with software
prefetching performs. And I would like to put all of that, including
your code, online. Do I have your permission to do so?
- anton
Doesn't SW prefetch turn itself into a NOP on a TLB miss?
BGB wrote:
On 2/24/2026 5:25 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/21/2026 8:18 AM, Anton Ertl wrote:
big snip
Otherwise what kind of common code do we have that is
memory-dominated?-a Tree searching and binary search in arrays come to >>>>> mind, but are they really common, apart from programming classes?
It is probably useful to distinguish between latency bound and
bandwidth
bound.
If a problem is bandwidth-bound, then differences between conventional
architectures and EPIC play no role, and microarchitectural
differences in the core play no role, either; they all have to wait
for memory.
For latency various forms of prefetching (by hardware or software) can
help.
Many occur in commercial (i.e. non scientific) programs, such as
database systems.-a For example, imagine a company employee file
(table),
with a (say 300 byte) record for each of its many thousands of
employees
each containing typical employee stuff).-a Now suppose someone wants to >>>> know "What is the total salary of all the employees in the "Sales"
department.-a With no index on "department", but it is at a fixed
displacement within each record, the code looks at each record, does a >>>> trivial test on it, perhaps adds to a register, then goes to the next
record.-a This it almost certainly memory latency bound.
If the records are stored sequentially, either because the programming
language supports that arrangement and the programmer made use of
that, or because the allocation happened in a way that resulted in
such an arrangement, stride-based prefetching will prefetch the
accessed fields and reduce the latency to the one due to bandwidth
limits.
If the records are stored randomly, but are pointed to by an array,
one can prefetch the relevant fields easily, again turning the problem
into a bandwidth-bound problem. If, OTOH, the records are stored
randomly and are in a linked list, this problem is a case of
pointer-chasing and is indeed latency-bound.
BTW, thousands of employee records, each with 300 bytes, fit in the L2
or L3 cache of modern processors.
FWIW:
IME, code with fairly random access patterns to memory, and lots of
cache misses, is inherently slow; even on big/fancy OoO chips.
Seemingly about the only real hope the CPU has is to have a large
cache and just hope that the data happens to be in the cache (and has
been accessed previously or sufficiently recently) else it is just
kinda SOL.
If there is some way that CPUs can guess what memory they need in
advance and fetch it beforehand, I have not seen much evidence of it
personally.
Rather, as can be noted, memory access patterns can often make a
fairly large impact on the performance of some algorithms.
Like, for example, decoding a PNG like format vs a JPG like format:
  PNG decoding typically processes the image as several major phases:
    Decompress the Deflate-compressed buffer into memory;
    Walk over the image, running scanline filters,
      copying scanlines into a new (output) buffer.
Could you have a secondary thread that started as soon as one (or a
small number of) scanline(s) were available, taking advantage of any
shared L3 cache to grab the data before it is blown away?
Even if the parts, taken in isolation, should be fast:
  The image buffers are frequently too large to fit in cache;
  Cache misses tend to make PNG decoding painfully slow,
    even when using faster filters.
    Using the Paeth filter, though, adds extra slowness,
      due to branch-predictor misses.
      On targets like x86,
        the filter is frequently implemented using branches;
        the branch miss rate is very high,
        so a naive branching version performs like dog crap.
This reminds me of CABAC decoding in h264, where the output of the arithmetic decoder is single bits that by definition cannot be
predictable, but the codec typically uses that bit to branch.
So, net result: Despite its conceptual simplicity, PNG's decode-time
performance typically sucks.
In contrast, a decoder for a JPEG-like format can be made to process
one block at a time and go all the way to final output. So JPEG is
often faster despite the more complex process (with transform stages
and a colorspace transform).
The Paeth filter slowness does seem a little odd though:
theoretically, a CPU could turn a short forward branch into
predication, but this doesn't tend to happen.
It is then faster to turn the filter into some convoluted mess of
arithmetic and masking in an attempt to reduce the branch-mispredict
costs.
I would look for a way to handle multiple pixels at once, with SIMD
code: There the masking/combining is typically the easiest way to
implement short branches.
(I might take a look a png decoding at some point)
Terje
On 2/28/2026 9:48 AM, Terje Mathisen wrote:
This reminds me of CABAC decoding in h264, where the output of the
arithmetic decoder is single bits that by definition cannot be
predictable, but the codec typically uses that bit to branch.
Yeah.
Making arithmetic and range coders fast is also hard.
I don't often use them as much because I am not aware of a good way to
make them fast.
This is part of why I had often ended up going for STF+AdRice or
similar, which, while not the best in terms of compression, can be one
of the faster options in many cases.
Theoretically, table-driven Huffman could be faster, but it likewise
often suffers from cache misses (cycles lost to cache misses can
outweigh the cost of the more complex logic of an AdRice coder).
Huffman speed can be improved by reducing maximum symbol length and
table size, but then this can lose much of its compression advantage.
Say, max symbol length:
  10/11: Too short, limits effectiveness;
  12: OK, leans faster;
  13: OK, leans better compression;
  14: Intermediate;
  15: Slower still (Deflate is here);
  16: Slower (T.81 JPEG is here).
Where, for 12/13 bits, the fastest strategy is typically to use a
single big lookup table for the entropy decoder.
For 15 or 16 bits, it is often faster to have a separate fast path and
slow path. Say, the fast path matches on the first 9 or 10 bits, and
then the slow path falls back to a linear search (over the longer
symbols). In this case, the relative slowness of falling back to a
linear search is less than the cost of the L1 misses from a bigger
lookup table.
The relative loss of Huffman coding efficiency between a 13 bit limit
and 15 bit limit is fairly modest.
I would look for a way to handle multiple pixels at once, with SIMD
code: There the masking/combining is typically the easiest way to
implement short branches.
(I might take a look a png decoding at some point)
#if 0  //naive version, pays a lot for branch penalties
int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
{
    int p, pa, pb, pc;
    p=a+b-c;
    pa=(p>a)?(p-a):(a-p);
    pb=(p>b)?(p-b):(b-p);
    pc=(p>c)?(p-c):(c-p);
    p=(pa<=pb)?((pa<=pc)?a:c):((pb<=pc)?b:c);
    return(p);
}
#endif
#if 1  //avoid branch penalties
int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
{
    int p, pa, pb, pc;
    int ma, mb, mc;
    p=a+b-c;
    pa=p-a;         pb=p-b;         pc=p-c;
    ma=pa>>31;      mb=pb>>31;      mc=pc>>31;
    pa=(pa^ma)-ma;  pb=(pb^mb)-mb;  pc=(pc^mc)-mc;  //(x^m)-m: true abs
    ma=pb-pa;       mb=pc-pb;       mc=pc-pa;
    ma=ma>>31;      mb=mb>>31;      mc=mc>>31;
    p=(ma&((mb&c)|((~mb)&b))) | ((~ma)&((mc&c)|((~mc)&a)));
    return(p);
}
#endif
Where, the Paeth filter is typically the most heavily used filter in PNG decoding (because it tends to be the most accurate), but also the slowest.
Could in theory be SIMD'ed to maybe work on RGB or RGBA in parallel.
On 2/27/2026 1:52 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Let me better explain what I was trying to set up, then you can tell
me where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency
bound.
I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level
parallelism of the hardware. If the records are in RAM, the hardware
prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
I think our difference may be just terminology rather than substance.
To me, it is precisely the limit you mentioned that makes it latency
rather than bandwidth limited.
I mentioned several limits. Which one do you have in mind?
The one you mentioned in your last paragraph, specifically,
the limit of memory-level parallelism of the hardware.
Think of it this way. In the current situation, increasing the memory
system bandwidth, say by hypothetically increasing the number of
memory banks, having a wider interface between the memory and the
core, etc., all traditional methods for increasing memory bandwidth,
would not improve the performance. On the other hand, doing things to
reduce the memory latency (say hypothetically a faster ram cell),
would improve the performance.
If the CPU is designed to provide enough memory-level parallelism to
make use of the bandwidth (and that is likely, otherwise why provide
that much bandwidth), then once the designers spend money on
increasing the bandwidth, they will also spend the money necessary to increase the MLP.
No. The memory system throughput depends upon the access pattern. It
is easier/lower cost to increase the throughput for sequential
accesses than for random ones (think wider interfaces, cache blocks
larger than the amount accessed, etc.). But optimization for
sequential workloads can actually hurt performance for random
workloads; e.g., larger block sizes reduce the number of accesses for
sequential workloads, but each access takes longer, thus hurting
random workloads. So designers aim to maximize the throughput,
subject to cost and technology constraints, for some mix of sequential
(bandwidth) versus random (latency) access.
BGB wrote:
On 2/28/2026 9:48 AM, Terje Mathisen wrote:
This reminds me of CABAC decoding in h264, where the output of the
arithmetic decoder is single bits that by definition cannot be
predictable, but the codec typically uses that bit to branch.
Yeah.
Making arithmetic and range coders fast is also hard.
I don't often use them as much because I am not aware of a good way to
make them fast.
This is part of why I had often ended up going for STF+AdRice or
similar, which, while not the best in terms of compression, can be one
of the faster options in many cases.
Theoretically, table-driven Huffman could be faster, but likewise
often suffers from cache misses (cycles lost to cache misses can
outweigh the cost of the more complex logic of an AdRice coder).
Huffman speed can be improved by reducing maximum symbol length and
table size, but then this can lose much of its compression advantage.
Say, max symbol length:
  10/11: Too short, limits effectiveness;
  12: OK, leans faster;
  13: OK, leans better compression;
  14: Intermediate;
  15: Slower still (Deflate is here);
  16: Slower (T.81 JPEG is here).
Where, for 12/13 bits, the fastest strategy is typically to use a
single big lookup table for the entropy decoder.
For 15 or 16 bits, it is often faster to have a separate fast-path and
slow path. Say, fast path matches on the first 9 or 10 bits, and then
the slow path falls back to a linear search (over the longer symbols).
I have looked at multi-level table lookups, where the symbol either is
the one you want (short codes) or an index into a list of secondary
tables to be used on the remaining bits.
When you have many really short codes (think Morse!), then you can
profitably decode multiple in a single iteration.
In this case, the relative slowness of falling back to a linear search
being less than that of the cost of the L1 misses from a bigger lookup
table.
The relative loss of Huffman coding efficiency between a 13 bit limit
and 15 bit limit is fairly modest.
Yeah.
I would look for a way to handle multiple pixels at once, with SIMD
code: There the masking/combining is typically the easiest way to
implement short branches.
(I might take a look a png decoding at some point)
#if 0  //naive version, pays a lot for branch penalties
int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
{
    int p, pa, pb, pc;
    p=a+b-c;
    pa=(p>a)?(p-a):(a-p);
    pb=(p>b)?(p-b):(b-p);
    pc=(p>c)?(p-c):(c-p);
    p=(pa<=pb)?((pa<=pc)?a:c):((pb<=pc)?b:c);
    return(p);
}
#endif
#if 1  //avoid branch penalties
int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
{
    int p, pa, pb, pc;
    int ma, mb, mc;
    p=a+b-c;
    pa=p-a;         pb=p-b;         pc=p-c;
    ma=pa>>31;      mb=pb>>31;      mc=pc>>31;
    pa=(pa^ma)-ma;  pb=(pb^mb)-mb;  pc=(pc^mc)-mc;  //(x^m)-m: true abs
    ma=pb-pa;       mb=pc-pb;       mc=pc-pa;
    ma=ma>>31;      mb=mb>>31;      mc=mc>>31;
    p=(ma&((mb&c)|((~mb)&b))) | ((~ma)&((mc&c)|((~mc)&a)));
    return(p);
}
#endif
Where, the Paeth filter is typically the most heavily used filter in
PNG decoding (because it tends to be the most accurate), but also the
slowest.
Could in theory be SIMD'ed to maybe work on RGB or RGBA in parallel.
OK
On 3/1/2026 12:02 PM, Terje Mathisen wrote:
BGB wrote:
On 2/28/2026 9:48 AM, Terje Mathisen wrote:
This reminds me of CABAC decoding in h264, where the output of the
arithmetic decoder is single bits that by definition cannot be
predictable, but the codec typically uses that bit to branch.
Yeah.
Making arithmetic and range coders fast is also hard.
I don't often use them as much because I am not aware of a good way to
make them fast.
This is part of why I had often ended up going for STF+AdRice or
similar, which, while not the best in terms of compression, can be one
of the faster options in many cases.
Theoretically, table-driven Huffman could be faster, but likewise
often suffers from cache misses (cycles lost to cache misses can
outweigh the cost of the more complex logic of an AdRice coder).
Huffman speed can be improved by reducing maximum symbol length and
table size, but then this can lose much of its compression advantage.
Say, max symbol length:
  10/11: Too short, limits effectiveness;
  12: OK, leans faster;
  13: OK, leans better compression;
  14: Intermediate;
  15: Slower still (Deflate is here);
  16: Slower (T.81 JPEG is here).
Where, for 12/13 bits, the fastest strategy is typically to use a
single big lookup table for the entropy decoder.
For 15 or 16 bits, it is often faster to have a separate fast-path and
slow path. Say, fast path matches on the first 9 or 10 bits, and then
the slow path falls back to a linear search (over the longer symbols).
I have looked at multi-level table lookups, where the symbol either is
the one you want (short codes) or an index into a list of secondary
tables to be used on the remaining bits.
Can work OK if one assumes all of the longer codes are prefixed by a
longish series of 1s, which is usually but not necessarily true.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Let me better explain what I was trying to set up, then you can tell
me where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency
bound.
I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level
parallelism of the hardware. If the records are in RAM, the hardware
prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
I think our difference may be just terminology rather than substance.
To me, it is precisely the limit you mentioned that makes it latency
rather than bandwidth limited. Think of it this way. In the current
situation, increasing the memory system bandwidth, say by hypothetically
increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand,
doing things to reduce the memory latency (say hypothetically a faster
ram cell), would improve the performance. To me, that is the definition
of being latency bound, not bandwidth bound.
I agree with your definition, but my prediction is somewhat different.
First, consider a silly program that goes sequentially over a larger
array, accessing all lines. AFAICS you should see a tiny effect when
the program uses only one byte from each line compared to using the
whole line. Now consider a variant that accesses every fifth line.
There are differences: one is that the prefetcher needs to realize
that there is no need to prefetch the intermediate lines. A second
difference is that one can fetch lines quickly only when they are on a
single page. Having "step 5" on lines means 5 times as many page
crossings. I do not know how big pages are in modern DRAM, but at a
step large enough you will see significant delay due to page
crossings. I would tend to call this delay "latency", but it is
somewhat murky. Namely, with enough prefetch and enough memory banks
you can still saturate a single channel to the core (assuming that
there are many cores, and many channels from the memory controller to
memory banks but only a single channel between the memory controller
and each core). Of course, modern systems tend to have a limited
number of memory banks, so the argument above is purely theoretical.
Somewhat different case is when there are independent loads from
random locations, something like
for(i = 0; i < N; i++) {
s += m[f(i)];
}
where 'f' is very cheap to compute, but hard to predict by the
hardware. In the case above the reorder buffer and multiple banks
help, but even with unlimited CPU resources the maximal access rate is
the number of memory banks divided by the access time of a single bank
(that is essentially the latency of the memory array).
Then there is pointer chasing case, like
for(i = 0; i < N; i++) {
j = m[j];
}
When 'm' is filled with a semi-random cyclic pattern this behaves
quite badly; basically you can start the next access only when you
have the result of the previous access. In practice, a large 'm'
seems to produce a large number of cache misses on TLB entries as
well.
Perhaps this distinction is clearer to me due to my background in the
(hard) disk business. You want lower latency? Make the arm move faster
or spin the disk faster. You want higher bandwidth? Put more bits on a
track or interleave the data across multiple disk heads. And in a
system, the number of active prefetches is naturally limited by the
number of disk arms you have.
That disk analogy is flawed. AFAIK there is no penalty for choosing
"far away" pages compared to "near" ones (if anything the opposite:
Row Hammer shows that accesses to "near" pages mean that a given page
may need more frequent refresh). In the case of memory, time spent in
the memory controller is non-negligible, at least for accesses within
a single page. AFAIK random access to lines within a single page
costs no more than sequential access; for disk you want sequential
access to a single track.
On 2/28/2026 1:49 PM, Waldek Hebisch wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Somewhat different case is when there are independent loads from
random locations, something like
for(i = 0; i < N; i++) {
s += m[f(i)];
}
where 'f' is very cheap to compute, but hard to predict by the
hardware. In case above reorder buffer and multiple banks helps,
but even with unlimited CPU resurces maximal number of accesses
is number of memory banks divided by access time of single bank
(that is essentialy latency of memory array).
Then there is pointer chasing case, like
for(i = 0; i < N; i++) {
j = m[j];
}
when 'm' is filled with semi-random cyclic pattern this behaves
quite badly, basically you can start next access only when you
have result of the previous access. In practice, large 'm'
seem to produce large number of cache misses for TLB entries.
Well, the examples I gave got confusing (my fault) because, as Anton
pointed out, the table I used would fit into the L3 cache on many
modern systems. So this all got tied up in the difference between
in-cache results and in-DRAM results. I don't disagree with your
points, but they are tangential to the point I was trying to make.
Let me try again. Suppose you had a (totally silly) program with a
2 GB array, and you used a random number generator to pick an address
within it, then added the value at that addressed byte to an
accumulator. Repeat, say, 10,000 times. I would call this program
latency bound, but I suspect Anton would call it bandwidth bound. If
that is true, then that explains the original differences Anton and I
had.
Perhaps this distinction is clearer to me due to my background in the
(hard) disk business. You want lower latency? Make the arm move faster
or spin the disk faster. You want higher bandwidth? Put more bits on a
track or interleave the data across multiple disk heads. And in a
system, the number of active prefetches is naturally limited by the
number of disk arms you have.
That disk analogy is flawed. AFAIK there is no penalty for choosing
"far away" pages compared to "near" ones (if anything the opposite:
Row Hammer shows that accesses to "near" pages mean that a given page
may need more frequent refresh). In the case of memory, time spent in
the memory controller is non-negligible, at least for accesses within
a single page. AFAIK random access to lines within a single page
costs no more than sequential access; for disk you want sequential
access to a single track.
Yes, but these are second order compared to the difference between
latency and bandwidth on a disk.
This whole thing has spiraled into something far beyond what I
expected, and what, I think, is useful. So unless you want me to, I
probably won't respond further.
BGB <cr88192@gmail.com> posted:
On 3/1/2026 12:02 PM, Terje Mathisen wrote:
BGB wrote:
On 2/28/2026 9:48 AM, Terje Mathisen wrote:
This reminds me of CABAC decoding in h264, where the output of the
arithmetic decoder is single bits that by definition cannot be
predictable, but the codec typically uses that bit to branch.
Yeah.
Making arithmetic and range coders fast is also hard.
I don't often use them as much because I am not aware of a good way to
make them fast.
This is part of why I had often ended up going for STF+AdRice or
similar, which, while not the best in terms of compression, can be one
of the faster options in many cases.
Theoretically, table-driven Huffman could be faster, but likewise
often suffers from cache misses (cycles lost to cache misses can
outweigh the cost of the more complex logic of an AdRice coder).
Huffman speed can be improved by reducing maximum symbol length and
table size, but then this can lose much of its compression advantage.
Say, max symbol length:
  10/11: Too short, limits effectiveness;
  12: OK, leans faster;
  13: OK, leans better compression;
  14: Intermediate;
  15: Slower still (Deflate is here);
  16: Slower (T.81 JPEG is here).
Where, for 12/13 bits, the fastest strategy is typically to use a
single big lookup table for the entropy decoder.
For 15 or 16 bits, it is often faster to have a separate fast path and
slow path. Say, the fast path matches on the first 9 or 10 bits, and
then the slow path falls back to a linear search (over the longer
symbols).
I have looked at multi-level table lookups, where the symbol either is
the one you want (short codes) or an index into a list of secondary
tables to be used on the remaining bits.
Can work OK if one assumes all of the longer codes are prefixed by a
longish series of 1s, which is usually but not necessarily true.
In HW any pattern can be used. In SW only patterns that are almost
satisfied by the current ISA can be considered. Big difference.