Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
[fedora-starfive:/tmp:111378] cat x.c
#include <string.h>
long uload(long *p)
{
long x;
memcpy(&x,p,sizeof(long));
return x;
}
[fedora-starfive:/tmp:111379] gcc -O -S x.c
[fedora-starfive:/tmp:111380] cat x.s
.file "x.c"
.option nopic
.text
.align 1
.globl uload
.type uload, @function
uload:
addi sp,sp,-16
lbu t1,0(a0)
[...]
With RISC-V, nobody ever knows what architecture he is compiling for...
Did you tell gcc specifically that unaligned access was supported in
the architecture you were using?
But even assuming that I want to generate code tuned for RISC-V implementations where unaligned accesses are implemented so slowly
that I would prefer that code containing only aligned accesses is
generated, I would expect a compiler for which the memcpy workaround
is recommended (such as gcc) to do better, much better than gcc
actually does, e.g., something along the lines of:
http://gcc.gnu.org/bugzilla is your friend.
MitchAlsup1 wrote:
Basically, VAX taught us why we did not want to do "all that" in
a single instruction; while Intel 432 taught us why we did not want
bit-aligned decoders (and a lot of other things).
In case people are interested...
[paywalled]
The Instruction Decoding Unit for the VLSI 432 General Data Processor,
1981
https://ieeexplore.ieee.org/abstract/document/1051633/
The benchmarks in table 1(a) below tell it all:
a 4 MHz 432 is 1/15 to 1/20 the speed of a 5 MHz VAX/780,
and 1/4 to 1/7 the speed of an 8 MHz 68000 or 5 MHz 8086
A Performance Evaluation of The Intel iAPX 432, 1982 https://dl.acm.org/doi/pdf/10.1145/641542.641545
And the reasons are covered here:
Performance Effects of Architectural Complexity in the Intel 432, 1988 https://www.princeton.edu/~rblee/ELE572Papers/Fall04Readings/I432.pdf
Bob Colwell, one of the authors of the third paper, later joined
Intel as a senior architect and was involved in the development of the
P6 core used in the Pentium Pro, Pentium II, and Pentium III
microprocessors,
and designs derived from it are used in the Pentium M, Core Duo and
Core Solo, and Core 2.
Anton Ertl wrote:
There are lots of potentially unaligned loads and stores. There are
very few actually unaligned loads and stores: On Linux-Alpha every
unaligned access is logged by default, and the number of
unaligned-access entries in the logs of our machines was relatively
small (on average a few per day). So trapping actual unaligned
accesses was faster than replacing potential unaligned accesses with
code sequences that synthesize the unaligned access from aligned
accesses.
Of course, if the cost of unaligned accesses is that high, you will
avoid them in cases like block copies where cheap unaligned accesses
would otherwise be beneficial.
- anton
That is fine for code that is being actively maintained and where backward
data structure compatibility is not required (like code inside a kernel).
However for x86 there were a few billion lines of legacy code that likely
assumed 2-byte alignment, or followed the fp64-aligned-to-32-bits advice,
and a C language that mandates structs be laid out in memory exactly as
specified (no automatic struct optimization).
Also I seem to recall some
amount of squawking about SIMD when it required naturally aligned buffers.
As SIMD no longer requires alignment, presumably code no longer does so.
Also in going from 32 to 64 bits, data structures that contain pointers
now could find those 8-byte pointers aligned on 4-byte boundaries.
While the Linux kernel may not use many misaligned values,
I'd guess there is a lot of application code that does.
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 4 Feb 2025 4:49:57 +0000, EricP wrote:
MitchAlsup1 wrote:
Basically, VAX taught us why we did not want to do "all that" in
a single instruction; while Intel 432 taught us why we did not want
bit-aligned decoders (and a lot of other things).
In case people are interested...
[paywalled]
The Instruction Decoding Unit for the VLSI 432 General Data Processor,
1981
https://ieeexplore.ieee.org/abstract/document/1051633/
The benchmarks in table 1(a) below tell it all:
a 4 MHz 432 is 1/15 to 1/20 the speed of a 5 MHz VAX/780,
and 1/4 to 1/7 the speed of an 8 MHz 68000 or 5 MHz 8086
A Performance Evaluation of The Intel iAPX 432, 1982
https://dl.acm.org/doi/pdf/10.1145/641542.641545
And the reasons are covered here:
Performance Effects of Architectural Complexity in the Intel 432, 1988
https://www.princeton.edu/~rblee/ELE572Papers/Fall04Readings/I432.pdf
From the link::
The 432’s procedure calls are quite costly. A typical procedure call
requires 16 read accesses to memory and 24 write accesses, and it
consumes 982 machine cycles. In terms of machine cycles, this makes
it about ten times as slow as a call on the MC68010 or VAX 11/780.
almost 1000 cycles just to call a subroutine !!!
Lots of things the architects got wrong in there.....
While true, it's easy to say in retrospect after forty+
years of advancements in silicon design and technology.
Comparing to the CISC architectures of the 60s and 70s,
it's not horrible.
Thomas Koenig <tkoenig@netcologne.de> writes:
http://gcc.gnu.org/bugzilla is your friend.
In my experience it's a waste of time:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93765
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93811
According to MitchAlsup1 <mitchalsup@aol.com>:
while Intel 432 taught us why we did not want bit-aligned decoders
(and a lot of other things).
It was certainly an interesting experiment in yet another way that
Intel wanted programmers to use their computers and the programmers
said, naah.
something along the lines of:
uload:
// addi a5,a0,7 // unnecessary
andi a4,a0,-8
// andi a5,a5,-8 // unnecessary
ld a2,8(a4) // load higher
ld a3,0(a4) // load lower
neg a4,a0
andi a4,a4,7
andi a0,a0,7
slliw a4,a4,3
slliw a5,a0,3
sll a5,a3,a5
sra a0,a2,a4
or a0,a0,a5
ret
Fewer instructions, and also a better distribution between various
functional units.
IIRC it's only three instructions on MIPS and five instructions on
Alpha, but they have special instructions for this case, because they
were designed for it, whereas RISC-V was designed to have unaligned
accesses.
- anton
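A minimal C-level sketch of the same aligned-loads-plus-shift idea (not
code from the thread; it assumes a little-endian target, and the branch
avoids the C-level shift-by-64 problem that the branchless assembly does
not have):

#include <stdint.h>

uint64_t uload_synth(const void *p)   /* p may be misaligned */
{
    uintptr_t a = (uintptr_t)p;
    const uint64_t *base = (const uint64_t *)(a & ~(uintptr_t)7); /* align down */
    unsigned off = (unsigned)(a & 7) * 8;     /* misalignment, in bits */
    uint64_t lo = base[0];                    /* aligned load, lower word */
    if (off == 0)
        return lo;
    uint64_t hi = base[1];                    /* aligned load, upper word */
    return (lo >> off) | (hi << (64 - off));  /* splice the two halves */
}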
On 2/4/2025 1:25 PM, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Comparing to the CISC architectures of the 60s and 70s,
it's not horrible.
Well, vs a modern RISC style ISA, say, caller side:
MOV R20, R10 //0c (SSC with following)
MOV R21, R11 //1c
BSR func //2c (typically)
Cost: 3 cycles.
func:
ADD SP, -32, SP //2c (1 c penalty)
MOV.Q LR, (SP, 24) //1c
MOV.X R18, (SP, 0) //1c
...
MOV.Q (SP, 24), LR //2c (1c penalty)
MOV.X (SP, 0), R18 //1c
JMP LR //10c (*1)
*1: Insufficient delay since LR reload, so branch predictor fails to
handle this case.
Cost: 16 cycles.
....
In article <vnrrmg$2adb$1@gal.iecc.com>, johnl@taugh.com (John Levine)
wrote:
According to MitchAlsup1 <mitchalsup@aol.com>:
while Intel 432 taught us why we did not want bit-aligned decoders
(and a lot of other things).
It was certainly an interesting experiment in yet another way that
Intel wanted programmers to use their computers and the programmers
said, naah.
It didn't get that far. There were no low-cost i432 systems, so the
ingenious software developers of the early 1980s carried on using more conventional microprocessors.
The DoD wanted Ada, but the new software companies of the period
weren't especially interested in selling to them. Making money in the civilian business software and games markets was far easier and more
fun.
John
EricP <ThatWouldBeTelling@thevillage.com> writes:
As SIMD no longer requires alignment, presumably code no longer does so.
Yes, if you use AVX/AVX2, you don't encounter this particular Intel stupidity.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
for k in 0..li {
let sum = lock & keylocks[k];
if sum == 0 {
part1 += 1;
}
}
Does Rust only have this roundabout way to express this sequentially?
In Forth I would express that scalarly as
( part1 ) li 0 do
keylocks i th @ lock and 0= - loop
["-" because 0= produces all-bits-set (-1) for true]
or in C as
for (k=0; k<li; k++)
part1 += (lock & keylocks[k])==0;
which I find much easier to follow. I also expected 0..li to include
li (based on, I guess, the use of .. in Pascal and its descendants), but
the net tells me that it does not (starting with 0 was the hint that
made me check my expectations).
Telling the rust compiler to target my AVX2-capable laptop CPU (an Intel
i7)
I find it deplorable that even knowledgeable people use marketing
labels like "i7" which do not tell anything technical (and very little
non-technical) rather than specifying the full model number (e.g.,
Core i7-1270P) or the design (e.g., Alder Lake). But in the present
case "AVX2-capable CPU" is enough information.
I got code that simply amazed me: The compiler unrolled the inner
loop by 32, ANDing 4 x 8 keys by 8 copies of the current lock into 4 AVX
registers (vpand), then comparing with a zeroed register (vpcmpeqd)
(generating -1/0 results) before subtracting (vpsubd) those from 4
accumulators.
If you have ever learned about vectorization, it's easy to see that
the inner loop can be vectorized. And obviously auto-vectorization
has worked in this case, not particularly amazing to me.
clang is somewhat better:
For the avx2 case, 70 lines and 250 bytes.
For the x86-64-v4 case, 111 lines and 435 bytes.
Do you mean that there were high-cost i432 systems?
Not dissimilar to Merced 17-18 years later, except that the number of
systems given away in the early 80s was probably 3 orders of
magnitude lower than in the late 90s.
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
As SIMD no longer requires alignment, presumably code no longer
does so.
Yes, if you use AVX/AVX2, you don't encounter this particular Intel stupidity.
Recently, on the last day (Dec 25th) of Advent of Code, I had a
problem which lent itself to using 32-bit bitmaps: The task was to
check which locks were compatible with which keys, so I ended up with
code like this:
let mut part1 = 0;
for l in li..keylocks.len() {
let lock = keylocks[l];
for k in 0..li {
let sum = lock & keylocks[k];
if sum == 0 {
part1 += 1;
}
}
}
Telling the rust compiler to target my AVX2-capable laptop CPU (an
Intel i7), I got code that simply amazed me: The compiler unrolled
the inner loop by 32, ANDing 4 x 8 keys by 8 copies of the current
lock into 4 AVX registers (vpand), then comparing with a zeroed
register (vpcmpeqd) (generating -1/0 results) before subtracting
(vpsubd) those from 4 accumulators.
This resulted in just 12 instructions to handle 32 tests.
The final code, with zero unsafe/asm/intrinsics, took 5.8
microseconds to run all the needed parsing/setup/initialization and
then test 62500 combinations, so just 93 ps per key/lock test!
There was no attempt to check for 32-byte alignment, it all just
worked. :-)
The task is of course embarrassingly parallelizable, but I suspect
the overhead of starting 4 or 8 threads will be higher than what I
would save? I guess I'll have to test!
Terje
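A rough AVX2 intrinsics sketch of the pattern Terje describes (a
reconstruction from the description above, not his code; names are made
up and only one accumulator is shown instead of four):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Count the keys with (lock & key) == 0, 8 keys per step. */
static unsigned count_fitting_keys(uint32_t lock, const uint32_t *keys, size_t nkeys)
{
    __m256i vlock = _mm256_set1_epi32((int)lock);   /* 8 copies of the lock */
    __m256i zero  = _mm256_setzero_si256();
    __m256i acc   = _mm256_setzero_si256();
    size_t k = 0;
    for (; k + 8 <= nkeys; k += 8) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(keys + k)); /* no 32-byte alignment needed */
        __m256i s  = _mm256_and_si256(v, vlock);                      /* vpand */
        __m256i eq = _mm256_cmpeq_epi32(s, zero);                     /* vpcmpeqd: -1 where sum == 0 */
        acc = _mm256_sub_epi32(acc, eq);                              /* vpsubd: subtracting -1 counts */
    }
    uint32_t lane[8];
    _mm256_storeu_si256((__m256i *)lane, acc);
    unsigned part1 = 0;
    for (int i = 0; i < 8; i++)
        part1 += lane[i];
    for (; k < nkeys; k++)                    /* scalar tail */
        part1 += (lock & keys[k]) == 0;
    return part1;
}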
While the Linux kernel may not use many misaligned values,
I'd guess there is a lot of application code that does.
For:...
unsigned long inner(unsigned long li, unsigned lock, unsigned keylocks[], unsigned long part1)
{
unsigned long k;
for (k=0; k<li; k++)
part1 += (lock & keylocks[k])==0;
return part1;
}
gcc -Wall -O3 -mavx2 -c x.c && objdump -d x.o
produces 109 lines of disassembly output (which I will spare you),
with a total length of 394 bytes.
clang is somewhat better:
For the avx2 case, 70 lines and 250 bytes.
Anton Ertl wrote:
If you have ever learned about vectorization, it's easy to see that
the inner loop can be vectorized. And obviously auto-vectorization
has worked in this case, not particularly amazing to me.
I have some (30 years?) experience with auto-vectorization, usually I've
been (very?) disappointed.
As I wrote, this was the best I have ever
seen, and the resulting code actually performed extremely close to the
theoretical speed of light, i.e. 3 clock cycles for each group of 3 AVX
instructions.
EricP <ThatWouldBeTelling@thevillage.com> wrote:
While the Linux kernel may not use many misaligned values,
I'd guess there is a lot of application code that does.
I guess that much of that is simply "by accident": because
there are no alignment checks in hardware, misalignment may happen
and nobody notices that there is a small performance problem.
I worked on a low-level program and reasonably recently I did get a
bunch of alignment errors. On AMD64 they were due to SSE
instructions used by 'memcpy', on 32-bit ARM due to use of double
precision floating point in 'memcpy'. It took some time to find
them; most things simply worked even without alignment and the
offending cases were hard to trigger.
My personal feeling is that the best machine would have aligned
access with checks by default, but also special instructions
for unaligned access. That way code that does not need
unaligned access gets extra error checking, while code that
uses unaligned access pays a modest, essentially unavoidable
penalty.
Of course, once an architecture officially supports unaligned
access, there will be binaries depending on this and backward
compatibility will prevent a change to require alignment.
Concerning SIMD: trouble here is increasing vector length and
consequently increasing alignment requirements. A lot of SIMD
code is memory-bound and current way of doing misaligned
access leads to worse performance. So really no good way
to solve this. In principle a set of buffers of 2 cache lines
each and appropriate shifters could give optimal throughput,
but probably would lead to increased latency.
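At the C source level, the closest existing analogue to that split is the
memcpy idiom from the start of this thread: plain dereferences carry the
alignment assumption (and could be checked or trapped), while deliberately
unaligned accesses are spelled out explicitly. A small illustration (not
code from the thread):

#include <string.h>
#include <stdint.h>

static inline uint32_t load32_aligned(const uint32_t *p)
{
    return *p;               /* assumed aligned; a checking machine could trap here */
}

static inline uint32_t load32_unaligned(const void *p)
{
    uint32_t x;
    memcpy(&x, p, sizeof x); /* explicitly unaligned; lowered to whatever the target supports */
    return x;
}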
On Wed, 5 Feb 2025 18:10:03 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
As SIMD no longer requires alignment, presumably code no longer
does so.
Yes, if you use AVX/AVX2, you don't encounter this particular Intel
stupidity.
Recently, on the last day (Dec 25th) of Advent of Code, I had a
problem which lent itself to using 32-bit bitmaps: The task was to
check which locks were compatible with which keys, so I ended up with
code like this:
let mut part1 = 0;
for l in li..keylocks.len() {
let lock = keylocks[l];
for k in 0..li {
let sum = lock & keylocks[k];
if sum == 0 {
part1 += 1;
}
}
}
Telling the rust compiler to target my AVX2-capable laptop CPU (an
Intel i7), I got code that simply amazed me: The compiler unrolled
the inner loop by 32, ANDing 4 x 8 keys by 8 copies of the current
lock into 4 AVX registers (vpand), then comparing with a zeroed
register (vpcmpeqd) (generating -1/0 results) before subtracting
(vpsubd) those from 4 accumulators.
This resulted in just 12 instructions to handle 32 tests.
That sounds suboptimal.
By unrolling the outer loop by 2 or 3 you can greatly reduce the number of
memory accesses per comparison. The speedup would depend on the specific
microarchitecture, but I would guess that there is at least a 1.2x speedup
here. Especially so when data is not aligned.
Michael S wrote:
On Wed, 5 Feb 2025 18:10:03 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
As SIMD no longer requires alignment, presumably code no longer
does so.
Yes, if you use AVX/AVX2, you don't encounter this particular
Intel stupidity.
Recently, on the last day (Dec 25th) of Advent of Code, I had a
problem which lent itself to using 32-bit bitmaps: The task was to
check which locks were compatible with which keys, so I ended up
with code like this:
let mut part1 = 0;
for l in li..keylocks.len() {
let lock = keylocks[l];
for k in 0..li {
let sum = lock & keylocks[k];
if sum == 0 {
part1 += 1;
}
}
}
Telling the rust compiler to target my AVX2-capable laptop CPU (an
Intel i7), I got code that simply amazed me: The compiler unrolled
the inner loop by 32, ANDing 4 x 8 keys by 8 copies of the current
lock into 4 AVX registers (vpand), then comparing with a zeroed
register (vpcmpeqd) (generating -1/0 results) before subtracting
(vpsubd) those from 4 accumulators.
This resulted in just 12 instructions to handle 32 tests.
That sounds suboptimal.
By unrolling the outer loop by 2 or 3 you can greatly reduce the number
of memory accesses per comparison. The speedup would depend on
the specific microarchitecture, but I would guess that there is at least
a 1.2x speedup here. Especially so when data is not aligned.
Anton already replied, as he wrote the total loop overhead is just
three instructions, all of which can (& will?) overlap with the AVX instructions.
Due to the combined AVX and 4x unroll, the original scalar code is
already unrolled 32x, so the loop overhead can mostly be ignored.
If the cpu has enough resources to run more than one 32-byte AVX
instruction per cycle, then the same code will allow all four copies
to run at the same time, but the timing I see on my laptop (93 ps) corresponds closely to one AVX op/cycle.
Terje
Michael S <already5chosen@yahoo.com> writes:
This resulted in just 12 instructions to handle 32 tests.
That sounds suboptimal.
By unrolling outer loop by 2 or 3 you can greatly reduce the number
of memory accesses per comparison.
Looking at the inner loop code shown in <2025Feb6.113049@mips.complang.tuwien.ac.at>, the 12 instructions do
not include the loop overhead and are already unrolled by a factor of
4 (32 for the scalar code). The loop overhead is 3 instructions, for
a total of 15 instructions per iteration.
The speedup would depend on the specific
microarchitecture, but I would guess that there is at least a 1.2x speedup
here.
Even if you completely eliminate the loop overhead, the number of instructions is reduced by at most a factor 1.25, and I expect that
the speedup from further unrolling is a factor of at most 1 on most
CPUs (factor <1 can come from handling the remaining elements slowly,
which does not seem unlikely for code coming out of gcc and clang).
- anton
Michael S <already5chosen@yahoo.com> writes:
On Thu, 06 Feb 2025 10:59:39 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
...Michael S <already5chosen@yahoo.com> writes:
This resulted in just 12 instructions to handle 32 tests.
That sounds suboptimal.
By unrolling outer loop by 2 or 3 you can greatly reduce the
number of memory accesses per comparison.
The point of my proposal is not reduction of loop overhead and not
reduction of the # of x86 instructions (in fact, with my proposal
the # of x86 instructions is increased), but reduction in # of uOps
due to reuse of loaded values.
The theory behind it is that most typically in code with very high
IPC like the one above the main bottleneck is the # of uOps that
flows through the rename stage.
Not counting loop overhead, an original 1x4 inner loop consists of 12
instructions, 16 uOps. Suppose we replace it by a 2x2 inner loop that
does the same amount of work. The new inner loop contains only RISC-like
instructions - 14 instructions, 14 uOps.
With a 3x2 inner loop there are 20 instructions, 20 uOps and 1.5x more
work done per iteration.
I completely missed the "outer" in your response. Yes, looking at the original loop again:
let mut part1 = 0;
for l in li..keylocks.len() {
let lock = keylocks[l];
for k in 0..li {
let sum = lock & keylocks[k];
if sum == 0 {
part1 += 1;
}
}
}
you can reuse the keylocks[k] value by unrolling the outer loop.
E.g., if you unroll the outer loop by a factor of 4 (and the inner
loop not beyond getting SIMD width), you can use almost the same code
as clang produces, but you load keylocks[0..li] 4 times less often,
and if the bottleneck for the inner-loop-only-optimized variant is
bandwidth to the memory subsystem (it seems that Terje Mathisen worked
with up to 62500 values, i.e., 250KB, i.e. L2), which is likely, there
may be quite a bit of speedup.
E.g., for Zen5 the bandwidth to L2 is reported to be 32 bytes/cycle,
which would limit the performance to need at least 4 cycles/iteration
(3.75 IPC), possibly less due to misalignment handling overhead, and
using AVX-512 would not help, whereas with reusing the loaded value
the limit would probably be resources, and AVX-512 would see quite a
bit of speedup over AVX2.
Another factor that can contribute to a speedup is increased number
of iterations in the inner loop - from 1..7 iterations in original to
1..15 in both of my above mentioned variants.
Yes. I actually don't see a reason to unroll the inner loop more than
needed for the SIMD instructions at hand, unless the number of
outer-loop iterations is too small. If you want more unrolling,
unroll the outer loop more.
Yet another possibility is to follow "work harder not smarter"
principle, i.e. process the whole square rather than just a relevant triangle.
I don't see a triangle in the code above. There may be some more
outer loop involved that varies li from 0 to keylocks.len() or
something, but the code that is presented processes a square.
- anton
On Thu, 06 Feb 2025 10:59:39 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
This resulted in just 12 instructions to handle 32 tests.
That sounds suboptimal.
By unrolling outer loop by 2 or 3 you can greatly reduce the number
of memory accesses per comparison.
Looking at the inner loop code shown in
<2025Feb6.113049@mips.complang.tuwien.ac.at>, the 12 instructions do
not include the loop overhead and are already unrolled by a factor of
4 (32 for the scalar code). The loop overhead is 3 instructions, for
a total of 15 instructions per iteration.
The speedup would depend on the specific
microarchitecture, but I would guess that there is at least a 1.2x speedup
here.
Even if you completely eliminate the loop overhead, the number of
instructions is reduced by at most a factor 1.25, and I expect that
the speedup from further unrolling is a factor of at most 1 on most
CPUs (factor <1 can come from handling the remaining elements slowly,
which does not seem unlikely for code coming out of gcc and clang).
- anton
The point of my proposal is not reduction of loop overhead and not
reduction of the # of x86 instructions (in fact, with my proposal the #
of x86 instructions is increased), but reduction in # of uOps due to
reuse of loaded values.
The theory behind it is that most typically in code with very high
IPC like the one above the main bottleneck is the # of uOps that flows
through the rename stage.
Not counting loop overhead, an original 1x4 inner loop consists of 12
instructions, 16 uOps. Suppose we replace it by a 2x2 inner loop that
does the same amount of work. The new inner loop contains only RISC-like
instructions - 14 instructions, 14 uOps.
With a 3x2 inner loop there are 20 instructions, 20 uOps and 1.5x more
work done per iteration.
Another factor that can contribute to a speedup is increased number
of iterations in the inner loop - from 1..7 iterations in original to
1..15 in both of my above mentioned variants.
Yet another possibility is to follow the "work harder not smarter"
principle, i.e. process the whole square rather than just the relevant
triangle. The main gain is that the loop detector would be able to predict
the # of iterations in the inner loop, avoiding a mispredicted branch at
the end. If we follow this path then it probably makes sense to
not unroll the inner loop beyond a SIMD factor of 8 and instead unroll the
outer loop by 4.
Going by intuition, in this particular application "smarter" wins
over "harder", but we know that intuition sucks. Including mine :(
antispam@fricas.org (Waldek Hebisch) writes:
Concerning SIMD: trouble here is increasing vector length and
consequently increasing alignment requirements.
That is not a necessary consequence, on the contrary: alignment
requirements based on SIMD granularity are hardware designer laziness,
but mean that SIMD cannot be used for many of the applications where
SIMD without that limitation can be used.
If you want to have alignment checks, then a SIMD instruction should
check for element alignment, not for SIMD alignment.
But the computer architecture trend is clear: General-purpose
computers do not have alignment restrictions; all that had them have
been discontinued; the last one that had them was SPARC.
A lot of SIMD
code is memory-bound and current way of doing misaligned
access leads to worse performance. So really no good way
to solve this. In principle a set of buffers of 2 cache lines
each and appropriate shifters could give optimal throughput,
but probably would lead to increased latency.
AFAIK that's what current microarchitectures do, and in many cases
with small penalties for unaligned accesses; see https://www.complang.tuwien.ac.at/anton/unaligned-stores/
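To make the element-alignment point concrete (an illustration, not taken
from that page): keys+1 below is element-aligned (4 bytes) but in general
not 32-byte aligned, and an unaligned SIMD load handles it fine; only
element misalignment would be worth trapping on.

#include <immintrin.h>

/* keys is assumed 4-byte aligned and at least 9 elements long. */
unsigned second_element(const unsigned *keys)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)(keys + 1));
    return (unsigned)_mm256_extract_epi32(v, 0);   /* == keys[1] */
}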
Michael S wrote:
The point of my proposal is not reduction of loop overhead and not
reduction of the # of x86 instructions (in fact, with my proposal the #
of x86 instructions is increased), but reduction in # of uOps due to
reuse of loaded values.
The theory behind it is that most typically in code with very high
IPC like the one above the main bottleneck is the # of uOps that flows
through rename stage.
Aha! I see what you mean: Yes, this would be better if the
VPAND reg,reg,[mem]
instructions actually took more than one cycle each, but as the size of
the arrays was just 1000 bytes each (250 keys + 250 locks), everything
fits easily in $L1. (BTW, I did try to add 6 dummy keys and locks just
to avoid any loop end overhead, but that actually ran slower.)
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
But the computer architecture trend is clear: General-purpose
computers do not have alignment restrictions; all that had them have
been discontinued; the last one that had them was SPARC.
The trend is clear, but there is a question: is it a good trend?
You wrote about lazy hardware designers, but there are many
more lazy programmers.
There are situations when unaligned
access is needed, but a significant proportion of unaligned
accesses is not needed at all.
At best such unaligned
accesses lead to a small performance loss.
There are cases when unaligned accesses
are better than aligned ones; for those an architecture
should have appropriate instructions.
You call doubling the store time a 'small penalty'. For me, in a
performance-critical loop 10% matters and it is worth
aligning things to avoid such a loss.
For me much more important are loads: stores can be buffered and
the latency of the store itself is of little importance (latency
from store to load matters), while for loads extra things in the
load path increase latency and that may limit program speed.
Terje Mathisen wrote:
Michael S wrote:
The point of my proposal is not reduction of loop overhead and not
reduction of the # of x86 instructions (in fact, with my proposal
the # of x86 instructions is increased), but reduction in # of
uOps due to reuse of loaded values.
The theory behind it is that most typically in code with very high
IPC like the one above the main bottleneck is the # of uOps that
flows through rename stage.
Aha! I see what you mean: Yes, this would be better if the
VPAND reg,reg,[mem]
instructions actually took more than one cycle each, but as the
size of the arrays were just 1000 bytes each (250 keys + 250
locks), everything fits easily in $L1. (BTW, I did try to add 6
dummy keys and locks just to avoid any loop end overhead, but that
actually ran slower.)
I've just tested it by running either 2 or 4 locks in parallel in the
inner loop: The fastest time I saw actually did drop a smidgen, from
5800 ns to 5700 ns (for both 2 and 4 wide), with 100 ns being the
timing resolution I get from the Rust run_benchmark() function.
So yes, it is slightly better to run a stripe instead of just a
single row in each outer loop.
Terje
On Thu, 6 Feb 2025 17:47:30 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Terje Mathisen wrote:
Michael S wrote:
The point of my proposal is not reduction of loop overhead and not
reduction of the # of x86 instructions (in fact, with my proposal
the # of x86 instructions is increased), but reduction in # of
uOps due to reuse of loaded values.
The theory behind it is that most typically in code with very high
IPC like the one above the main bottleneck is the # of uOps that
flows through rename stage.
Aha! I see what you mean: Yes, this would be better if the
VPAND reg,reg,[mem]
instructions actually took more than one cycle each, but as the
size of the arrays were just 1000 bytes each (250 keys + 250
locks), everything fits easily in $L1. (BTW, I did try to add 6
dummy keys and locks just to avoid any loop end overhead, but that
actually ran slower.)
I've just tested it by running either 2 or 4 locks in parallel in the
inner loop: The fastest time I saw actually did drop a smidgen, from
5800 ns to 5700 ns (for both 2 and 4 wide), with 100 ns being the
timing resolution I get from the Rust run_benchmark() function.
So yes, it is slightly better to run a stripe instead of just a
single row in each outer loop.
Terje
Assuming that your CPU is new and runs at decent frequency (4-4.5 GHz),
the results are 2-3 times slower than expected. I would guess that it
happens because there are too few iterations in the inner loop.
Turning unrolling upside down, as I suggested in the previous post,
should fix it.
Very easy to do in C with intrinsics. Probably not easy in Rust.
Michael S wrote:
On Thu, 6 Feb 2025 17:47:30 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Terje Mathisen wrote:
Michael S wrote:
The point of my proposal is not reduction of loop overhead and
not reduction of the # of x86 instructions (in fact, with my
proposal the # of x86 instructions is increased), but reduction
in # of uOps due to reuse of loaded values.
The theory behind it is that most typically in code with very
high IPC like the one above the main bottleneck is the # of uOps
that flows through rename stage.
Aha! I see what you mean: Yes, this would be better if the
VPAND reg,reg,[mem]
instructions actually took more than one cycle each, but as the
size of the arrays were just 1000 bytes each (250 keys + 250
locks), everything fits easily in $L1. (BTW, I did try to add 6
dummy keys and locks just to avoid any loop end overhead, but that
actually ran slower.)
I've just tested it by running either 2 or 4 locks in parallel in
the inner loop: The fastest time I saw actually did drop a
smidgen, from 5800 ns to 5700 ns (for both 2 and 4 wide), with 100
ns being the timing resolution I get from the Rust run_benchmark()
function.
So yes, it is slightly better to run a stripe instead of just a
single row in each outer loop.
Terje
Assuming that your CPU is new and runs at decent frequency (4-4.5
GHz), the results are 2-3 times slower than expected. I would guess
that it happens because there are too few iterations in the inner
loop. Turning unrolling upside down, as I suggested in the previous
post, should fix it.
Very easy to do in C with intrinsics. Probably not easy in Rust.
I did mention that this is a (cheap) laptop? It is about 15 months
old, and with a base frequency of 2.676 GHz.
I guess that would
explain most of the difference between what I see and what you
expected?
BTW, when I timed 1000 calls to that 5-6 us program, to get around
the 100 ns timer resolution, each iteration ran in 5.23 us.
Terje
On 2/6/2025 2:36 PM, Terje Mathisen wrote:
Michael S wrote:
On Thu, 6 Feb 2025 17:47:30 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
FWIW: The idea of running a CPU at 4+ GHz seems a bit much (IME, CPUs
tend to run excessively hot at these kinds of clock speeds; 3.2 to 3.6 GHz seeming more reasonable so that it "doesn't melt", or have thermal throttling or stability issues).
A smaller pagefile still exists on the SSD, but mostly because Windows
is unhappy if there is no pagefile on 'C'. Don't generally want a
pagefile on an SSD though as it is worse for lifespan (but, it is 8GB,
which Windows accepts; with around 192GB each on the other drives, for ~ 400GB of swap space).
Not sure how well Windows load-balances swap, apparently not very well
though (when it starts paging, most of the load seems to be on one
drive; better if it could give a more even spread).
The SSD seems to get ~ 300 MB/sec.
....
Michael S wrote:
On Thu, 6 Feb 2025 21:36:38 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
BTW, when I timed 1000 calls to that 5-6 us program, to get around
the 100 ns timer resolution, each iteration ran in 5.23 us.
That measurement could be good enough on a desktop. Or not.
It is certainly not good enough on a laptop and even less so on a server.
On a laptop I wouldn't be satisfied before I lock my program to a
particular core, then do something like 21 measurements with 100K
calls in each measurement (~10 sec total) and report the median of 21.
Each measurement did 1000 calls, then I ran 100 such measurements.
The 5.23 us value was the lowest seen among the 100, with average a
bit more:
Slowest: 9205200 ns
Fastest: 5247500 ns
Average: 5672529 ns/iter
Part1 = 3338
My own (old, but somewhat kept up to date) cputype program reported
that it is a "13th Gen Intel(R) Core(TM) i7-1365U" according to CPUID.
Is that sufficient to judge the performance?
Terje
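A minimal C sketch of the batching approach under discussion (not Terje's
Rust run_benchmark): time a batch of calls per clock reading so the ~100 ns
timer granularity is amortized, repeat, and report the fastest batch;
keylocks() stands in for the routine under test.

#include <time.h>

extern unsigned keylocks(unsigned keys[], unsigned nkeys,
                         unsigned locks[], unsigned nlocks);

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

double best_ns_per_call(unsigned keys[], unsigned nk,
                        unsigned locks[], unsigned nl)
{
    enum { CALLS = 1000, REPS = 100 };
    volatile unsigned sink = 0;     /* keep the calls from being optimized away */
    double best = 1e30;
    for (int r = 0; r < REPS; r++) {
        double t0 = now_ns();
        for (int c = 0; c < CALLS; c++)
            sink += keylocks(keys, nk, locks, nl);
        double per_call = (now_ns() - t0) / CALLS;
        if (per_call < best)
            best = per_call;
    }
    (void)sink;
    return best;                    /* fastest batch, as reported above */
}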
On Fri, 7 Feb 2025 11:06:43 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Thu, 6 Feb 2025 21:36:38 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
BTW, when I timed 1000 calls to that 5-6 us program, to get around
the 100 ns timer resolution, each iteration ran in 5.23 us.
That measurement could be good enough on a desktop. Or not.
It is certainly not good enough on a laptop and even less so on a server.
On a laptop I wouldn't be satisfied before I lock my program to a
particular core, then do something like 21 measurements with 100K
calls in each measurement (~10 sec total) and report the median of 21.
Each measurement did 1000 calls, then I ran 100 such measurements.
The 5.23 us value was the lowest seen among the 100, with average a
bit more:
Slowest: 9205200 ns
Fastest: 5247500 ns
Average: 5672529 ns/iter
Part1 = 3338
My own (old, but somewhat kept up to date) cputype program reported
that it is a "13th Gen Intel(R) Core(TM) i7-1365U" according to CPUID.
Is that sufficient to judge the performance?
Terje
Not really.
i7-1365U is a complicated beast. 2 "big" cores, 8 "medium" cores.
Frequency varies a LOT, 1.8 to 5.2 GHz on "big", 1.3 to 3.9 GHz on
"medium".
As I said above, on such a CPU I wouldn't believe the numbers before
the total duration of the test is 10 seconds and the test run is locked to
a particular core. As for 5 msec per measurement, that's enough, but why
not do longer measurements if you have to run for 10 sec anyway?
Michael S wrote:
On Fri, 7 Feb 2025 11:06:43 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Thu, 6 Feb 2025 21:36:38 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
BTW, when I timed 1000 calls to that 5-6 us program, to get
around the 100 ns timer resolution, each iteration ran in 5.23
us.
That measurement could be good enough on a desktop. Or not.
It is certainly not good enough on a laptop and even less so on a server.
On a laptop I wouldn't be satisfied before I lock my program to a
particular core, then do something like 21 measurements with 100K
calls in each measurement (~10 sec total) and report the median of
21.
Each measurement did 1000 calls, then I ran 100 such measurements.
The 5.23 us value was the lowest seen among the 100, with average a
bit more:
Slowest: 9205200 ns
Fastest: 5247500 ns
Average: 5672529 ns/iter
Part1 = 3338
My own (old, but somewhat kept up to date) cputype program reported
that it is a "13th Gen Intel(R) Core(TM) i7-1365U" according to
CPUID.
Is that sufficient to judge the performance?
Terje
Not really.
i7-1365U is a complicated beast. 2 "big" cores, 8 "medium" cores.
Frequency varies a LOT, 1.8 to 5.2 GHz on "big", 1.3 to 3.9 GHz on
"medium".
OK. It seems like the big cores are similar to what I've had
previously, i.e. each core supports hyperthreading, while the medium
ones don't. This results in 12 HW threads.
As I said above, on such a CPU I wouldn't believe the numbers before
the total duration of the test is 10 seconds and the test run is locked to a
particular core. As for 5 msec per measurement, that's enough, but
why not do longer measurements if you have to run for 10 sec
anyway?
The Advent of Code task required exactly 250 keys and 250 locks to be
tested; this of course fits easily in a corner of $L1 (2000 bytes).
The input file to be parsed was 43*500 = 21500 bytes long, so this
should also fit in $L1 when I run repeated tests.
Under Windows I can set thread affinity to lock a process to a given
core, but how do I know which are "Big" and "Medium"?
Terje
On Fri, 7 Feb 2025 15:23:51 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Under Windows I can set thread affinity to lock a process to a given
core, but how do I know which are "Big" and "Medium"?
Trial and error?
In the mean time.
I did a few measurements on a Xeon E3 1271 v3. That is a rather old
uArch - Haswell, the first core that supports AVX2. During the tests
it was running at 4.0 GHz.
1. Original code (rewritten in plain C) compiled with clang -O3 -march=ivybridge (no AVX2)
2. Original code (rewritten in plain C) compiled with clang -O3 -march=haswell (AVX2)
3. Manually vectorized AVX2 code compiled with clang -O3 -march=skylake (AVX2)
Results were as following (usec/call)
1 - 5.66
2 - 5.56
3 - 2.18
If one runs their CPU at 4 GHz, then under multi-threaded load, it may
hit 70C or so, frequency starts jumping all over (as it tries to keep temperature under control), and sometimes the computer will crash.
On Fri, 7 Feb 2025 15:04:23 +0000, Michael S wrote:
res += _mm256_extract_epi32(res0, 0);
res += _mm256_extract_epi32(res0, 4);
return res;
Simple question:: how would you port this code to a machine
with a different SIMD instruction set ??
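One portable route (a sketch of the general idea, not code from the
thread) is to keep the reduction in plain C or in the gcc/clang vector
extensions, as keylocks3.c (described below) does, and let the compiler
map it onto whatever SIMD the target has instead of hard-coding
_mm256_extract_epi32:

typedef unsigned vu8 __attribute__ ((vector_size (8 * sizeof(unsigned))));

static unsigned hsum8(vu8 acc)     /* final horizontal sum */
{
    unsigned res = 0;
    for (int i = 0; i < 8; i++)
        res += acc[i];             /* lowered with the target's own shuffles/adds */
    return res;
}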
Michael S <already5chosen@yahoo.com> writes:
In the mean time.
I did a few measurements on a Xeon E3 1271 v3. That is a rather old
uArch - Haswell, the first core that supports AVX2. During the tests
it was running at 4.0 GHz.
1. Original code (rewritten in plain C) compiled with clang -O3 -march=ivybridge (no AVX2)
2. Original code (rewritten in plain C) compiled with clang -O3 -march=haswell (AVX2)
3. Manually vectorized AVX2 code compiled with clang -O3 -march=skylake (AVX2)
Results were as following (usec/call)
1 - 5.66
2 - 5.56
3 - 2.18
In the meantime, I also wrote the original code in plain C
(keylocks1.c), then implemented your idea of unrolling the outer loop
and comparing a subarray of locks to each key (is this called strip
mining?) in plain C (with the hope that auto-vectorization works) as keylocks2.c, and finally rewrote the latter version to use gcc vector extensions (keylocks3.c). I wrote a dummy main around that that calls
the routine 100_000 times; given that the original routine's
performance does not depend on the data, and I used non-0 keys (so keylocks[23].c does not skip any keys), the actual data is not
important.
You can find the source code and the binaries I measured at <http://www.complang.tuwien.ac.at/anton/keylock/>. The binaries were compiled with gcc 12.2.0 and (in the clang subdirectory) clang-14.0.6;
the clang compilations sometimes used different UNROLL factors than
the gcc compilations (and I am unsure, which, see below).
The original code is:
unsigned keylocks(unsigned keys[], unsigned nkeys, unsigned locks[], unsigned nlocks)
{
unsigned i, j;
unsigned part1 = 0;
for (i=0; i<nlocks; i++) {
unsigned lock = locks[i];
for (j=0; j<nkeys; j++)
part1 += (lock & keys[j])==0;
}
return part1;
}
For keylocks2.c the central loops are:
for (i=0; i<UNROLL; i++)
part0[i]=0;
for (i=0; i<nlocks1; i+=UNROLL) {
for (j=0; j<nkeys1; j++) {
unsigned key = keys1[j];
for (k=0; k<UNROLL; k++)
part0[k] += (locks1[i+k] & key)==0;
}
}
For UNROLL I tried 8, 16, and 32 for AVX2 and 16, 32, or 64 for
AVX-512; the numbers below are for those factors that produce the
lowest cycles on the Rocket Lake machine.
The central loops are preceded by code to arrange the data such that
this code works: locks are copied to the longer locks1; the length of
locks1 is a multiple of UNROLL, and the entries beyond nlocks are ~0
(to increase the count by 0); and the keys are copied to keys1 (with 0
removed so that the extra locks are not counted, and that also may
increase efficiency if there is a key=0). The central loops are
followed by summing up the elements of part0.
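A sketch of that arrangement code (a reconstruction of the description,
not the actual source of keylocks2.c; locks1[] and keys1[] are assumed to
be allocated large enough, and the variable names follow the loops above):

unsigned nlocks1 = (nlocks + UNROLL - 1) / UNROLL * UNROLL; /* round up */
for (i = 0; i < nlocks; i++)
  locks1[i] = locks[i];
for (; i < nlocks1; i++)
  locks1[i] = ~0U;              /* padding: (~0 & key) != 0 for any key != 0, so it adds 0 */
unsigned nkeys1 = 0;
for (j = 0; j < nkeys; j++)
  if (keys[j] != 0)             /* a 0 key would match the padding locks, so drop it */
    keys1[nkeys1++] = keys[j];
/* ... central loops from above go here ... */
for (part1 = 0, k = 0; k < UNROLL; k++)
  part1 += part0[k];            /* sum up the partial counts at the end */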
keylocks3.c, which uses the gcc vector extensions, just changes
keylocks2.c in a few places. In particular, it adds a type vu:
typedef unsigned vu __attribute__ ((vector_size (UNROLL*sizeof(unsigned))));
The central loops now look as follows:
for (i=0; i<UNROLL; i++)
part0[i]=0;
for (i=0; i<nlocks1; i+=UNROLL) {
vu lock = *(vu *)(locks1+i);
for (j=0; j<nkeys1; j++) {
part0 -= (lock & keys1[j])==0;
}
}
One interesting aspect of the gcc vector extensions is that the result
of comparing two vectors is 0 (false) or ~0 (true) (per element),
whereas for scalars the value for true is 1. Therefore the code above updates part0 with -=, whereas in keylocks2.c += is used.
While the use of ~0 is a good choice when designing a new programming language, I would have gone for 1 in the case of a vector extension
for C, for consistency with the scalar case; in combination with
hardware that produces ~0 (e.g., Intel SSE and AVX SIMD stuff), that
means that the compiler will introduce a negation in its intermediate representation at some point; I expect that compilers will usually be
able to optimize this negation away, but I would not be surprised at
cases where my expectation is disappointed.
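A two-line illustration of that difference: with the vector extensions an
element-wise comparison yields 0 or ~0 per element (and has a signed
integer vector type), whereas the scalar comparison yields 0 or 1, hence
the -= in keylocks3.c versus the += in keylocks2.c.

typedef unsigned v8u __attribute__ ((vector_size (8 * sizeof(unsigned))));
typedef int      v8i __attribute__ ((vector_size (8 * sizeof(int))));

unsigned scalar_true(void)  { return 1u == 1u; }  /* 1 */
v8i      vector_true(v8u a) { return a == a; }    /* every element is ~0 (all bits set) */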
keylocks3.c compiles without warning on clang, but the result usually segfaults (but sometimes does not, e.g., in the timed run on Zen4; it segfaults in other runs on Zen4). I have not investigated why this
happens, I just did not include results from runs where it segfaulted;
and I tried additional runs for keylocks3-512 on Zen4 in order to have
one result there.
I would have liked to compare the performance of my code against your
code, but your code apparently was destroyed by arbitrary line
breaking in your news-posting software. Anyway, here are my results.
First cycles (which eliminates worries about turbo modes) and
instructions, then usec/call.
The cores are:
Haswell: Core i7-4790K (similar to Michael S.'s CPU)
Golden Cove: Core i3-1315U (same P-core as Terje Mathisen's laptop)
Gracemont: Core i3-1315U (same E-core as Terje Mathisen's laptop)
Rocket Lake: Xeon W-1370P (2021 Intel CPU)
Zen4: Ryzen 7 8700G (2024)
I also measured Tiger Lake (Core i5-1135G7, the CPU of a 2021 laptop),
but the results were very close to the Rocket Lake results, so because
of the limited table width, I do not show them.
The first three cores do not support AVX-512, the others do.
Cycles:
      Haswell   Golden Cove     Gracemont   Rocket Lake          Zen4
1_818_241_431 1_433_208_539 1_778_675_623 2_365_664_737 1_677_853_186  gcc avx2 1
1_051_191_216 1_189_869_807 1_872_856_423   981_948_517   727_418_069  gcc avx2 2 8
1_596_783_872 1_213_400_891 2_076_677_426 1_690_280_182   913_746_088  gcc avx2 3 8
2_195_438_821 1_638_006_130 2_577_451_872 2_291_743_879 1_617_970_157  clang avx2 1
2_757_454_335 2_151_198_125 2_506_427_284 3_174_899_185 1_523_870_829  clang avx2 2 8?
                                                          638_374_463  clang avx2 3 8?
                                          1_139_175_457 1_219_164_672  gcc 512 1
                                            856_818_642   900_108_135  gcc 512 2 32
                                            866_077_594 1_072_172_449  gcc 512 3 16
                                          2_479_213_408 1_479_937_930  clang 512 1
  912_273_706   936_311_567                 847_289_380   634_826_441  clang 512 2 16?
                                                          636_278_210  clang 512 3 16?
avx2 means: compiled with -mavx2; 512 means: compiled with
-march=x86-64-v4 (I usually did not measure those on machines that do
not support AVX-512, because I expected the results to not work; I
later measured some clang's keylocks2-512 on some of those machines).
The number behind that is the keylocks[123].c variant, and the number
behind that (if present) the UNROLL parameter. I am not sure about
the UNROLL numbers used for clang, but in any case I kept what
performed best on Rocket Lake. The number of instructions executed is (reported on the Zen4):
instructions
5_779_542_242 gcc avx2 1
3_484_942_148 gcc avx2 2 8
5_885_742_164 gcc avx2 3 8
7_903_138_230 clang avx2 1
7_743_938_183 clang avx2 2 8?
3_625_338_104 clang avx2 3 8?
4_204_442_194 gcc 512 1
2_564_142_161 gcc 512 2 32
3_061_042_178 gcc 512 3 16
7_703_938_205 clang 512 1
3_402_238_102 clang 512 2 16?
3_320_455_741 clang 512 3 16?
for gcc -mavx2 on keylocks3.c on Zen 4 an IPC of 6.44 is reported,
while microarchitecture descriptions report only a 6-wide renamer <https://chipsandcheese.com/p/amds-zen-4-part-1-frontend-and-execution-engine>.
My guess is that the front end combined some instructions (maybe
compare and branch) into a macro-op, and the renamer then processed 6 macro-ops that represented more instructions. The inner loop is
│190: vpbroadcastd (%rax),%ymm0
1.90 │ add $0x4,%rax
│ vpand %ymm2,%ymm0,%ymm0
1.09 │ vpcmpeqd %ymm3,%ymm0,%ymm0
0.41 │ vpsubd %ymm0,%ymm1,%ymm1
78.30 │ cmp %rdx,%rax
│ jne 190
and if the cmp and jne are combined into one macro-op, that would be
perfect for executing one iteration per cycle.
It's interesting that gcc's keylocks2-256 results in far fewer
instructions (and eventually, cycles). It unrolls the inner loop 8
times to process the keys in SIMD fashion, too, loading the keys one
ymm register at a time. In order to do that it arranges the locks in
8 different ymm registers in the outer loop, so the inner loop
performs 8 sequences similar to
vpand %ymm0,%ymm15,%ymm2
vpcmpeqd %ymm1,%ymm2,%ymm2
vpsubd %ymm2,%ymm4,%ymm4
surrounded by
300: vmovdqu (%rsi),%ymm0
add $0x20,%rsi
[8 3-instruction sequences]
cmp %rsi,%rdx
jne 300
It also uses 8 ymm accumulators, so not all of that fits into
registers, so three of the anded values are stored on the stack. For
Zen4 this could be improved by using only 2 accumulators. In any
case, the gcc people did something clever here, and I do not
understand how they got there from the source code, and why they did
not get there from keylocks1.c.
For clang's keylocks3-256 the inner loop and the outer loop are each
unrolled two times, resulting in an inner loop like:
190: vpbroadcastd (%r12,%rbx,4),%ymm5
vpand %ymm3,%ymm5,%ymm6
vpand %ymm4,%ymm5,%ymm5
vpcmpeqd %ymm1,%ymm5,%ymm5
vpsubd %ymm5,%ymm2,%ymm2
vpcmpeqd %ymm1,%ymm6,%ymm5
vpsubd %ymm5,%ymm0,%ymm0
vpbroadcastd 0x4(%r12,%rbx,4),%ymm5
vpand %ymm4,%ymm5,%ymm6
vpand %ymm3,%ymm5,%ymm5
vpcmpeqd %ymm1,%ymm5,%ymm5
vpsubd %ymm5,%ymm0,%ymm0
vpcmpeqd %ymm1,%ymm6,%ymm5
vpsubd %ymm5,%ymm2,%ymm2
add $0x2,%rbx
cmp %rbx,%rsi
jne 190
This results in the lowest AVX2 cycles, and I expect that one can use
that approach without crash problems without adding too many cycles.
The clang -march=x86-64-v4 results have similar code (with twice as
much inner-loop unrolling in case of keylocks3-512), but they all only
use AVX2 instructions and there have been successful runs on a Zen2
(which does not support AVX-512). It seems that clang does not
support AVX-512, or it does not understand -march=x86-64-v4 to allow
more than AVX2.
The lowest instruction count is with gcc's keylocks2-512, where the
inner loop is:
230: vpbroadcastd 0x4(%rax),%zmm4
vpbroadcastd (%rax),%zmm0
mov %edx,%r10d
add $0x8,%rax
add $0x2,%edx
vpandd %zmm4,%zmm8,%zmm5
vpandd %zmm0,%zmm8,%zmm9
vpandd %zmm4,%zmm6,%zmm4
vptestnmd %zmm5,%zmm5,%k1
vpandd %zmm0,%zmm6,%zmm0
vmovdqa32 %zmm7,%zmm5{%k1}{z}
vptestnmd %zmm9,%zmm9,%k1
vmovdqa32 %zmm3,%zmm9{%k1}{z}
vptestnmd %zmm4,%zmm4,%k1
vpsubd %zmm9,%zmm5,%zmm5
vpaddd %zmm5,%zmm2,%zmm2
vmovdqa32 %zmm7,%zmm4{%k1}{z}
vptestnmd %zmm0,%zmm0,%k1
vmovdqa32 %zmm3,%zmm0{%k1}{z}
vpsubd %zmm0,%zmm4,%zmm0
vpaddd %zmm0,%zmm1,%zmm1
cmp %r10d,%r8d
jne 230
Due to UNROLL=32, it deals with 2 zmm registers coming from the outer
loop at a time, and the inner loop is unrolled by a factor of 2, too.
It uses vptestnmd and a predicated vmovdqa32 instead of using vpcmpeqd (why?). Anyway, the code seems to rub Zen4 the wrong way, and it
performs only at 2.84 IPC, worse than the AVX2 code. Rocket Lake
performs slightly better, but still, the clang code for keylocks2-512
runs a bit faster without using AVX-512.
I also saw one case where the compiler botched it:
gcc -Wall -DUNROLL=16 -O3 -mavx2 -c keylocks3.c
[/tmp/keylock:155546] LC_NUMERIC=prog perf stat -e cycles -e instructions keylocks3-256
603800000
Performance counter stats for 'keylocks3-256':
17_476_700_581 cycles
39_480_242_683 instructions # 2.26 insn per cycle
3.506995312 seconds time elapsed
3.507020000 seconds user
0.000000000 seconds sys
(cycles and timings on the 8700G). Here the compiler failed to
vectorize the comparison, and performed them using scalar instructions
(first extracting the data from the SIMD registers, and finally
inserting the result into SIMD registers, with additional overhead
from spilling registers). The result requires about 10 times more instructions than the UNROLL=8 variant and almost 20 times more
cycles.
On to timings per routine invocation:
On a 4.4GHz Haswell (whereas Michael S. measured a 4GHz Haswell):
5.47us clang keylocks1-256 (5.66us for Michael S.'s "original code")
4.26us gcc keylocks1-256 (5.66us for Michael S.'s "original code")
2.38us gcc keylocks2-256 (2.18us for Michael S.'s manual vectorized code)
2.08us clang keylocks2-512 (2.18us for Michael S.'s manual vectorized code)
Michael S.'s "original code" performs similar on clang to my
keylocks1.c. clang's keylocks2-512 code is quite competetive with his
manual code.
On the Golden Cove of a Core i3-1315U (compared to the best result by
Terje Mathisen on a Core i7-1365U; the latter can run up to 5.2GHz
according to Intel, whereas the former can supposedly run up to
4.5GHz; I only ever measured at most 3.8GHz on our NUC, and this time
as well):
5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
4.93us clang keylocks1-256 on a 3.8GHz 1315U
4.17us gcc keylocks1-256 on a 3.8GHz 1315U
3.16us gcc keylocks2-256 on a 3.8GHz 1315U
2.38us clang keylocks2-512 on a 3.8GHz 1315U
I would have expected the clang keylocks1-256 to run slower, because
the compiler back-end is the same and the 1315U is slower. Measuring
cycles looks more relevant for this benchmark to me than measuring
time, especially on this core where AVX-512 is disabled and there is
no AVX slowdown.
- anton
Simple question:: how would you port this code to a machine
with a different SIMD instruction set ??
Michael S <already5chosen@yahoo.com> writes:
In the mean time.
I did few measurements on Xeon E3 1271 v3. That is rather old uArch - >Haswell, the first core that supports AVX2. During the tests it was
running at 4.0 GHz.
1. Original code (rewritten in plain C) compiled with clang -O3 >-march=ivybridge (no AVX2) 2. Original code (rewritten in plain C)
compiled with clang -O3 -march=haswell (AVX2) 3. Manually vectorized
AVX2 code compiled with clang -O3 -march=skylake (AVX2)
Results were as following (usec/call)
1 - 5.66
2 - 5.56
3 - 2.18
In the meantime, I also wrote the original code in plain C
(keylocks1.c), then implemented your idea of unrolling the outer loop
and comparing a subarray of locks to each key (is this called strip
mining?) in plain C (with the hope that auto-vectorization works) as keylocks2.c, and finally rewrote the latter version to use gcc vector extensions (keylocks3.c). I wrote a dummy main around that that calls
the routine 100_000 times; given that the original routine's
performance does not depend on the data, and I used non-0 keys (so keylocks[23].c does not skip any keys), the actual data is not
important.
You can find the source code and the binaries I measured at <http://www.complang.tuwien.ac.at/anton/keylock/>. The binaries were compiled with gcc 12.2.0 and (in the clang subdirectory) clang-14.0.6;
the clang compilations sometimes used different UNROLL factors than
the gcc compilations (and I am unsure, which, see below).
The original code is:
unsigned keylocks(unsigned keys[], unsigned nkeys, unsigned locks[],
unsigned nlocks) {
unsigned i, j;
unsigned part1 = 0;
for (i=0; i<nlocks; i++) {
unsigned lock = locks[i];
for (j=0; j<nkeys; j++)
part1 += (lock & keys[j])==0;
}
return part1;
}
For keylocks2.c the central loops are:
for (i=0; i<UNROLL; i++)
part0[i]=0;
for (i=0; i<nlocks1; i+=UNROLL) {
for (j=0; j<nkeys1; j++) {
unsigned key = keys1[j];
for (k=0; k<UNROLL; k++)
part0[k] += (locks1[i+k] & key)==0;
}
}
For UNROLL I tried 8, 16, and 32 for AVX2 and 16, 32, or 64 for
AVX-512; the numbers below are for those factors that produce the
lowest cycles on the Rocket Lake machine.
The central loops are preceded by code to arrange the data such that
this code works: locks are copied to the longer locks1; the length of
locks1 is a multiple of UNROLL, and the entries beyond nlocks are ~0
(so they increase the count by 0); the keys are copied to keys1 (with 0
removed so that the extra locks are not counted, and that also may
increase efficiency if there is a key=0). The central loops are
followed by summing up the elements of part0.
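As a concrete illustration, here is a minimal sketch of that arrangement step. This is my reconstruction from the description above, not the code actually used in keylocks2.c; the function name arrange() is made up, and the names locks1/keys1/UNROLL simply follow the text.

#include <stdlib.h>

#ifndef UNROLL
#define UNROLL 8   /* illustrative; the real factor is a compile-time option */
#endif

/* Pad the locks to a multiple of UNROLL with ~0 (a ~0 lock ANDs to non-zero
   with every non-zero key, so the padding adds 0 to the count), and drop
   0 keys (a 0 key would otherwise be counted against the padding locks). */
static void arrange(const unsigned *locks, unsigned nlocks,
                    const unsigned *keys, unsigned nkeys,
                    unsigned **locks1, unsigned *nlocks1,
                    unsigned **keys1, unsigned *nkeys1)
{
  unsigned i, n1 = (nlocks + UNROLL - 1) / UNROLL * UNROLL, k1 = 0;
  unsigned *l1 = malloc(n1 * sizeof *l1);
  unsigned *ky1 = malloc(nkeys * sizeof *ky1);
  for (i = 0; i < nlocks; i++)
    l1[i] = locks[i];
  for (; i < n1; i++)
    l1[i] = ~0u;                 /* padding entries, never counted */
  for (i = 0; i < nkeys; i++)
    if (keys[i] != 0)
      ky1[k1++] = keys[i];
  *locks1 = l1; *nlocks1 = n1;
  *keys1 = ky1; *nkeys1 = k1;
}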
keylocks3.c, which uses the gcc vector extensions, just changes
keylocks2.c in a few places. In particular, it adds a type vu:
typedef unsigned vu __attribute__ ((vector_size
(UNROLL*sizeof(unsigned))));
The central loops now look as follows:
for (i=0; i<UNROLL; i++)
part0[i]=0;
for (i=0; i<nlocks1; i+=UNROLL) {
vu lock = *(vu *)(locks1+i);
for (j=0; j<nkeys1; j++) {
part0 -= (lock & keys1[j])==0;
}
}
One interesting aspect of the gcc vector extensions is that the result
of comparing two vectors is 0 (false) or ~0 (true) (per element),
whereas for scalars the value for true is 1. Therefore the code above updates part0 with -=, whereas in keylocks2.c += is used.
While the use of ~0 is a good choice when designing a new programming language, I would have gone for 1 in the case of a vector extension
for C, for consistency with the scalar case; in combination with
hardware that produces ~0 (e.g., Intel SSE and AVX SIMD stuff), that
means that the compiler will introduce a negation in its intermediate representation at some point; I expect that compilers will usually be
able to optimize this negation away, but I would not be surprised at
cases where my expectation is disappointed.
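A tiny stand-alone example of that difference in truth values (gcc/clang vector extensions; the vector type and values here are purely for illustration and are not taken from keylocks3.c):

#include <stdio.h>

typedef int v4i __attribute__ ((vector_size (4 * sizeof(int))));

int main(void)
{
  int a = 5, b = 5;
  v4i va = {5, 6, 7, 8}, vb = {5, 5, 7, 7};
  int st = (a == b);        /* scalar true is 1 */
  v4i vt = (va == vb);      /* per-element true is ~0 (i.e., -1) */
  printf("scalar: %d\n", st);                                /* 1 */
  printf("vector: %d %d %d %d\n", vt[0], vt[1], vt[2], vt[3]); /* -1 0 -1 0 */
  /* hence the accumulator update above uses "part0 -= cmp" rather than += */
  return 0;
}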
keylocks3.c compiles without warning on clang, but the result usually segfaults (but sometimes does not, e.g., in the timed run on Zen4; it segfaults in other runs on Zen4). I have not investigated why this
happens, I just did not include results from runs where it segfaulted;
and I tried additional runs for keylocks3-512 on Zen4 in order to have
one result there.
I would have liked to compare the performance of my code against your
code, but your code apparently was destroyed by arbitrary line
breaking in your news-posting software.
Anyway, here are my results.
First cycles (which eliminates worries about turbo modes) and
instructions, then usec/call.
The cores are:
The number of instructions executed is
(reported on the Zen4; the columns are instruction count, compiler, AVX2/512 target, keylocks source variant, and UNROLL factor where applicable):
instructions
5_779_542_242 gcc avx2 1
3_484_942_148 gcc avx2 2 8
5_885_742_164 gcc avx2 3 8
7_903_138_230 clang avx2 1
7_743_938_183 clang avx2 2 8?
3_625_338_104 clang avx2 3 8?
4_204_442_194 gcc 512 1
2_564_142_161 gcc 512 2 32
3_061_042_178 gcc 512 3 16
7_703_938_205 clang 512 1
3_402_238_102 clang 512 2 16?
3_320_455_741 clang 512 3 16?
for gcc -mavx2 on keylocks3.c on Zen 4 an IPC of 6.44 is reported,
while microarchitecture descriptions report only a 6-wide renamer <https://chipsandcheese.com/p/amds-zen-4-part-1-frontend-and-execution-engine>.
My guess is that the front end combined some instructions (maybe
compare and branch) into a macro-op, and the renamer then processed 6 macro-ops that represented more instructions. The inner loop is
│190: vpbroadcastd (%rax),%ymm0
1.90 │ add $0x4,%rax
│ vpand %ymm2,%ymm0,%ymm0
1.09 │ vpcmpeqd %ymm3,%ymm0,%ymm0
0.41 │ vpsubd %ymm0,%ymm1,%ymm1
78.30 │ cmp %rdx,%rax
│ jne 190
and if the cmp and jne are combined into one macro-op, that would be
perfect for executing one iteration per cycle.
It's interesting that gcc's keylocks2-256 results in far fewer
instructions (and, in the end, fewer cycles). It unrolls the inner loop 8
times to process the keys in SIMD fashion, too, loading the keys one
ymm register at a time. In order to do that it arranges the locks in
8 different ymm registers in the outer loop, so the inner loop
performs 8 sequences similar to
vpand %ymm0,%ymm15,%ymm2
vpcmpeqd %ymm1,%ymm2,%ymm2
vpsubd %ymm2,%ymm4,%ymm4
surrounded by
300: vmovdqu (%rsi),%ymm0
add $0x20,%rsi
[8 3-instruction sequences]
cmp %rsi,%rdx
jne 300
It also uses 8 ymm accumulators, so not all of that fits into
registers, and three of the ANDed values are stored on the stack. For
Zen4 this could be improved by using only 2 accumulators. In any
case, the gcc people did something clever here, and I do not
understand how they got there from the source code, and why they did
not get there from keylocks1.c.
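To make the shape of that transformation explicit, here is a hedged sketch in plain C with the vector extensions of what "both loops in SIMD" looks like. This is my reconstruction of the idea, not gcc's output or the posted source; the function name keylocks_2d() is made up, and it assumes nlocks1 is a multiple of UNROLL and nkeys1 a multiple of 8.

#include <string.h>

#ifndef UNROLL
#define UNROLL 8          /* illustrative; the real factor is a build option */
#endif

typedef unsigned v8u __attribute__ ((vector_size (8 * sizeof(unsigned))));

/* Both dimensions in SIMD: UNROLL broadcast lock vectors, 8 keys per step. */
unsigned keylocks_2d(const unsigned *locks1, unsigned nlocks1,
                     const unsigned *keys1, unsigned nkeys1)
{
  unsigned i, j, k, e, part1 = 0;
  v8u zero = {0};
  for (i = 0; i < nlocks1; i += UNROLL) {
    v8u acc[UNROLL], lock[UNROLL];
    for (k = 0; k < UNROLL; k++) {
      acc[k] = zero;
      for (e = 0; e < 8; e++)
        lock[k][e] = locks1[i+k];        /* broadcast one lock per vector */
    }
    for (j = 0; j < nkeys1; j += 8) {
      v8u key;
      memcpy(&key, keys1 + j, sizeof key);          /* 8 keys at a time */
      for (k = 0; k < UNROLL; k++)
        acc[k] -= (v8u)((lock[k] & key) == 0);      /* ~0 per hit, so -= */
    }
    for (k = 0; k < UNROLL; k++)
      for (e = 0; e < 8; e++)
        part1 += acc[k][e];
  }
  return part1;
}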
For clang's keylocks3-256 the inner loop and the outer loop are each
unrolled two times, resulting in an inner loop like:
190: vpbroadcastd (%r12,%rbx,4),%ymm5
vpand %ymm3,%ymm5,%ymm6
vpand %ymm4,%ymm5,%ymm5
vpcmpeqd %ymm1,%ymm5,%ymm5
vpsubd %ymm5,%ymm2,%ymm2
vpcmpeqd %ymm1,%ymm6,%ymm5
vpsubd %ymm5,%ymm0,%ymm0
vpbroadcastd 0x4(%r12,%rbx,4),%ymm5
vpand %ymm4,%ymm5,%ymm6
vpand %ymm3,%ymm5,%ymm5
vpcmpeqd %ymm1,%ymm5,%ymm5
vpsubd %ymm5,%ymm0,%ymm0
vpcmpeqd %ymm1,%ymm6,%ymm5
vpsubd %ymm5,%ymm2,%ymm2
add $0x2,%rbx
cmp %rbx,%rsi
jne 190
This results in the lowest AVX2 cycles, and I expect that one could use
that approach without the crash problems and without adding too many cycles.
The clang -march=x86-64-v4 results have similar code (with twice as
much inner-loop unrolling in case of keylocks3-512), but they all only
use AVX2 instructions and there have been successful runs on a Zen2
(which does not support AVX-512). It seems that clang does not
support AVX-512, or it does not understand -march=x86-64-v4 to allow
more than AVX2.
The fewest instructions are executed with gcc's keylocks2-512, where the
inner loop is:
230: vpbroadcastd 0x4(%rax),%zmm4
vpbroadcastd (%rax),%zmm0
mov %edx,%r10d
add $0x8,%rax
add $0x2,%edx
vpandd %zmm4,%zmm8,%zmm5
vpandd %zmm0,%zmm8,%zmm9
vpandd %zmm4,%zmm6,%zmm4
vptestnmd %zmm5,%zmm5,%k1
vpandd %zmm0,%zmm6,%zmm0
vmovdqa32 %zmm7,%zmm5{%k1}{z}
vptestnmd %zmm9,%zmm9,%k1
vmovdqa32 %zmm3,%zmm9{%k1}{z}
vptestnmd %zmm4,%zmm4,%k1
vpsubd %zmm9,%zmm5,%zmm5
vpaddd %zmm5,%zmm2,%zmm2
vmovdqa32 %zmm7,%zmm4{%k1}{z}
vptestnmd %zmm0,%zmm0,%k1
vmovdqa32 %zmm3,%zmm0{%k1}{z}
vpsubd %zmm0,%zmm4,%zmm0
vpaddd %zmm0,%zmm1,%zmm1
cmp %r10d,%r8d
jne 230
Due to UNROLL=32, it deals with 2 zmm registers coming from the outer
loop at a time, and the inner loop is unrolled by a factor of 2, too.
It uses vptestnmd and a predicated vmovdqa32 instead of using vpcmpeqd (why?). Anyway, the code seems to rub Zen4 the wrong way, and it
performs only at 2.84 IPC, worse than the AVX2 code. Rocket Lake
performs slightly better, but still, the clang code for keylocks2-512
runs a bit faster without using AVX-512.
I also saw one case where the compiler botched it:
gcc -Wall -DUNROLL=16 -O3 -mavx2 -c keylocks3.c
[/tmp/keylock:155546] LC_NUMERIC=prog perf stat -e cycles -e
instructions keylocks3-256 603800000
Performance counter stats for 'keylocks3-256':
17_476_700_581 cycles
39_480_242_683 instructions # 2.26 insn per cycle
3.506995312 seconds time elapsed
3.507020000 seconds user
0.000000000 seconds sys
(cycles and timings on the 8700G). Here the compiler failed to
vectorize the comparison, and performed them using scalar instructions
(first extracting the data from the SIMD registers, and finally
inserting the result into SIMD registers, with additional overhead
from spilling registers). The result requires about 10 times more instructions than the UNROLL=8 variant and almost 20 times more
cycles.
On to timings per routine invocation:
On a 4.4GHz Haswell (whereas Michael S. measured on a 4GHz Haswell):
5.47us clang keylocks1-256 (5.66us for Michael S.'s "original code")
4.26us gcc keylocks1-256 (5.66us for Michael S.'s "original code")
2.38us gcc keylocks2-256 (2.18us for Michael S.'s manual vectorized code)
2.08us clang keylocks2-512 (2.18us for Michael S.'s manual vectorized code)
Michael S.'s "original code" performs similar on clang to my
keylocks1.c. clang's keylocks2-512 code is quite competetive with his
manual code.
On the Golden Cove of a Core i3-1315U (compared to the best result by
Terje Mathisen on a Core i7-1365U; the latter can run up to 5.2GHz
according to Intel, whereas the former can supposedly run up to
4.5GHz; I only ever measured at most 3.8GHz on our NUC, and this time
as well):
5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
4.93us clang keylocks1-256 on a 3.8GHz 1315U
4.17us gcc keylocks1-256 on a 3.8GHz 1315U
3.16us gcc keylocks2-256 on a 3.8GHz 1315U
2.38us clang keylocks2-512 on a 3.8GHz 1315U
I would have expected the clang keylocks1-256 to run slower, because
the compiler back-end is the same and the 1315U is slower. Measuring
cycles looks more relevant for this benchmark to me than measuring
time, especially on this core where AVX-512 is disabled and there is
no AVX slowdown.
- anton
On Sat, 08 Feb 2025 08:11:04 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Or by my own pasting mistake. I am still not sure whom to blame.
The mistake was tiny - absence of // at the beginning of one line, but
enough to not compile. Trying it for a second time:
if (li >=3D len || li <=3D 0)
First cycles (which eliminates worries about turbo modes) and
instructions, then usec/call.
I don't understand that.
For original code optimized by clang I'd expect 22,000 cycles and 5.15
usec per call on Haswell. Your numbers don't even resemble anything like
that.
instructions
5_779_542_242 gcc avx2 1
3_484_942_148 gcc avx2 2 8
5_885_742_164 gcc avx2 3 8
7_903_138_230 clang avx2 1
7_743_938_183 clang avx2 2 8?
3_625_338_104 clang avx2 3 8?
4_204_442_194 gcc 512 1
2_564_142_161 gcc 512 2 32
3_061_042_178 gcc 512 3 16
7_703_938_205 clang 512 1
3_402_238_102 clang 512 2 16?
3_320_455_741 clang 512 3 16?
I don't understand these numbers either. For original clang, I'd expect 25,000 instructions per call.
Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
That could be due to differences in measurement methodology - I
reported the median of 11 runs, you seem to report an average.
On the Golden Cove of a Core i3-1315U (compared to the best result by
Terje Mathisen on a Core i7-1365U; the latter can run up to 5.2GHz
according to Intel, whereas the former can supposedly run up to
4.5GHz; I only ever measured at most 3.8GHz on our NUC, and this time
as well):
I always thought that NUCs have better cooling than all but high-end laptops. Was I wrong? Such slowness is disappointing.
5.25us Terje Mathisen's Rust code compiled by clang (best on the
1365U) 4.93us clang keylocks1-256 on a 3.8GHz 1315U
4.17us gcc keylocks1-256 on a 3.8GHz 1315U
3.16us gcc keylocks2-256 on a 3.8GHz 1315U
2.38us clang keylocks2-512 on a 3.8GHz 1315U
So, for the best-performing variant the IPC of Golden Cove is identical to ancient Haswell?
That's very disappointing. Haswell has a 4-wide front
end and the majority of AVX2 integer instructions are limited to a throughput
of two per clock. Golden Cove has a 5+ wide front end and nearly all AVX2 integer instructions have a throughput of three per clock.
Could it be that clang introduced some sort of latency bottleneck?
I would have expected the clang keylocks1-256 to run slower, because
the compiler back-end is the same and the 1315U is slower. Measuring
cycles looks more relevant for this benchmark to me than measuring
time, especially on this core where AVX-512 is disabled and there is
no AVX slowdown.
I prefer time, because in the end it's the only thing that matters.
Michael S <already5chosen@yahoo.com> writes:
On Sat, 08 Feb 2025 08:11:04 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Or by my own pasting mistake. I am still not sure whom to blame.
The mistake was tiny - absence of // at the beginning of one line, but enough to not compile. Trying it for a second time:
Now it's worse, it's quoted-printable. E.g.:
if (li >=3D len || li <=3D 0)
Some newsreaders can decode this, mine does not.
First cycles (which eliminates worries about turbo modes) and
instructions, then usec/call.
I don't understand that.
For original code optimized by clang I'd expect 22,000 cycles and
5.15 usec per call on Haswell. Your numbers don't even resemble
anything like that.
My cycle numbers are for the whole program that calls keylocks()
100_000 times.
If you divide the cycles by 100000, you get 21954 for clang
keylocks1-256, which is what you expect.
instructions
5_779_542_242 gcc avx2 1
3_484_942_148 gcc avx2 2 8
5_885_742_164 gcc avx2 3 8
7_903_138_230 clang avx2 1
7_743_938_183 clang avx2 2 8?
3_625_338_104 clang avx2 3 8?
4_204_442_194 gcc 512 1
2_564_142_161 gcc 512 2 32
3_061_042_178 gcc 512 3 16
7_703_938_205 clang 512 1
3_402_238_102 clang 512 2 16?
3_320_455_741 clang 512 3 16?
I don't understand these numbers either. For original clang, I'd
expect 25,000 instructions per call.
clang keylocks1-256 performs 79031 instructions per call (divide the
number given by 100000 calls). If you want to see why that is, you
need to analyse the code produced by clang, which I did only for
select cases.
Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
That could be due to differences in measurement methodology - I
reported the median of 11 runs, you seem to report an average.
I just report one run with 100_000 calls, and just hope that the
variation is small:-) In my last refereed paper I use 30 runs and
median, but I don't go to these lengths here; the cycles seem pretty repeatable.
On the Golden Cove of a Core i3-1315U (compared to the best result
by Terje Mathisen on a Core i7-1365U; the latter can run up to
5.2GHz according to Intel, whereas the former can supposedly run
up to 4.5GHz; I only ever measured at most 3.8GHz on our NUC, and
this time as well):
I always thought that NUCs have better cooling than all but high-end laptops. Was I wrong? Such slowness is disappointing.
The cooling may be better or not, that does not come into play here,
as it never reaches higher clocks, even when it's cold; E-cores also
stay 700MHz below their rated turbo speed, even when it's the only
loaded core. One theory I have is that one option we set up in the
BIOS has the effect of limiting turbo speed, but it has not been
important enough to test.
5.25us Terje Mathisen's Rust code compiled by clang (best on the
1365U) 4.93us clang keylocks1-256 on a 3.8GHz 1315U
4.17us gcc keylocks1-256 on a 3.8GHz 1315U
3.16us gcc keylocks2-256 on a 3.8GHz 1315U
2.38us clang keylocks2-512 on a 3.8GHz 1315U
So, for the best-performing variant the IPC of Golden Cove is identical
to ancient Haswell?
Actually worse:
For clang keylocks2-512 Haswell has 3.73 IPC, Golden Cove 3.63.
That's very disappointing. Haswell has a 4-wide front
end and the majority of AVX2 integer instructions are limited to a throughput
of two per clock. Golden Cove has a 5+ wide front end and nearly all
AVX2 integer instructions have a throughput of three per clock.
Could it be that clang introduced some sort of latency bottleneck?
As far as I looked into the code, I did not see such a bottleneck.
Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
clang keylocks2-256), and if there were a general latency bottleneck, I would expect it to suffer from it, too. Rocket Lake is also faster on this program
than Haswell and Golden Cove. It seems to be just that this program
rubs Golden Cove the wrong way.
I would have expected the clang keylocks1-256 to run slower,
because the compiler back-end is the same and the 1315U is slower.
Measuring cycles looks more relevant for this benchmark to me
than measuring time, especially on this core where AVX-512 is
disabled and there is no AVX slowdown.
I prefer time, because in the end it's the only thing that matters.
True, and certainly, when stuff like AVX-512 license-based
downclocking or thermal or power limits come into play (and are
relevant for the measurement at hand), one has to go there. But then
you can only compare code running on the same kind of machine,
configured the same way. Or maybe just running on the same
machine:-). But then, the generality of the results is questionable.
- anton
On 2/21/2025 1:51 PM, EricP wrote:
BGB wrote:
Can note that the latency of carry-select adders is a little weird:
16/32/64: Latency goes up steadily;
But, still less than linear;
128-bit: Only slightly more latency than 64-bit.
The best I could find in past testing was seemingly 16-bit chunks for
normal adding. Where, 16-bits seemed to be around the break-even
between the chained CARRY4's and the Carry-Select (CS being slower
below 16 bits).
But, for a 64-bit adder, still basically need to give it a
clock-cycle to do its thing. Though, not like 32 is particularly fast
either; hence part of the whole 2 cycle latency on ALU ops thing.
Mostly has to do with ADD/SUB (and CMP, which is based on SUB).
Admittedly part of why I have such mixed feelings on full
compare-and-branch:
Pro: It can offer a performance advantage (in terms of per-clock);
Con: Branch is now beholden to the latency of a Subtract.
IIRC your cpu clock speed is about 75 MHz (13.3 ns)
and you are saying it takes 2 clocks for a 64-bit ADD.
The 75MHz was mostly experimental, mostly I am running at 50MHz because
it is easier (a whole lot of corners need to be cut for 75MHz, so often overall performance ended up being worse).
Via the main ALU, which also shares the logic for SUB and CMP and
similar...
Generally, I give more or less a full cycle for the ADD to do its thing,
with the result presented to the outside world on the second cycle,
where it can go through the register forwarding chains and similar.
This gives it a 2 cycle latency.
Operations with a 1 cycle latency need to feed their output directly
into the register forwarding logic.
In a pseudocode sense, something like:
tValB = IsSUB ? ~valB : valB;
tAddA0={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 0;
tAddA1={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 1;
tAddB0={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 0;
tAddB1={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 1;
tAddC0=...
...
tAddSbA = tCarryIn;
tAddSbB = tAddSbA ? tAddA1[16] : tAddA0[16];
tAddSbC = tAddSbB ? tAddB1[16] : tAddB0[16];
...
tAddRes = {
tAddSbD ? tAddD1[15:0] : tAddD0[15:0],
tAddSbC ? tAddC1[15:0] : tAddC0[15:0],
tAddSbB ? tAddB1[15:0] : tAddB0[15:0],
tAddSbA ? tAddA1[15:0] : tAddA0[15:0]
};
This works, but still need to ideally give it a full clock-cycle to do
its work.
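For readers who want to poke at the idea without an FPGA, here is a small behavioral C model of the same 16-bit-chunk carry-select scheme. It is a correctness sketch only (the function name csel_add64() is made up here); it says nothing about FPGA timing, and the chunk adds that happen in parallel in hardware are written as a sequential loop.

#include <stdint.h>
#include <stdio.h>

/* 64-bit add built from 16-bit chunks: both carry-in variants of each chunk
   are computed "speculatively", and the real carry just selects between
   them; SUB is handled as ~b with carry_in=1, as in the pseudocode above. */
static uint64_t csel_add64(uint64_t a, uint64_t b, int carry_in)
{
  uint64_t result = 0;
  int carry = carry_in;
  for (int chunk = 0; chunk < 4; chunk++) {
    uint32_t a16 = (a >> (16 * chunk)) & 0xFFFF;
    uint32_t b16 = (b >> (16 * chunk)) & 0xFFFF;
    uint32_t sum0 = a16 + b16;        /* assumes chunk carry-in 0 */
    uint32_t sum1 = a16 + b16 + 1;    /* assumes chunk carry-in 1 */
    uint32_t sum = carry ? sum1 : sum0;
    result |= (uint64_t)(sum & 0xFFFF) << (16 * chunk);
    carry = (sum >> 16) & 1;          /* the select chain */
  }
  return result;
}

int main(void)
{
  uint64_t a = 0x0123456789ABCDEFull, b = 0xFEDCBA9876543210ull;
  printf("%d\n", csel_add64(a, b, 0) == a + b);   /* prints 1 */
  printf("%d\n", csel_add64(a, ~b, 1) == a - b);  /* SUB via ~b, carry_in=1 */
  return 0;
}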
Note that one has to be careful with logic coupling, as if too many
things are tied together, one may get a "routing congestion" warning
message, and generally timing fails in this case...
Also, "inferring latch" warning is one of those "you really gotta go fix this" issues (both generally indicates Verilog bugs, and also negatively effects timing).
I don't remember what Xilinx chip you are using but this paper describes
how to do a 64-bit ADD at between 350 Mhz (2.8 ns) to 400 MHz (2.5 ns)
on a Virtex-5:
A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010
https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/
wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/
project_1_resources/Adders_MELECON_2010.pdf
As for Virtex: I am not made of money...
Virtex tends to be absurdly expensive high-end FPGAs.
Even the older Virtex chips are still absurdly expensive.
Kintex is considered mid range, but still too expensive, and mostly not usable in the free versions of Vivado (and there are no real viable FOSS alternatives to Vivado). When I tried looking at some of the "open
source" tools for targeting Xilinx chips, they were doing the hacky
thing of basically invoking Xilinx's tools in the background (which, if
used to target a Kintex, is essentially piracy).
Where, a valid FOSS tool would need to be able to do everything and
generate the bitstream itself.
Mostly I am using Spartan-7 and Artix-7.
Generally at the -1 speed grade (slowest, but cheapest).
These are mostly considered low-end and consumer-electronics oriented
FPGAs by Xilinx.
I have a QMTech board with an XC7A200T at -1, but generally, it seems to actually have a slightly harder time passing timing constraints than the XC7A100T in the Nexys A7 (possibly some sort of Vivado magic here).
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
Errm, skim, this doesn't really look like something you can pull off in normal Verilog.
Generally, one doesn't have control over how the components hook together;
one can only influence what happens based on how the Verilog is written.
You can just write:
reg[63:0] tValA;
reg[63:0] tValB;
reg[63:0] tValC;
tValC=tValA+tValB;
But, then it spits out something with a chain of 16 CARRY4's, so there
is a fairly high latency on the high order bits of the result.
Generally, Vivado synthesis seems to mostly be happy (at 50 MHz), if the total logic path length stays under around 12 or so. Paths with 15 or
more are often near the edge of failing timing.
At 75MHz, one has to battle with pretty much anything much over 8.
And, at 200MHz, you can have path lengths of 2 that are failing...
Like, it seemingly can't do much more than "FF -> LUT -> FF" at these
speeds.
On 2025-02-22 10:16 a.m., EricP wrote:
BGB wrote:
Generally, Vivado synthesis seems to mostly be happy (at 50 MHz), if
the total logic path length stays under around 12 or so. Paths with
15 or more are often near the edge of failing timing.
At 75MHz, one has to battle with pretty much anything much over 8.
And, at 200MHz, you can have path lengths of 2 that are failing...
Like, it seemingly can't do much more than "FF -> LUT -> FF" at these
speeds.
This can't just be left to the random luck of the wire router.
There must be something else that these commercial and academic users
are able to do to reliably optimize their design.
Maybe its a tool only available to big bucks customers.
This has me curious. I'm going to keep looking around.
I am sure it can be done as I have seen a lot of papers too with results in the hundreds of megahertz. It has got to be the manual placement and routing that helps. The routing in my design typically takes up about
80% of the delay. One can build circuits up out of individual primitive
gates in Verilog (or(), and(), etc) but for behavioral purposes I do not
do that, instead relying on the tools to generate the best combinations
of gates. It is a ton of work to do everything manually. I am happy to
have things work at 40 MHz even though 200 MHz may be possible with 10x
the work put into it. Typically running behavioural code. Doing things
mostly for my own edification. ( I have got my memory controller working
at 200 MHz, so it is possible).
One thing that I have found that helps is to use smaller modules and
tasks for repetitive code where possible. The tools seem to put together
a faster design if everything is smaller modules. I ponder it may have
to do with making place and route easier.
The incremental cost is in a sequencer in the AGU for handling cache...
line and possibly virtual page straddles, and a small byte shifter to
left shift the high order bytes. The AGU sequencer needs to know if the
line straddles a page boundary, if not then increment the 6-bit physical
line number within the 4 kB physical frame number, if yes then increment virtual page number and TLB lookup again and access the first line.
(Slightly more if multiple page sizes are supported, but same idea.)
For a load AGU merges the low and high fragments and forwards.
The hardware cost appears trivial, especially within an OoO core.
So there doesn't appear to be any reason to not handle this.
Am I missing something?
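A hedged C model of that sequencing may make it concrete. The line/page constants, the function name load8(), and the read_line_bytes() helper are assumptions for illustration; they stand in for the cache access and the TLB lookup, which are not modeled.

#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64u
#define PAGE_SIZE 4096u

/* Assumed helper: fetch len bytes starting at addr from one cache line. */
extern void read_line_bytes(uint64_t addr, unsigned len, uint8_t *dst);

uint64_t load8(uint64_t vaddr)
{
  unsigned off = (unsigned)(vaddr & (LINE_SIZE - 1));
  uint8_t buf[8];
  if (off + 8 <= LINE_SIZE) {
    read_line_bytes(vaddr, 8, buf);          /* no straddle: one access */
  } else {
    unsigned lo = LINE_SIZE - off;           /* bytes from the first line */
    int new_page = ((vaddr + lo) & (PAGE_SIZE - 1)) == 0;
    read_line_bytes(vaddr, lo, buf);         /* low fragment */
    /* if new_page, the sequencer does a second TLB lookup; otherwise it
       only increments the line number within the same 4 kB frame */
    (void)new_page;
    read_line_bytes(vaddr + lo, 8 - lo, buf + lo);  /* high fragment */
  }
  uint64_t x;
  memcpy(&x, buf, 8);                        /* merge and forward */
  return x;
}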
https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/...
This terrible unaligned access behavior is atypical even for low power
cores. Arm's Cortex A75 only takes 15 cycles in the worst case of
dependent accesses that are both misaligned.
Digging deeper with performance counters reveals executing each unaligned load instruction results in ~505 executed instructions.
P550 almost
certainly doesn't have hardware support for unaligned accesses.
Rather, it's likely raising a fault and letting an operating system
handler emulate it in software."
The OS must also be able to keep both pages in physical memory until
the access is complete, or there will be no progress. Should not be a problem these days, but the 48 pages or so potentially needed by VAX complicated the OS.
As you can see in the article below, the cost of NOT handling misaligned accesses in hardware is quite high in cpu clocks.
To my eye, the incremental cost of adding hardware support for
misaligned
to the AGU and cache data path should be quite low. The alignment
shifter
is basically the same: assuming a 64-byte cache line, LD still has to
shift any of the 64 bytes into position 0, and reverse for ST.
The incremental cost is in a sequencer in the AGU for handling cache
line and possibly virtual page straddles, and a small byte shifter to
left shift the high order bytes. The AGU sequencer needs to know if the
line straddles a page boundary, if not then increment the 6-bit physical
line number within the 4 kB physical frame number, if yes then increment virtual page number and TLB lookup again and access the first line.
(Slightly more if multiple page sizes are supported, but same idea.)
For a load AGU merges the low and high fragments and forwards.
I don't think there are line straddle consequences for coherence because there are no ordering guarantees for misaligned accesses.
The hardware cost appears trivial, especially within an OoO core.
So there doesn't appear to be any reason to not handle this.
Am I missing something?
https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
[about half way down]
"Before accessing cache, load addresses have to be checked against
older stores (and vice versa) to ensure proper ordering. If there is a dependency, P550 can only do fast store forwarding if the load and store addresses match exactly and both accesses are naturally aligned.
Any unaligned access, dependent or not, confuses P550 for hundreds of
cycles. Worse, the unaligned loads and stores don’t proceed in parallel.
An unaligned load takes 1062 cycles, an unaligned store takes
741 cycles, and the two together take over 1800 cycles.
This terrible unaligned access behavior is atypical even for low power
cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
dependent accesses that are both misaligned.
Digging deeper with performance counters reveals executing each
unaligned
load instruction results in ~505 executed instructions. P550 almost
certainly doesn’t have hardware support for unaligned accesses.
Rather, it’s likely raising a fault and letting an operating system
handler emulate it in software."
On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
I don't think there are line straddle consequences for coherence because there are no ordering guarantees for misaligned accesses.
Generally stated as:: Misaligned accesses cannot be considered ATOMIC.
Try it on an x86/x64. Straddle a l2 cache line and use it with a LOCK'ed
RMW. It should assert the BUS lock.
On 2/2/2025 10:45 AM, EricP wrote:
As you can see in the article below, the cost of NOT handling misaligned
accesses in hardware is quite high in cpu clocks.
To my eye, the incremental cost of adding hardware support for
misaligned
to the AGU and cache data path should be quite low. The alignment
shifter
is basically the same: assuming a 64-byte cache line, LD still has to
shift any of the 64 bytes into position 0, and reverse for ST.
The incremental cost is in a sequencer in the AGU for handling cache
line and possibly virtual page straddles, and a small byte shifter to
left shift the high order bytes. The AGU sequencer needs to know if the
line straddles a page boundary, if not then increment the 6-bit physical
line number within the 4 kB physical frame number, if yes then increment
virtual page number and TLB lookup again and access the first line.
(Slightly more if multiple page sizes are supported, but same idea.)
For a load AGU merges the low and high fragments and forwards.
I don't think there are line straddle consequences for coherence because
there are no ordering guarantees for misaligned accesses.
IMO, the main costs of unaligned access in hardware:
Cache may need two banks of cache lines
lets call them "even" and "odd".
an access crossing a line boundary may need both an even and odd
line;
slightly more expensive extract and insert logic.
The main costs of not having unaligned access in hardware:
Code either faults or performs like dog crap;
Some pieces of code need convoluted workarounds;
Some algorithms have no choice other than to perform like crap.
Even if most of the code doesn't need unaligned access, the parts that
do need it, significantly need it to perform well.
Well, at least excluding wonk in the ISA, say:
A load/store pair that discards the low-order bits;
An extract/insert instruction that operates on a register pair using the LOB's of the pointer.
In effect, something vaguely akin (AFAIK) to what existed on the DEC
Alpha.
The hardware cost appears trivial, especially within an OoO core.
So there doesn't appear to be any reason to not handle this.
Am I missing something?
For an OoO core, any cost difference in the L1 cache here is likely to
be negligible.
For anything much bigger than a small microcontroller, I would assume one designs a core that handles unaligned access effectively.
https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-
microarchitecture/
[about half way down]
"Before accessing cache, load addresses have to be checked against
older stores (and vice versa) to ensure proper ordering. If there is a
dependency, P550 can only do fast store forwarding if the load and store
addresses match exactly and both accesses are naturally aligned.
Any unaligned access, dependent or not, confuses P550 for hundreds of
cycles. Worse, the unaligned loads and stores don’t proceed in parallel. An unaligned load takes
741 cycles, and the two together take over 1800 cycles.
This terrible unaligned access behavior is atypical even for low power
cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
dependent accesses that are both misaligned.
Digging deeper with performance counters reveals executing each
unaligned
load instruction results in ~505 executed instructions. P550 almost
certainly doesn’t have hardware support for unaligned accesses.
Rather, it’s likely raising a fault and letting an operating system
handler emulate it in software."
An emulation fault, or something similarly nasty...
At that point, even turning any potentially unaligned load or store into
a runtime call is likely to be a lot cheaper.
Say:
__mem_ld_unaligned:
ANDI X15, X10, 7
BEQ .aligned, X15, X0
SUB X14, X10, X15
LW X12, 0(X14)
LW X13, 8(X14)
SLLI X14, X15, 3
LI X17, 64
SUB X16, X17, X14
SRL X12, X12, X14
SLL X13, X13, X16
OR X10, X12, X13
RET
.aligned:
LW X10, 0(X10)
RET
The aligned case is needed because an SRL by 64 will simply give back the input
(since (64&63)==0), which would otherwise break the merge.
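The same trick in C form, for comparison. This is a sketch, not the code actually used; the name uload64() is made up, it assumes a little-endian layout, and, like the assembly, it deliberately reads the two enclosing aligned words, so it is a model of a runtime helper rather than strictly conforming C.

#include <stdint.h>

/* Synthesize a possibly-unaligned 64-bit load from the two aligned 64-bit
   words that enclose it.  The aligned early-out is needed because a shift
   by 64 does not produce 0: in C it is undefined, and on most ISAs the
   count is taken mod 64, so the value would come through unchanged. */
uint64_t uload64(uintptr_t addr)
{
  unsigned off = (unsigned)(addr & 7);
  const uint64_t *base = (const uint64_t *)(addr - off);
  if (off == 0)
    return base[0];                    /* aligned case */
  unsigned lo_sh = 8 * off;            /* 8..56 */
  unsigned hi_sh = 64 - lo_sh;         /* 56..8, never 64 here */
  return (base[0] >> lo_sh) | (base[1] << hi_sh);
}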
Though not supported by GCC or similar, dedicated __aligned and
__unaligned keywords could help here, to specify which pointers are
aligned (no function call), unaligned (needs function call) and default (probably aligned).
....
On 2/2/2025 12:10 PM, Thomas Koenig wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
The OS must also be able to keep both pages in physical memory until
the access is complete, or there will be no progress. Should not be a
problem these days, but the 48 pages or so potentially needed by VAX
complicated the OS.
48 pages? What instruction would need that?
Hmm...
I ended up with a 4-way set associative TLB as it ended up being needed
to avoid the CPU getting stuck in a TLB-miss loop in the worst-case
scenario:
An instruction fetch where the line-pair crosses a page boundary (and L1
I$ misses) for an instruction accessing a memory address where the
line-pair also crosses a page boundary (and the L1 D$ misses).
One can almost get away with two-way, except that almost inevitably the
CPU would encounter and get stuck in an infinite TLB miss loop (despite
the seeming rarity, happens roughly once every few seconds or so).
....
BGB wrote:
On 2/2/2025 12:10 PM, Thomas Koenig wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
The OS must also be able to keep both pages in physical memory until
the access is complete, or there will be no progress. Should not be a >>>> problem these days, but the 48 pages or so potentially needed by VAX
complicated the OS.
48 pages? What instruction would need that?
Hmm...
I ended up with a 4-way set associative TLB as it ended up being
needed to avoid the CPU getting stuck in a TLB-miss loop in the
worst-case scenario:
An instruction fetch where the line-pair crosses a page boundary (and
L1 I$ misses) for an instruction accessing a memory address where the
line-pair also crosses a page boundary (and the L1 D$ misses).
One can almost get away with two-way, except that almost inevitably
the CPU would encounter and get stuck in an infinite TLB miss loop
(despite the seeming rarity, happens roughly once every few seconds or
so).
....
That is because you have a software managed TLB so all PTE's
referenced by an instruction must be resident in TLB for success.
If three PTEs are required by an instruction and they map to
the same 2-way row and conflict evict, then bzzzzt, livelock loop.
So you need at least as many set assoc TLB ways as the worst case VA's referenced by any instruction.
With a HW table walker you can just let it evict and reload.
On 2/2/2025 10:45 AM, EricP wrote:
Digging deeper with performance counters reveals executing each unaligned
load instruction results in ~505 executed instructions. P550 almost
certainly doesn’t have hardware support for unaligned accesses.
Rather, it’s likely raising a fault and letting an operating system
handler emulate it in software."
An emulation fault, or something similarly nasty...
At that point, even turning any potentially unaligned load or store into
a runtime call is likely to be a lot cheaper.
Michael S <already5chosen@yahoo.com> writes:
On Sat, 08 Feb 2025 08:11:04 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
That's very disappointing. Haswell has a 4-wide front
end and the majority of AVX2 integer instructions are limited to a throughput
of two per clock. Golden Cove has a 5+ wide front end and nearly all
AVX2 integer instructions have a throughput of three per clock.
Could it be that clang introduced some sort of latency bottleneck?
As far as I looked into the code, I did not see such a bottleneck.
Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
clang keylocks2-256), and I expect that it would suffer from a general latency bottleneck, too. Rocket Lake is also faster on this program
than Haswell and Golden Cove. It seems to be just that this program
rubs Golden Cove the wrong way.
You can find the source code and the binaries I measured at <http://www.complang.tuwien.ac.at/anton/keylock/>.
No, the real problem is when a compiler wants to auto-vectorize any code working with 1/2/4/8 byte items: All of a sudden the alignment
requirement went from the item stride to the vector register stride
(16/32/64 bytes).
The only way this can work is to have the compiler control _all_
allocations to make sure they are properly aligned, including code in libraries, or the compiler will be forced to use vector load/store
operations which do allow unaligned access.
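For what it is worth, the usual way to let the compiler use unaligned vector loads without any alignment bookkeeping is to go through memcpy into a vector object. A hedged sketch, not from any of the posted programs (the name sum_u32() is made up, and it assumes n is a multiple of 8):

#include <string.h>

typedef unsigned v8u __attribute__ ((vector_size (8 * sizeof(unsigned))));

/* Sums n unsigned ints without assuming that p is 32-byte aligned;
   compilers typically turn the memcpy into an unaligned vector load
   (e.g. vmovdqu) where such loads are cheap. */
unsigned sum_u32(const unsigned *p, unsigned n)
{
  v8u acc = {0};
  for (unsigned i = 0; i < n; i += 8) {
    v8u v;
    memcpy(&v, p + i, sizeof v);   /* unaligned-safe vector load */
    acc += v;
  }
  unsigned s = 0;
  for (unsigned i = 0; i < 8; i++)
    s += acc[i];
  return s;
}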
On 2025-02-03, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 2/2/2025 10:45 AM, EricP wrote:
Digging deeper with performance counters reveals executing each
unaligned
load instruction results in ~505 executed instructions. P550 almost
certainly doesn’t have hardware support for unaligned accesses. Rather, it’s likely raising a fault and letting an operating system handler emulate it in software."
An emulation fault, or something similarly nasty...
At that point, even turning any potentially unaligned load or store into a runtime call is likely to be a lot cheaper.
There are lots of potentially unaligned loads and stores. There are
very few actually unaligned loads and stores: On Linux-Alpha every
unaligned access is logged by default, and the number of
unaligned-access entries in the logs of our machines was relatively
small (on average a few per day). So trapping actual unaligned
accesses was faster than replacing potential unaligned accesses with
code sequences that synthesize the unaligned access from aligned
accesses.
If you compile regular C/C++ code that does not intentionally do any
nasty stuff, you will typically have zero unaligned loads or stores.
My machine still does not support unaligned accesses in hardware (it's
on the todo list), and it can run an awful lot of software without
problems.
The problem arises when the programmer *deliberately* does unaligned
loads and stores in order to improve performance. Or rather, if the programmer knows that the hardware supports unaligned loads and stores, he/she can use that to write faster code in some special cases.
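A typical instance of that kind of deliberate use (a generic idiom, hedged, not code from this thread; the name copy_8_to_16() is made up): copying a short, variable-length block with two possibly overlapping 8-byte accesses instead of a byte loop, which only pays off when unaligned accesses are cheap.

#include <stdint.h>
#include <string.h>

/* Copy n bytes, 8 <= n <= 16, with two 8-byte moves; the second move may
   overlap the first and both may be unaligned.  memcpy of a fixed size is
   the portable way to express the unaligned accesses in C. */
void copy_8_to_16(void *dst, const void *src, size_t n)
{
  uint64_t head, tail;
  memcpy(&head, src, 8);
  memcpy(&tail, (const char *)src + n - 8, 8);
  memcpy(dst, &head, 8);
  memcpy((char *)dst + n - 8, &tail, 8);
}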
On 2/22/2025 1:25 PM, Robert Finch wrote:
On 2025-02-22 10:16 a.m., EricP wrote:
BGB wrote:
On 2/21/2025 1:51 PM, EricP wrote:
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs,
2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
Errm, skim, this doesn't really look like something you can pull off
in normal Verilog.
Well that's what I'm trying to figure out because its not just this
paper
but a lot, like many hundreds, of papers I've read from commercial or
academic source that seem to be able to control the FPGA results
to a fine degree.
You could invoke some of the LE's directly as primitives in Verilog, but
then one has an ugly mess that will only work on a specific class of FPGA.
Generally though, one has access in terms of said primitives, rather
than control over the logic block.
Vs, say, code that will work with Verilator, Vivado, and Quartus,
without needing to be entirely rewritten for each.
Though, that said, my design might still need some reworking to be "effective" with Quartus or Altera hardware; or to use the available hardware.
Say, rather than like on a Spartan or Artix (pure FPGA), the Cyclone
FPGA's tend to include ARM hard processors, with the FPGA and ARM cores
able to communicate over a bus. The FPGA part of the DE10 apparently has
its own RAM chip, but it is SDRAM (rather than DDR2 or DDR3 like in a
lot of the Xilinx based boards).
Well, apart from some low-end boards which use QSPI SRAMs (though,
having looked, a lot of these RAMs are DRAM internally, but the RAM
module has its own RAM refresh logic).
This can't just be left to the random luck of the wire router.
There must be something else that these commercial and academic users
are able to do to reliably optimize their design.
Maybe it's a tool only available to big bucks customers.
This has me curious. I'm going to keep looking around.
I am sure it can be done as I have seen a lot of papers too with
results in the hundreds of megahertz. It has got to be the manual
placement and routing that helps. The routing in my design typically
takes up about 80% of the delay. One can build circuits up out of
individual primitive gates in Verilog (or(), and(), etc) but for
behavioral purposes I do not do that, instead relying on the tools to
generate the best combinations of gates. It is a ton of work to do
everything manually. I am happy to have things work at 40 MHz even
though 200 MHz may be possible with 10x the work put into it.
Typically running behavioural code. Doing things mostly for my own
edification. ( I have got my memory controller working at 200 MHz, so
it is possible).
One thing that I have found that helps is to use smaller modules and
tasks for repetitive code where possible. The tools seem to put
together a faster design if everything is smaller modules. I ponder it
may have to do with making place and route easier.
It is also possible to get higher speeds with smaller/simple designs.
But, yeah, also I can note in Vivado, that the timing does tend to be dominated more by "net delay" rather than "logic delay".
This is why my thoughts for a possible 75 MHz focused core would be to
drop down to 2-wide superscalar. It is more a question of what could be
done to try to leverage the higher clock-speed to an advantage (and not
lose too much performance in other areas).
On 2/21/2025 1:51 PM, EricP wrote:
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
Errm, skim, this doesn't really look like something you can pull off in normal Verilog.
Generally, one doesn't have control over how the components hook together;
one can only influence what happens based on how the Verilog is written.
You can just write:
reg[63:0] tValA;
reg[63:0] tValB;
reg[63:0] tValC;
tValC=tValA+tValB;
But, then it spits out something with a chain of 16 CARRY4's, so there
is a fairly high latency on the high order bits of the result.
BGB wrote:
On 2/22/2025 1:25 PM, Robert Finch wrote:
On 2025-02-22 10:16 a.m., EricP wrote:
BGB wrote:
On 2/21/2025 1:51 PM, EricP wrote:
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
Fast and Area Efficient Adder for Wide Data in Recent Xilinx
FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
Errm, skim, this doesn't really look like something you can pull
off in normal Verilog.
Well that's what I'm trying to figure out because its not just
this paper
but a lot, like many hundreds, of papers I've read from
commercial or academic source that seem to be able to control the
FPGA results to a fine degree.
You could invoke some of the LE's directly as primitives in
Verilog, but then one has an ugly mess that will only work on a
specific class of FPGA.
Generally though, one has access in terms of said primitives,
rather than control over the logic block.
Vs, say, code that will work with Verilator, Vivado, and Quartus,
without needing to be entirely rewritten for each.
Though, that said, my design might still need some reworking to be "effective" with Quartus or Altera hardware; or to use the
available hardware.
Ok but this "portability" appears to be costing you dearly.
Say, rather than like on a Spartan or Artix (pure FPGA), the
Cyclone FPGA's tend to include ARM hard processors, with the FPGA
and ARM cores able to communicate over a bus. The FPGA part of the
DE10 apparently has its own RAM chip, but it is SDRAM (rather than
DDR2 or DDR3 like in a lot of the Xilinx based boards).
Well, apart from some low-end boards which use QSPI SRAMs (though,
having looked, a lot of these RAMs are DRAM internally, but the RAM
module has its own RAM refresh logic).
I am sure it can be done as I have seen a lot of papers too with
This can't just be left to the random luck of the wire router.
There must be something else that these commercial and academic
users are able to do to reliably optimize their design.
Maybe its a tool only available to big bucks customers.
This has me curious. I'm going to keep looking around.
results in the hundreds of megahertz. It has got to be the manual
placement and routing that helps. The routing in my design
typically takes up about 80% of the delay. One can build circuits
up out of individual primitive gates in Verilog (or(), and(), etc)
but for behavioral purposes I do not do that, instead relying on
the tools to generate the best combinations of gates. It is a ton
of work to do everything manually. I am happy to have things work
at 40 MHz even though 200 MHz may be possible with 10x the work
put into it. Typically running behavioural code. Doing things
mostly for my own edification. ( I have got my memory controller
working at 200 MHz, so it is possible).
One thing that I have found that helps is to use smaller modules
and tasks for repetitive code where possible. The tools seem to
put together a faster design if everything is smaller modules. I
ponder it may have to do with making place and route easier.
It is also possible to get higher speeds with smaller/simple
designs.
But, yeah, also I can note in Vivado, that the timing does tend to
be dominated more by "net delay" rather than "logic delay".
This is why my thoughts for a possible 75 MHz focused core would be
to drop down to 2-wide superscalar. It is more a question of what
could be done to try to leverage the higher clock-speed to an
advantage (and not lose too much performance in other areas).
You are missing my point. You are trying to work around a problem with
low level module design by rearranging high level architecture
components.
It sounds like your ALU stage is taking about 20 ns to do an ADD
and that is having consequences that ripple through the design,
like taking an extra clock for result forwarding,
which causes performance issues when considering Compare And Branch,
and would cause a stall with back-to-back operations.
This goes back to module optimization where you said:
BGB wrote:
On 2/21/2025 1:51 PM, EricP wrote:
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
Fast and Area Efficient Adder for Wide Data in Recent Xilinx
FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
Errm, skim, this doesn't really look like something you can pull
off in normal Verilog.
Generally, one doesn't have control over how the components hook
together; one can only influence what happens based on how the
Verilog is written.
You can just write:
reg[63:0] tValA;
reg[63:0] tValB;
reg[63:0] tValC;
tValC=tValA+tValB;
But, then it spits out something with a chain of 16 CARRY4's, so
there is a fairly high latency on the high order bits of the
result.
It looks to me that Vivado intends that after you get your basic
design working, this module optimization is *exactly* what one is
supposed to do.
In this case the prototype design establishes that you need multiple
64-bit adders and the generic ones synthesis spits out are slow.
So you isolate that module off, use Verilog to drive the basic LE
selections, then iterate doing relative LE placement specifiers,
route the module, and when you get the fastest 64-bit adder you can
then lock down the netlist and save the module design.
Now you have a plug-in 64-bit adder module that runs at (I don't know
the speed difference between Virtex and your Spartan-7 so wild guess)
oh, say, 4 ns, to use multiple places... fetch, decode, alu, agu.
Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
isolate that module, optimize placement, route, lock down netlist,
and now you have a 5 ns plug-in ALU module.
Doing this you build up your own IP library of optimized hardware
modules.
As more and more modules are optimized the system synthesis gets
faster because much of the fine grain work and routing is already
done.
On 2/3/2025 12:55 AM, Anton Ertl wrote:
Rather, have something like an explicit "__unaligned" keyword or
similar, and then use the runtime call for these pointers.
Though "memcpy()" is usually a "simple to fix up" scenario.
On Sun, 2 Feb 2025 22:44:13 +0000, Chris M. Thomasson wrote:
On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
I don't think there are line straddle consequences for coherence because there are no ordering guarantees for misaligned accesses.
Generally stated as:: Misaligned accesses cannot be considered ATOMIC.
Try it on an x86/x64. Straddle a l2 cache line and use it with a LOCK'ed
RMW. It should assert the BUS lock.
Consider this approach when you have a cabinet of slid in servers,
each server having 128 cores, the cabinet being cache coherent,
and the cabinet having 4096 cores.
Can you say "it donna scale" ??
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 2 Feb 2025 22:44:13 +0000, Chris M. Thomasson wrote:
On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
I don't think there are line straddle consequences for coherence
because there are no ordering guarantees for misaligned accesses.
Generally stated as:: Misaligned accesses cannot be considered
ATOMIC.
Try it on an x86/x64. Straddle a l2 cache line and use it with a
LOCK'ed RMW. It should assert the BUS lock.
Consider this approach when you have a cabinet of slid in servers,
each server having 128 cores, the cabinet being cache coherent,
and the cabinet having 4096 cores.
Can you say "it donna scale" ??
We (3Leaf Systems) learned that the hard way 20 years ago. AMD and
Intel processors will sometimes assert the BUS lock under high
contention for a target cache line, even in cases where the access is
aligned and doesn't straddle a page boundary.
On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
As you can see in the article below, the cost of NOT handling misaligned
accesses in hardware is quite high in cpu clocks.
To my eye, the incremental cost of adding hardware support for
misaligned
to the AGU and cache data path should be quite low. The alignment
shifter
is basically the same: assuming a 64-byte cache line, LD still has to
shift any of the 64 bytes into position 0, and reverse for ST.
A handful of gates to detect misalignedness and recognize the line and
page crossing misalignments.
The alignment shifters are twice as big.
Now, while I accept these costs, I accept that others may not. I accept
these costs because of the performance issues when I don't.
The incremental cost is in a sequencer in the AGU for handling cache
line and possibly virtual page straddles, and a small byte shifter to
left shift the high order bytes. The AGU sequencer needs to know if the
line straddles a page boundary, if not then increment the 6-bit physical
line number within the 4 kB physical frame number, if yes then increment
virtual page number and TLB lookup again and access the first line.
(Slightly more if multiple page sizes are supported, but same idea.)
For a load AGU merges the low and high fragments and forwards.
I don't think there are line straddle consequences for coherence because
there are no ordering guarantees for misaligned accesses.
Generally stated as:: Misaligned accesses cannot be considered ATOMIC.
On Mon, 03 Feb 2025 13:49:46 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 2 Feb 2025 22:44:13 +0000, Chris M. Thomasson wrote:
On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
I don't think there are line straddle consequences for coherence because there are no ordering guarantees for misaligned accesses.
Generally stated as:: Misaligned accesses cannot be considered ATOMIC.
Try it on an x86/x64. Straddle a l2 cache line and use it with a LOCK'ed RMW. It should assert the BUS lock.
Consider this approach when you have a cabinet of slid in servers, each server having 128 cores, the cabinet being cache coherent, and the cabinet having 4096 cores.
Can you say "it donna scale" ??
We (3Leaf Systems) learned that the hard way 20 years ago. AMD and
Intel processors will sometimes assert the BUS lock under high
contention for a target cache line, even in cases where the access is
aligned and doesn't straddle a page boundary.
According to my understanding, last Intel or AMD processor that had
physical bus lock signal was released in Sep 2008. Likely not many
still left operating and even fewer used in production.
MitchAlsup1 wrote:
On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
As you can see in the article below, the cost of NOT handling misaligned accesses in hardware is quite high in cpu clocks.
To my eye, the incremental cost of adding hardware support for
misaligned
to the AGU and cache data path should be quite low. The alignment
shifter
is basically the same: assuming a 64-byte cache line, LD still has to
shift any of the 64 bytes into position 0, and reverse for ST.
A handful of gates to detect misalignedness and recognize the line and
page crossing misalignments.
The alignment shifters are twice as big.
Oh, right, twice the muxes and wires but the critical path length
should be the same - whatever a 64:1 mux is (3 gate delays?).
So the larger aligner for misaligned shouldn't slow down the whole cache
and penalize the normal aligned case.
Now, while I accept these costs, I accept that others may not. I accept
these costs because of the performance issues when I don't.
The incremental cost is in a sequencer in the AGU for handling cache
line and possibly virtual page straddles, and a small byte shifter to
left shift the high order bytes. The AGU sequencer needs to know if the
line straddles a page boundary, if not then increment the 6-bit physical line number within the 4 kB physical frame number, if yes then increment virtual page number and TLB lookup again and access the first line.
(Slightly more if multiple page sizes are supported, but same idea.)
For a load AGU merges the low and high fragments and forwards.
I don't think there are line straddle consequences for coherence because there are no ordering guarantees for misaligned accesses.
Generally stated as:: Misaligned accesses cannot be considered ATOMIC.
That too (I thought of that after hitting send).
What I was thinking of was: are there any coherence ordering issues if
in order to take advantage of the cache's access pipeline,
the AGU issues both accesses at once, low fragment first, high second,
and the cache has hit-under-miss, and the low fragment misses while
the high fragment hits, as the effect would be the equivalent of a
LD-LD or ST-ST bypass.
I don't immediately see a problem, but if there were then AGU would have
to do each fragment synchronously which would double the access latency
for misaligned loads.
BGB <cr88192@gmail.com> writes:
On 2/3/2025 12:55 AM, Anton Ertl wrote:
Rather, have something like an explicit "__unaligned" keyword or
similar, and then use the runtime call for these pointers.
There are people who think that it is ok to compile *p to anything if
p is not aligned, even on architectures that support unaligned
accesses. At least one of those people recommended the use of
memcpy(..., ..., sizeof(...)). Let's see what gcc produces on
rv64gc (where unaligned accesses are guaranteed to work):
[fedora-starfive:/tmp:111378] cat x.c
#include <string.h>
long uload(long *p)
{
long x;
memcpy(&x,p,sizeof(long));
return x;
}
[fedora-starfive:/tmp:111379] gcc -O -S x.c
[fedora-starfive:/tmp:111380] cat x.s
.file "x.c"
.option nopic
.text
.align 1
.globl uload
.type uload, @function
uload:
addi sp,sp,-16
lbu t1,0(a0)
BGB <cr88192@gmail.com> writes:
On 2/2/2025 10:45 AM, EricP wrote:
Digging deeper with performance counters reveals executing each unaligned load instruction results in ~505 executed instructions. P550 almost
certainly doesn’t have hardware support for unaligned accesses.
Rather, it’s likely raising a fault and letting an operating system
handler emulate it in software."
An emulation fault, or something similarly nasty...
At that point, even turning any potentially unaligned load or store into
a runtime call is likely to be a lot cheaper.
There are lots of potentially unaligned loads and stores. There are
very few actually unaligned loads and stores: On Linux-Alpha every
unaligned access is logged by default, and the number of
unaligned-access entries in the logs of our machines was relatively
small (on average a few per day). So trapping actual unaligned
accesses was faster than replacing potential unaligned accesses with
code sequences that synthesize the unaligned access from aligned
accesses.
Of course, if the cost of unaligned accesses is that high, you will
avoid them in cases like block copies where cheap unaligned accesses
would otherwise be beneficial.
- anton
Michael S wrote:
On Mon, 03 Feb 2025 13:49:46 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 2 Feb 2025 22:44:13 +0000, Chris M. Thomasson wrote:
On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
I don't think there are line straddle consequences for coherence
because there are no ordering guarantees for misaligned accesses.
Generally stated as:: Misaligned accesses cannot be considered ATOMIC.
Try it on an x86/x64. Straddle a l2 cache line and use it with a
LOCK'ed RMW. It should assert the BUS lock.
Consider this approach when you have a cabinet of slid-in servers,
each server having 128 cores, the cabinet being cache coherent,
and the cabinet having 4096 cores.
Can you say "it donna scale" ??
We (3Leaf Systems) learned that the hard way 20 years ago. AMD and
Intel processors will sometimes assert the BUS lock under high
contention for a target cache line, even in cases where the access is
aligned and doesn't straddle a page boundary.
According to my understanding, last Intel or AMD processor that had
physical bus lock signal was released in Sep 2008. Likely not many
still left operating and even fewer used in production.
Both Intel and AMD current manuals refer to system wide bus locks under
certain conditions, such as a LOCK RMW operation that straddles cache
lines in order to guarantee backwards compatible system wide atomicity.
Though the actual "bus locking" is likely done by broadcasting messages
on the coherence network rather than a LOCK# wire that runs to all cores.
That is fine for code that is being actively maintained and backward
data structure compatibility is not required (like those inside a kernel).
However for x86 there were a few billion lines of legacy code that likely
assumed 2-byte alignment, or followed the fp64 aligned to 32-bits advice,
and a C language that mandates structs be laid out in memory exactly as
specified (no automatic struct optimization). Also I seem to recall some
amount of squawking about SIMD when it required naturally aligned buffers.
As SIMD no longer requires alignment, presumably code no longer does so.
Also in going from 32 to 64 bits, data structures that contain pointers
now could find those 8-byte pointers aligned on 4-byte boundaries.
While the Linux kernel may not use many misaligned values,
I'd guess there is a lot of application code that does.
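As a concrete (made-up) illustration of the kind of legacy layout meant
here: with 2-byte packing on a typical LP64 compiler, a 4-byte int, an
fp64, and an 8-byte pointer all land on 2-byte boundaries. The struct and
field names are invented for the example.

#include <stdio.h>
#include <stddef.h>

#pragma pack(push, 2)           /* legacy 2-byte packing */
struct record {
    short  tag;                 /* offset 0                                  */
    int    id;                  /* offset 2  -> 4-byte int misaligned        */
    double value;               /* offset 6  -> fp64 not 8-byte aligned      */
    void  *next;                /* offset 14 -> 8-byte pointer, 2-byte bound */
};
#pragma pack(pop)

int main(void)
{
    printf("id @%zu, value @%zu, next @%zu, sizeof %zu\n",
           offsetof(struct record, id), offsetof(struct record, value),
           offsetof(struct record, next), sizeof(struct record));
    return 0;
}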
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
The OS must also be able to keep both pages in physical memory until
the access is complete, or there will be no progress. Should not be a
problem these days, but the 48 pages or so potentially needed by VAX
complicated the OS.
48 pages? What instruction would need that?
48 pages? What instruction would need that?
I've seen it somewhere but don't remember where:
One candidate would be the POLY (spelling?) polynomial evaluator with
all the arguments (indirectly?) loaded from misaligned addresses, all
straddling page boundaries?
According to Terje Mathisen <terje.mathisen@tmsw.no>:
48 pages? What instruction would need that?
I've seen it somewhere but don't remember where:
One candidate would be the POLY (spelling?) polynomial evaluator with
all the arguments (indirectly?) loaded from misaligned addresses, all
straddling page boundaries?
No, POLY only had three arguments, the argument, the degree, and the
table of multipliers. The table could be arbitrarily long but the instruction was restartable, saving the partial result on the stack
and setting the FPD (first part done) flag for when it resumes so it
only had to be able to load one table entry at a time.
MOVTC or MOVTUC were the worst, with six arguments, all of which could
have an indirect address and five of which could cross page
boundaries.
But it occurs to me that those instructions are also restartable, so
that only a single byte of the source and destination arguments need
to be addressable at a time. There are six possible indirect addresses
which can cross page boundaries for 12 pages, two lengths and a table
that can cross a page boundary for six more, and the source and
destination and fill, three more, and the instruction, two more.
That's a total of 23 pages, double it for the P0 or P1 page tables,
and it's only 46 pages.
That's still kind of a lot.
On 2/3/2025 1:41 PM, Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
That is fine for code that is being actively maintained and backward
data structure compatibility is not required (like those inside a
kernel).
However for x86 there were a few billion lines of legacy code that likely
assumed 2-byte alignment, or followed the fp64 aligned to 32-bits
advice,
and a C language that mandates structs be laid out in memory exactly as
specified (no automatic struct optimization). Also I seem to recall some
amount of squawking about SIMD when it required naturally aligned
buffers.
As SIMD no longer requires alignment, presumably code no longer does so.
Looking at Intel's optimization manual, they state in
"15.6 DATA ALIGNMENT FOR INTEL® AVX"
"Assembly/Compiler Coding Rule 65. (H impact, M generality) Align
data to 32-byte boundary when possible. Prefer store alignment
over load alignment."
and further down, about AVX-512,
"18.23.1 Align Data to 64 Bytes"
"Aligning data to vector length is recommended. For best results,
when using Intel AVX-512 instructions, align data to 64 bytes.
When doing a 64-byte Intel AVX-512 unaligned load/store, every
load/store is a cache-line split, since the cache-line is 64
bytes. This is double the cache line split rate of Intel AVX2
code that uses 32-byte registers. A high cache-line split rate in
memory-intensive code can cause poor performance."
This sounds reasonable, and good advice if you want to go
down SIMD lane.
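A minimal sketch of following that advice from C (buffer name and size are
arbitrary): aligned_alloc is C11, and 64-byte alignment keeps each 64-byte
AVX-512 load/store within a single cache line, avoiding the split rate the
manual warns about.

#include <stdlib.h>

int main(void)
{
    size_t n = 1024;                         /* element count, arbitrary */
    /* size must be a multiple of the alignment for aligned_alloc */
    float *buf = aligned_alloc(64, n * sizeof(float));
    if (!buf) return 1;
    for (size_t i = 0; i < n; i++)           /* any vectorizable loop body */
        buf[i] = (float)i * 0.5f;
    free(buf);
    return 0;
}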
This is, ironically, a place where SIMD via ganged registers has an
advantage over SIMD via large monolithic registers.
That's a total of 23 pages, double it for the P0 or P1 page tables,
and it's only 46 pages.
That's still kind of a lot.
Basically, VAX taught us why we did not want to do "all that" in
a single instruction;
while Intel 432 taught us why we did not want bit-aligned
decoders (and a lot of other things).
keylocks3.c compiles without warning on clang, but the result usually
segfaults (but sometimes does not, e.g., in the timed run on Zen4; it
segfaults in other runs on Zen4). I have not investigated why this
happens, I just did not include results from runs where it segfaulted;
and I tried additional runs for keylocks3-512 on Zen4 in order to have
one result there.
Thomas Koenig <tkoenig@netcologne.de> writes:
http://gcc.gnu.org/bugzilla is your friend.
In my experience it's a waste of time:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93765
That is stuck in WAITING.
Marcus wrote:
On 2025-02-03, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 2/2/2025 10:45 AM, EricP wrote:
Digging deeper with performance counters reveals executing each
unaligned
load instruction results in ~505 executed instructions. P550 almost
certainly doesn't have hardware support for unaligned accesses.
Rather, it's likely raising a fault and letting an operating system
handler emulate it in software."
An emulation fault, or something similarly nasty...
At that point, even turning any potentially unaligned load or store into
a runtime call is likely to be a lot cheaper.
There are lots of potentially unaligned loads and stores. There are
very few actually unaligned loads and stores: On Linux-Alpha every
unaligned access is logged by default, and the number of
unaligned-access entries in the logs of our machines was relatively
small (on average a few per day). So trapping actual unaligned
accesses was faster than replacing potential unaligned accesses with
code sequences that synthesize the unaligned access from aligned
accesses.
If you compile regular C/C++ code that does not intentionally do any
nasty stuff, you will typically have zero unaligned loads or stores.
My machine still does not support unaligned accesses in hardware (it's
on the todo list), and it can run an awful lot of software without
problems.
The problem arises when the programmer *deliberately* does unaligned
loads and stores in order to improve performance. Or rather, if the
programmer knows that the hardware supports unaligned loads and stores,
he/she can use that to write faster code in some special cases.
No, the real problem is when a compiler wants to auto-vectorize any code
working with 1/2/4/8 byte items: All of a sudden the alignment
requirement went from the item stride to the vector register stride
(16/32/64 bytes).
The only way this can work is to have the compiler control _all_
allocations to make sure they are properly aligned, including code in libraries, or the compiler will be forced to use vector load/store
operations which do allow unaligned access.
Terje
On Mon, 17 Feb 2025 9:37:57 +0000, Terje Mathisen wrote:
Marcus wrote:
On 2025-02-03, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 2/2/2025 10:45 AM, EricP wrote:
Digging deeper with performance counters reveals executing each
unaligned
load instruction results in ~505 executed instructions. P550 almost
certainly doesn't have hardware support for unaligned accesses.
Rather, it's likely raising a fault and letting an operating system
handler emulate it in software."
An emulation fault, or something similarly nasty...
At that point, even turning any potentially unaligned load or store into
a runtime call is likely to be a lot cheaper.
There are lots of potentially unaligned loads and stores. There are
very few actually unaligned loads and stores: On Linux-Alpha every
unaligned access is logged by default, and the number of
unaligned-access entries in the logs of our machines was relatively
small (on average a few per day). So trapping actual unaligned
accesses was faster than replacing potential unaligned accesses with
code sequences that synthesize the unaligned access from aligned
accesses.
If you compile regular C/C++ code that does not intentionally do any
nasty stuff, you will typically have zero unaligned loads stores.
My machine still does not support unaligned accesses in hardware (it's
on the todo list), and it can run an awful lot of software without
problems.
The problem arises when the programmer *deliberately* does unaligned
loads and stores in order to improve performance. Or rather, if the
programmer knows that the hardware supports unaligned loads and stores,
he/she can use that to write faster code in some special cases.
No, the real problem is when a compiler want to auto-vectorize any code
working with 1/2/4/8 byte items: All of a sudden the alignment
requirement went from the item stride to the vector register stride
(16/32/64 bytes).
If you provide misaligned access to SIMD registers, why not provide misaligned access to all memory references !?!
I made this argument several times in my career.
The only way this can work is to have the compiler control _all_
allocations to make sure they are properly aligned, including code in
libraries, or the compiler will be forced to use vector load/store
operations which do allow unaligned access.
Either the entire environment has to be "air tight" or the HW
provides misaligned access at low cost. {{Good luck on the air
tight thing...}}
MitchAlsup1 wrote:
On Mon, 17 Feb 2025 9:37:57 +0000, Terje Mathisen wrote:
Marcus wrote:
On 2025-02-03, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 2/2/2025 10:45 AM, EricP wrote:
Digging deeper with performance counters reveals executing each
unaligned
load instruction results in ~505 executed instructions. P550 almost
certainly doesn't have hardware support for unaligned accesses.
Rather, it's likely raising a fault and letting an operating system
handler emulate it in software."
An emulation fault, or something similarly nasty...
At that point, even turning any potentially unaligned load or store into
a runtime call is likely to be a lot cheaper.
There are lots of potentially unaligned loads and stores. There are
very few actually unaligned loads and stores: On Linux-Alpha every
unaligned access is logged by default, and the number of
unaligned-access entries in the logs of our machines was relatively
small (on average a few per day). So trapping actual unaligned
accesses was faster than replacing potential unaligned accesses with
code sequences that synthesize the unaligned access from aligned
accesses.
If you compile regular C/C++ code that does not intentionally do any
nasty stuff, you will typically have zero unaligned loads stores.
My machine still does not support unaligned accesses in hardware (it's
on the todo list), and it can run an awful lot of software without
problems.
The problem arises when the programmer *deliberately* does unaligned
loads and stores in order to improve performance. Or rather, if the
programmer knows that the hardware supports unaligned loads and stores,
he/she can use that to write faster code in some special cases.
No, the real problem is when a compiler want to auto-vectorize any code
working with 1/2/4/8 byte items: All of a sudden the alignment
requirement went from the item stride to the vector register stride
(16/32/64 bytes).
If you provide misaligned access to SIMD registers, why not provide
misaligned access to all memory references !?!
I made this argument several times in my career.
The only way this can work is to have the compiler control _all_
allocations to make sure they are properly aligned, including code in
libraries, or the compiler will be forced to use vector load/store
operations which do allow unaligned access.
Either the entire environment has to be "air tight" or the HW
provides misaligned access at low cost. {{Good luck on the air
tight thing...}}
This is just one of many details where we've agreed for a decade or two (three?). Some of them you persuaded me you were right, I don't remember
any obvious examples of the opposite, but most we figured out
independently. :-)
Terje
On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
It would take LESS total man-power world-wide and over-time to
simply make HW perform misaligned accesses.
I think the usual issue is that on low-end hardware, it is seen as
"better" to skip out on misaligned access in order to save some cost in
the L1 cache.
Though, not sure how this mixes with 16/32 ISAs: if one allows
misaligned 32-bit instructions, and a misaligned 32-bit instruction can
cross a cache-line boundary, one still has to deal with essentially the
same issues.
Another related thing I can note is internal store-forwarding within the
L1 D$ to avoid RAW and WAW penalties for multiple accesses to the same
cache line.
Say, it is less convoluted to do, say:
MOV.X R24, (SP, 0)
MOV.X R26, (SP, 16)
MOV.X R28, (SP, 32)
MOV.X R30, (SP, 48)
Then again, I have heard that apparently there are libraries that rely
on the global-rounding-mode behavior, but I have also heard of such
libraries having issues or non-determinism when mixed with other
libraries which try to set a custom rounding mode when these modes
disagree.
I prefer my strategy instead:
FADD/FSUB/FMUL:
Hard-wired Round-Nearest / RNE.
Does not modify FPU flags.
FADDG/FSUBG/FMULG:
Dynamic Rounding;
May modify FPU flags.
Can note that RISC-V burns 3 bits for FPU instructions always encoding a
rounding mode (whereas in my ISA, encoding a rounding mode other than
RNE or DYN requires a 64-bit encoding).
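For reference, a sketch of the RISC-V rm (funct3) encodings being referred
to, as I recall them from the F/D extension spec (worth double-checking
against the ratified document):

/* RISC-V F/D rounding-mode (rm / funct3) field values */
enum rv_rm {
    RM_RNE = 0,  /* round to nearest, ties to even          */
    RM_RTZ = 1,  /* round toward zero                       */
    RM_RDN = 2,  /* round down (toward -infinity)           */
    RM_RUP = 3,  /* round up (toward +infinity)             */
    RM_RMM = 4,  /* round to nearest, ties to max magnitude */
    /* 5 and 6 are reserved */
    RM_DYN = 7   /* use the dynamic rounding mode in fcsr   */
};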
On Sun, 23 Feb 2025 11:13:53 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
It looks to me that Vivado intends that after you get your basic
design working, this module optimization is *exactly* what one is
supposed to do.
In this case the prototype design establishes that you need multiple
64-bit adders and the generic ones synthesis spits out are slow.
So you isolate that module off, use Verilog to drive the basic LE
selections, then iterate doing relative LE placement specifiers,
route the module, and when you get the fastest 64-bit adder you can
then lock down the netlist and save the module design.
Now you have a plug-in 64-bit adder module that runs at (I don't know
the speed difference between Virtex and your Spartan-7 so wild guess)
oh, say, 4 ns, to use multiple places... fetch, decode, alu, agu.
Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
isolate that module, optimize placement, route, lock down netlist,
and now you have a 5 ns plug-in ALU module.
Doing this you build up your own IP library of optimized hardware
modules.
As more and more modules are optimized the system synthesis gets
faster because much of the fine grain work and routing is already
done.
It sounds like your 1st hand FPGA design experience is VERY outdated.
Michael S wrote:
On Sun, 23 Feb 2025 11:13:53 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
It looks to me that Vivado intends that after you get your basic
design working, this module optimization is *exactly* what one is
supposed to do.
In this case the prototype design establishes that you need
multiple 64-bit adders and the generic ones synthesis spits out
are slow. So you isolate that module off, use Verilog to drive the
basic LE selections, then iterate doing relative LE placement
specifiers, route the module, and when you get the fastest 64-bit
adder you can then lock down the netlist and save the module
design.
Now you have a plug-in 64-bit adder module that runs at (I don't
know the speed difference between Virtex and your Spartan-7 so
wild guess) oh, say, 4 ns, to use multiple places... fetch,
decode, alu, agu.
Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
isolate that module, optimize placement, route, lock down netlist,
and now you have a 5 ns plug-in ALU module.
Doing this you build up your own IP library of optimized hardware
modules.
As more and more modules are optimized the system synthesis gets
faster because much of the fine grain work and routing is already
done.
It sounds like your 1st hand FPGA design experience is VERY
outdated.
Never have, likely never will.
Nothing against them - looks easier than wire-wrapping TTL and 4000
CMOS. Though people do seem to spend an awful lot of time working
around certain deficiencies like the lack of >1 write ports on
register files, and the lack of CAM's. One would think market forces
would induce at least one supplier to add these and take the fpga
market by storm.
Also fpga's do seem prone to monopolistic locked-in pricing
(though not really different from any relational database vendor).
At least with TTL one could do an RFQ to 5 or 10 different suppliers.
I'm just trying to figure out what these other folks are doing to get bleeding edge performance from essentially the same tools and similar
chips.
I assume you are referring to the gui IDE interface for things like
floor planning where you click on a LE cells and set some attributes.
I also think I saw reference to locking down parts of the net list.
But there are a lot of documents to go through.
On 2025-02-24 12:28 p.m., Michael S wrote:
On Mon, 24 Feb 2025 11:52:38 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Sun, 23 Feb 2025 11:13:53 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
It looks to me that Vivado intends that after you get your basic
design working, this module optimization is *exactly* what one is
supposed to do.
In this case the prototype design establishes that you need
multiple 64-bit adders and the generic ones synthesis spits out
are slow. So you isolate that module off, use Verilog to drive the
basic LE selections, then iterate doing relative LE placement
specifiers, route the module, and when you get the fastest 64-bit
adder you can then lock down the netlist and save the module
design.
Now you have a plug-in 64-bit adder module that runs at (I don't
know the speed difference between Virtex and your Spartan-7 so
wild guess) oh, say, 4 ns, to use multiple places... fetch,
decode, alu, agu.
Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
isolate that module, optimize placement, route, lock down netlist,
and now you have a 5 ns plug-in ALU module.
Doing this you build up your own IP library of optimized hardware
modules.
As more and more modules are optimized the system synthesis gets
faster because much of the fine grain work and routing is already
done.
It sounds like your 1st hand FPGA design experience is VERY
outdated.
Never have, likely never will.
Nothing against them - looks easier than wire-wrapping TTL and 4000
CMOS. Though people do seem to spend an awful lot of time working
around certain deficiencies like the lack of >1 write ports on
register files, and the lack of CAM's. One would think market forces
would induce at least one supplier to add these and take the fpga
market by storm.
Your view is probably skewed by talking to soft core hobbyists.
Please realize that most professionals do not care about
high-performance soft core. Soft core is for control plane functions
rather than for data plane. Important features are ease of use,
reliability, esp. of software tools and small size. Performance is
rated low. Performance per clock is rated even lower. So, professional
do not develop soft cores by themselves. And OTS cores that they use
are not superscalar. Quite often not even fully pipelined.
It means, no, small SRAM banks with two independent write ports is not
a feature that FPGA pros would be excited about.
Also fpga's do seem prone to monopolistic locked-in pricing
(though not really different from any relational database vendor).
Cheap Chinese clones of X&A FPGAs from late 2000s and very early 2010s
certainly exist. I didn't encounter Chinese clones of slightly newer
devices, like Xilinx 7-series. But I didn't look hard for them. So,
wouldn't be surprised if they exist, too.
Right now, and almost full decade back, neither X nor A cares about low
end. They just continue to ship old chips, mostly charging old price or
rising a little.
At least with TTL one could do an RFQ to 5 or 10 different suppliers.
I'm just trying to figure out what these other folks are doing to get
bleeding edge performance from essentially the same tools and similar
chips.
I assume you are referring to the gui IDE interface for things like
floor planning where you click on a LE cells and set some attributes.
I also think I saw reference to locking down parts of the net list.
But there are a lot of documents to go through.
No, I mean floorplanning, as well as most other manual physical-level
optimization, are not used at all in 99 percent of FPGA designs that
started after year 2005.
Respecting I do not know that much about the work environment of FPGA
developers:
I have thought of FPGAs as more of a prototyping tool, or to be used in
one-off designs, proof-of-concept type things. In those cases one
probably does not care too much about manual operations; as was said,
one would be more interested in productivity of developers that comes
from reliable tools and being able to deal with things at a high level.
The vendors have a number of pre-made components that can be plugged
into a design, making it possible to sketch out a design very quickly,
with a couple of caveats, one being that one might be stuck with a
particular vendor.
CAMs can easily be implemented in FPGAs although they may have
multi-cycle latency.
One has only to research CAM implementation in
FPGAs. Register files with multiple ports are easily implemented with replication.
It may be nice to see a CAM component in a vendor library. Register files sometimes have bypassing requirements that might make it challenging to develop a generic component.
On Sat, 08 Feb 2025 17:46:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sat, 08 Feb 2025 08:11:04 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Or by my own pasting mistake. I am still not sure whom to blame.
The mistake was tiny - absence of // at the begining of one line,
but enough to not compile. Trying it for a second time:
Now it's worse, it's quoted-printable. E.g.:
if (li >=3D len || li <=3D 0)
Some newsreaders can decode this, mine does not.
First cycles (which eliminates worries about turbo modes) and
instructions, then usec/call.
I don't understand that.
For original code optimized by clang I'd expect 22,000 cycles and
5.15 usec per call on Haswell. Your numbers don't even resemble
anything like that.
My cycle numbers are for the whole program that calls keylocks()
100_000 times.
If you divide the cycles by 100000, you get 21954 for clang
keylocks1-256, which is what you expect.
instructions
5_779_542_242 gcc avx2 1
3_484_942_148 gcc avx2 2 8
5_885_742_164 gcc avx2 3 8
7_903_138_230 clang avx2 1
7_743_938_183 clang avx2 2 8?
3_625_338_104 clang avx2 3 8?
4_204_442_194 gcc 512 1
2_564_142_161 gcc 512 2 32
3_061_042_178 gcc 512 3 16
7_703_938_205 clang 512 1
3_402_238_102 clang 512 2 16?
3_320_455_741 clang 512 3 16?
I don't understand these numbers either. For original clang, I'd
expect 25,000 instructions per call.
clang keylocks1-256 performs 79031 instructions per call (divide the
number given by 100000 calls). If you want to see why that is, you
need to analyse the code produced by clang, which I did only for
select cases.
Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
Which could be due to differences in measurement methodology - I
reported median of 11 runs, you seem to report average.
I just report one run with 100_000 calls, and just hope that the
variation is small:-) In my last refereed paper I use 30 runs and
median, but I don't go to these lengths here; the cycles seem pretty repeatable.
On the Golden Cove of a Core i3-1315U (compared to the best
result by Terje Mathisen on a Core i7-1365U; the latter can run
up to 5.2GHz according to Intel, whereas the former can
supposedly run up to 4.5GHz; I only ever measured at most 3.8GHz
on our NUC, and this time as well):
I always thought that NUCs have better cooling than all, but
high-end laptops. Was I wrong? Such slowness is disappointing.
The cooling may be better or not, that does not come into play here,
as it never reaches higher clocks, even when it's cold; E-cores also
stay 700MHz below their rated turbo speed, even when it's the only
loaded core. One theory I have is that one option we set up in the
BIOS has the effect of limiting turbo speed, but it has not been
important enough to test.
5.25us Terje Mathisen's Rust code compiled by clang (best on the
1365U) 4.93us clang keylocks1-256 on a 3.8GHz 1315U
4.17us gcc keylocks1-256 on a 3.8GHz 1315U
3.16us gcc keylocks2-256 on a 3.8GHz 1315U
2.38us clang keylocks2-512 on a 3.8GHz 1315U
So, for the best-performing variant IPC of Golden Cove is
identical to ancient Haswell?
Actually worse:
For clang keylocks2-512 Haswell has 3.73 IPC, Golden Cove 3.63.
That's very disappointing. Haswell has 4-wide front
end and majority of AVX2 integer instruction is limited to
throughput of two per clock. Golden Cove has 5+ wide front end and
nearly all AVX2 integer instruction have throughput of three per
clock. Could it be that clang introduced some sort of latency
bottleneck?
As far as I looked into the code, I did not see such a bottleneck.
Also, Zen4 has significantly higher IPC on this variant (5.36 IPC
for clang keylocks2-256), and I expect that it would suffer from a
general latency bottleneck, too. Rocket Lake is also faster on
this program than Haswell and Golden Cove. It seems to be just
that this program rubs Golden Cove the wrong way.
I would have expected the clang keylocks1-256 to run slower,
because the compiler back-end is the same and the 1315U is
slower. Measuring cycles looks more relevant for this benchmark
to me than measuring time, especially on this core where AVX-512
is disabled and there is no AVX slowdown.
I prefer time, because at the end it's the only thing that matter.
True, and certainly, when stuff like AVX-512 license-based
downclocking or thermal or power limits come into play (and are
relevant for the measurement at hand), one has to go there. But
then you can only compare code running on the same kind of machine, configured the same way. Or maybe just running on the same
machine:-). But then, the generality of the results is
questionable.
- anton
Back to original question of the cost of misalignment.
I modified original code to force alignment in the inner loop:
#include <stdint.h>
#include <string.h>
int foo_tst(const uint32_t* keylocks, int len, int li)
{
if (li <= 0 || len <= li)
return 0;
int lix = (li + 31) & -32;
_Alignas(32) uint32_t tmp[lix];
memcpy(tmp, keylocks, li*sizeof(*keylocks));
if (lix > li)
memset(&tmp[li], 0, (lix-li)*sizeof(*keylocks));
int res = 0;
for (int i = li; i < len; ++i) {
uint32_t lock = keylocks[i];
for (int k = 0; k < lix; ++k)
res += (lock & tmp[k])==0;
}
return res - (lix-li)*(len-li);
}
Compiled with 'clang -O3 -march=haswell'
On the same Haswell Xeon it runs at 2.841 usec/call, i.e. almost
twice as fast as the original and only 1.3x slower than horizontally
unrolled variants.
So, at least on Haswell, unaligned AVX256 loads are slow.
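For anyone wanting to reproduce the numbers, a throwaway driver along
these lines should do; the 250 locks / 250 keys split, the call count,
and the random fill are assumptions of this sketch, not the original
harness.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int foo_tst(const uint32_t* keylocks, int len, int li);  /* from above */

int main(void)
{
    enum { LI = 250, LEN = 500, CALLS = 100000 };
    static uint32_t keylocks[LEN];
    for (int i = 0; i < LEN; i++)
        keylocks[i] = (uint32_t)rand();
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile int sink = 0;                  /* keep the calls from being elided */
    for (int c = 0; c < CALLS; c++)
        sink += foo_tst(keylocks, LEN, LI);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.3f usec/call (sink=%d)\n", ns / CALLS / 1000.0, sink);
    return 0;
}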
It takes Round Nearest Odd to perform Kahan-Babashuka Summation.
That is:: comply with IEEE 754-2019
I'd say, comply with mandatory requirements of IEEE 754-2019.
For optional requirements, be selective. Prefer those that can be
accessed from widespread languages (including incoming editions of
language standards) over the rest.
On Tue, 18 Feb 2025 02:55:33 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
It takes Round Nearest Odd to perform Kahan-Babashuka Summation.
Are you aware of any widespread hardware that supplies Round to Nearest
with tie broken to Odd? Or of any widespread language that can request
such rounding mode?
Until both, implementing RNO on niche HW looks to me as wastage of both
HW resources and of space in your datasheet.
On 2/17/2025 11:07 PM, Robert Finch wrote:
On 2025-02-17 8:00 p.m., BGB wrote:
On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
On Fri, 14 Feb 2025 21:14:11 +0000, BGB wrote:
On 2/13/2025 1:09 PM, Marcus wrote:
The problem arises when the programmer *deliberately* does unaligned
loads and stores in order to improve performance. Or rather, if the
programmer knows that the hardware supports unaligned loads and stores,
he/she can use that to write faster code in some special cases.
Pretty much.
This is partly why I am in favor of potentially adding explicit
keywords
for some of these cases, or to reiterate:
__aligned:
Inform compiler that a pointer is aligned.
May use a faster version if appropriate.
If a faster aligned-only variant exists of an instruction.
On an otherwise unaligned-safe target.
__unaligned: Inform compiler that an access is unaligned.
May use a runtime call or similar if necessary,
on an aligned-only target.
May do nothing on an unaligned-safe target.
None: Do whatever is the default.
Presumably, assume aligned by default,
unless target is known unaligned-safe.
It would take LESS total man-power world-wide and over-time to
simply make HW perform misaligned accesses.
I think the usual issue is that on low-end hardware, it is seen as
"better" to skip out on misaligned access in order to save some cost
in the L1 cache.
I always include support for unaligned accesses even with a 'low-end'
CPU. I think it is not that expensive and sure makes some things a lot
easier when handled in hardware. For Q+ it just runs two bus cycles if
the data spans a cache line and pastes results together as needed.
I had went aligned-only with some 32-bit cores in the past.
Whole CPU core fit into fewer LUTs than I currently spend on just the L1
D$...
Granted, some of these used a very minimal L1 cache design:
Only holds a single cache line.
The smallest cores I had managed had used a simplified SH-based design:
Fixed-length 16 bit instructions, with 16 registers;
Only (Reg) and (Reg, R0) addressing;
Aligned only;
No shift or multiply;
Where, say:
SH-4 -> BJX1-32 (Added features)
SH-4 -> B32V (Stripped down)
BJX1-32 -> BJX1-64A (64-bit, Modal Encoding)
B32V -> B64V (64-bit, Encoding Space Reorganizations)
B64V ~> BJX1-64C (No longer Modal)
Where, BJX1-64C was the end of this project (before I effectively did a soft-reboot).
Then transition phase:
B64V -> BtSR1 (Dropped to 32-bit, More Encoding Changes)
Significant reorganization.
Was trying to get optimize for code density closer to MSP430.
BtSR1 -> BJX2 (Back to 64-bit, re-adding features from BJX1-64C)
A few features added for BtSR1 were dropped again in BJX2.
The original form of BJX2 was still a primarily 16-bit ISA encoding, but
at this point pretty much mutated beyond recognition (and relatively few instructions were still in the same places that they were in SH-4).
For example (original 16-bit space):
0zzz:
SH-4: Ld/St (Rm,R0); also 0R and 1R spaces, etc.
BJX2: Ld/St Only (Rm) and (Rm,R0)
1zzz:
SH-4: Store (Rn, Disp4)
BJX2: 2R ALU ops
2zzz:
SH-4: Store (@Rn, @-Rn), ALU ops
BJX2: Branch Ops (Disp8), etc
3zzz:
SH-4: ALU ops
BJX2: 0R and 1R ops
4zzz:
SH-4: 1R ops
BJX2: Ld/St (SP, Disp4); MOV-CR, LEA
5zzz:
SH-4: Load (Rm, Disp4)
BJX2: Load (Unsigned), ALU ops
6zzz:
SH-4: Load (@Rm+ and @Rm), ALU
BJX2: FPU ops, CMP-Imm4
7zzz:
SH-4: ADD Imm8, Rn
BJX2: (XGPR 32-bit Escape Block)
8zzz:
SH-4: Branch (Disp8)
BJX2: Ld/St (Rm, Disp3)
9zzz:
SH-4: Load (PC-Rel)
BJX2: (XGPR 32-bit Escape Block)
Azzz:
SH-4: BRA Disp12
BJX2: MOV Imm12u, R0
Bzzz:
SH-4: BSR Disp12
BJX2: MOV Imm12n, R0
Czzz:
SH-4: Some Imm8 ops
BJX2: ADD Imm8, Rn
Dzzz:
SH-4: Load (PC-Rel)
BJX2: MOV Imm8, Rn
Ezzz:
SH-4: MOV Imm8, Rn
BJX2: (32-bit Escape, Predicated Ops)
Fzzz:
SH-4: FPU Ops
BJX2: (32-bit Escape, Unconditional Ops)
For the 16-bit ops, SH-4 had more addressing modes than BJX2:
SH-4: @Reg, @Rm+, @-Rn, @(Reg,R0), @(Reg,Disp4) @(PC,Disp8)
BJX2: (Rm), (Rm,R0), (Rm,Disp3), (SP,Disp4)
Although it may seem like it, I didn't just completely start over on the layout, but rather it was sort of an "ant-hill reorganization".
Say, for example:
1zzz and 5zzz were merged into 8zzz, reducing Disp by 1 bit
2zzz and 3zzz was partly folded into 0zzz and 1zzz
8zzz's contents were moved to 2zzz
4zzz and part of 0zzz were merged into 3zzz
...
A few CR's are still in the same places and SR still has a similar
layout I guess, ...
Early on, there was the idea that the 32-bit ops were prefix-modified versions of the 16-bit ops, but early on this symmetry broke and the 16
and 32-bit encoding spaces became independent of each other.
Though, the 32-bit F0 space still has some amount of similarity to the
16-bit space.
Later on I did some testing and performance comparisons, and realized
that using 32-bit encodings primarily (or exclusively) gave
significantly better performance than relying primarily or exclusively
on 16-bit ops. And at this point the ISA transitioned from a primarily
16-bit ISA (with 32-bit extension ops) to a primarily 32-bit ISA with a
16-bit encoding space. This transition didn't directly affect encodings,
but did effect how the ISA developed from then going forward (more so,
there was no longer an idea that the 16-bit ISA would need to be able to exist standalone; but now the 32-bit ISA did need to be able to exist standalone).
But, now newer forms of BJX2 (XG2 and XG3) have become almost
unrecognizable from early BJX2 (as an ISA still primarily built around
16-bit encodings).
Except that XG2's instruction layout still carries vestiges of its
origins as a prefix encoding. But, XG3 even makes this part disappear
(by reorganizing the bits to more closely resemble RISC-V's layout).
Well, and there is:
ZnmX -> ZXnm
But:
F0nm_ZeoX
I prefer my strategy instead:
FADD/FSUB/FMUL:
Hard-wired Round-Nearest / RNE.
Does not modify FPU flags.
FADDG/FSUBG/FMULG:
Dynamic Rounding;
May modify FPU flags.
Can note that RISC-V burns 3 bits for FPU instructions always encoding
a rounding mode (whereas in my ISA, encoding a rounding mode other
than RNE or DYN requiring a 64-bit encoding).
Q+ encodes rounding mode the same way as RISC-V as there are lots of bits
available in the instruction. Burning bits on the rounding mode seems
reasonable to me when bits are available.
Initially:
3 bits of entropy were eaten by the 16-bit space;
2 more bits were eaten by predication and WEX.
So, the initial ISA design for 32-bit ops had 5 less bits than in RISC-V land.
XG2 reclaimed the 16-bit space, but used the bits to expand all the
register fields to 6 bits.
Not many bits left to justify burning on a rounding mode.
And, my Imm/Disp fields were generally 3 bits less than RV.
Modified the PRED modifier in Q+ to take a predicate bit from one of
three registers used to supply bits. Previously an array of two-bit mask
values encoded in the instruction indicated to 1) ignore the predicate
bit 2) execute if predicate true or 3) execute if predicate false.
Since there were three reg specs available in the PRED modifier, it
seemed to make more sense to specify three regs instead of one. So now
it works 1) as before, 2) execute if bit in Ra is set, 3) execute if bit
in Rb is set, 4) execute if bit in Rc is set.
The same register may be specified for Ra, Rb, and Rc. Since there is
sign inversion available, the original operation may be mimicked by
specifying Ra, ~Ra.
In BJX2, all 32-bit instructions encode predication in 2 bits in each instruction.
In XG3, the space that would have otherwise encoded WEX was instead left
to RISC-V (to create a conglomerate ISA).
But, there is also the possibility to use XG3 by itself without any
RISC-V parts in the mix.
BGB <cr88192@gmail.com> wrote:
The smallest cores I had managed had used a simplified SH-based design:
Fixed-length 16 bit instructions, with 16 registers;
Only (Reg) and (Reg, R0) addressing;
Aligned only;
No shift or multiply;
You mean no variable shift, or no large shifts, you have to support divide
by 2, right?
On Tue, 18 Feb 2025 13:07:39 +0000, Michael S wrote:
On Tue, 18 Feb 2025 02:55:33 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
It takes Round Nearest Odd to perform Kahan-Babashuka Summation.
Are you aware of any widespread hardware that supplies Round to Nearest
with tie broken to Odd? Or of any widespread language that can request
such rounding mode?
No, No
Until both, implementing RNO on niche HW looks to me as wastage of both
HW resources and of space in your datasheet.
The way I implement it, it is only an additional 10± gates.
MitchAlsup1 wrote:
On Tue, 18 Feb 2025 13:07:39 +0000, Michael S wrote:
On Tue, 18 Feb 2025 02:55:33 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
It takes Round Nearest Odd to perform Kahan-Babashuka Summation.
Are you aware of any widespread hardware that supplies Round to Nearest
with tie broken to Odd? Or of any widespread language that can request
such rounding mode?
No, No
Until both, implementing RNO on niche HW looks to me as wastage of both
HW resources and of space in your datasheet.
They way I implement it, it is only an additional 10± gates.
With discrete logic, it should be identical to RNE, except for flipping
the ulp bit when deciding upon the rounding direction, right?
With a full 4-bit lookup table you need a few more gates, but that is
still the obvious way to implement rounding in SW. (It is only ceil()
and floor() that requires the sign bit as input, the remaining rounding
modes can make do with ulp+guard+sticky.)
Terje
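A small C sketch of that lookup, covering RNE, RNO, RTZ, RUP and RDN as a
function of (sign, ulp, guard, sticky); this is my own illustration of the
table, not anyone's FPU:

#include <stdbool.h>

enum rmode { RNE, RNO, RTZ, RUP, RDN };

/* Returns 1 if the magnitude should be incremented by one ulp. */
static int round_up(enum rmode rm, bool sign, bool ulp, bool guard, bool sticky)
{
    switch (rm) {
    case RNE: return guard && (sticky || ulp);    /* ties go to even */
    case RNO: return guard && (sticky || !ulp);   /* ties go to odd  */
    case RTZ: return 0;                           /* truncate        */
    case RUP: return !sign && (guard || sticky);  /* toward +inf     */
    case RDN: return  sign && (guard || sticky);  /* toward -inf     */
    }
    return 0;
}

Note that RNO differs from RNE only in the ulp term of the tie case, which
matches the "flip the ulp bit" observation above.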
Say, one could imagine an abstract model where Binary64 FADD works sort
of like:
sgnA=valA>>63;
sgnB=valB>>63;
expA=(valA>>52)&2047;
expB=(valB>>52)&2047;
fraA=(valA&((1ULL<<52)-1));
fraB=(valB&((1ULL<<52)-1));
if(expA!=0)fraA|=1ULL<<52;
if(expB!=0)fraB|=1ULL<<52;
fraA=fraA<<9; //9 sub ULP bits
fraB=fraB<<9;
shrA=(expB>=expA)?(expB-expA):0;
shrB=(expA>=expB)?(expA-expB):0;
sgn2A=sgnA; exp2A=expA; fra2A=fraA>>shrA;
sgn2B=sgnB; exp2B=expB; fra2B=fraB>>shrB;
//logical clock-edge here.
fr1C_A=fra2A+fra2B;
fr1C_B=fra2A-fra2B;
fr1C_C=fra2B-fra2A;
if(sgn2A^sgn2B)
{
if(fr1C_C>>63)
{ sgn1C=sgn2A; fra1C=fr1C_B; }
else
{ sgn1C=sgn2B; fra1C=fr1C_C; }
}
else
{ sgn1C=sgn2A; fra1C=fr1C_A; }
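// (assumption about the sketch: exp1C is implicitly max(exp2A, exp2B), and
//  at each "logical clock-edge" the 1C values are latched as the 2C values
//  used below; the sketch elides those latch assignments.)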
//logical clock-edge here.
if(fra2C>>62)
{ exp3C=exp2C+1; fra3C=fra2C>>1; }
else
{ shl=clz64(fra2C)-2; exp3C=exp2C-shl; fra3C=fra2C<<shl; }
//logical clock-edge here.
if((exp3C>=2047) || (exp3C<=0))
{ sgnC=sgn2C; expC=(exp3C<=0)?0:2047; fraC=0; }
else
{
sgnC=sgn2C; expC=exp3C; fraC=(fra3C>>9)&((1ULL<<52)-1);
//if rounding is done, it goes here.
}
valC=(sgnC<<63)|(expC<<52)|fraC;
//final clock edge.
//result is now ready.
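The elided "//if rounding is done" step could look roughly like the
following, reusing the variable names above and assuming fraC holds the 52
stored fraction bits while the low 9 bits of fra3C supply guard and sticky
(RNE only; a sketch, not part of the original model):

  guard  = (fra3C>>8)&1;
  sticky = (fra3C&255)!=0;
  ulp    = (fra3C>>9)&1;
  if(guard && (sticky || ulp))
  {
    fraC=fraC+1;
    if(fraC>>52)                 //carry out of the stored fraction
      { fraC=0; expC=expC+1; }   //renormalize (may push into overflow)
  }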
On Mon, 24 Feb 2025 19:52:49 +0000, Robert Finch wrote:
CAMs can easily be implemented in FPGAs although they may have
multi-cycle latency.
A CAM is a vector of XOR gate inputs that feed an AND gate.
A 5-bit CAM with valid bit is 3-gates in CMOS and 2-gates of delay.
It is only when there are lots of bits being CAMed does the latency
increase markedly -- OR when there are lots of entries being CAMed
but this is a FAN-IN buffering problem not a gate delay or gate logic problem.
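A behavioral C model of that structure (entry count and tag width are
arbitrary); the loop stands in for the per-entry XOR/AND match logic that
all runs in parallel in hardware:

#include <stdint.h>

#define CAM_ENTRIES 16

struct cam_entry {
    uint8_t valid;   /* valid bit, part of the match   */
    uint8_t tag;     /* 5-bit tag as in the example    */
};

/* Returns a one-hot match vector: bit i set if entry i is valid and its
   tag equals 'key'.  In hardware every entry does this comparison at once. */
static uint32_t cam_match(const struct cam_entry cam[CAM_ENTRIES], uint8_t key)
{
    uint32_t hits = 0;
    for (int i = 0; i < CAM_ENTRIES; i++)
        if (cam[i].valid && ((cam[i].tag ^ key) & 0x1F) == 0)
            hits |= 1u << i;
    return hits;
}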
One has only to research CAM implementation in
FPGAs. Register files with multiple ports are easily implemented with
replication.
Read ports can be added by replication, write ports cannot.
On 2/21/2025 1:51 PM, EricP wrote:
and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
Errm, skimming it, this doesn't really look like something you can pull
off in normal Verilog.
Generally, one doesn't have control over how the components hook together;
one can only influence what happens based on how one writes the Verilog.
You can just write:
reg[63:0] tValA;
reg[63:0] tValB;
reg[63:0] tValC;
tValC=tValA+tValB;
But, then it spits out something with a chain of 16 CARRY4's, so there
is a fairly high latency on the high order bits of the result.
On Mon, 24 Feb 2025 11:52:38 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Sun, 23 Feb 2025 11:13:53 -0500Never have, likely never will.
EricP <ThatWouldBeTelling@thevillage.com> wrote:
It looks to me that Vivado intends that after you get your basic
design working, this module optimization is *exactly* what one is
supposed to do.
In this case the prototype design establishes that you need
multiple 64-bit adders and the generic ones synthesis spits out
are slow. So you isolate that module off, use Verilog to drive the
basic LE selections, then iterate doing relative LE placement
specifiers, route the module, and when you get the fastest 64-bit
adder you can then lock down the netlist and save the module
design.
Now you have a plug-in 64-bit adder module that runs at (I don't
know the speed difference between Virtex and your Spartan-7 so
wild guess) oh, say, 4 ns, to use multiple places... fetch,
decode, alu, agu.
Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
isolate that module, optimize placement, route, lock down netlist,
and now you have a 5 ns plug-in ALU module.
Doing this you build up your own IP library of optimized hardware
modules.
As more and more modules are optimized the system synthesis gets
faster because much of the fine grain work and routing is already
done.
It sounds like your 1st hand FPGA design experience is VERY
outdated.
Nothing against them - looks easier than wire-wrapping TTL and 4000
CMOS. Though people do seem to spend an awful lot of time working
around certain deficiencies like the lack of >1 write ports on
register files, and the lack of CAM's. One would think market forces
would induce at least one supplier to add these and take the fpga
market by storm.
Your view is probably skewed by talking to soft core hobbyists.
Please realize that most professionals do not care about
high-performance soft core. Soft core is for control plane functions
rather than for data plane. Important features are ease of use,
reliability, esp. of software tools and small size. Performance is
rated low. Performance per clock is rated even lower. So, professional
do not develop soft cores by themselves. And OTS cores that they use
are not superscalar. Quite often not even fully pipelined.
It means, no, small SRAM banks with two independent write ports is not
a feature that FPGA pros would be excited about.
I'm just trying to figure out what these other folks are doing to get
bleeding edge performance from essentially the same tools and similar
chips.
I assume you are referring to the gui IDE interface for things like
floor planning where you click on a LE cells and set some attributes.
I also think I saw reference to locking down parts of the net list.
But there are a lot of documents to go through.
No, I mean florplanning, as well as most other manual physical-level optimization are not used at all in 99% percents of FPGA designs that
started after year 2005.
Michael S wrote:
No, I mean florplanning, as well as most other manual physical-level
optimization are not used at all in 99% percents of FPGA designs that
started after year 2005.
Is that because the auto place and route got good enough that it is unnecessary? Or maybe the fpga resources grew enough that autoroute
didn't have to struggle to find optimal positions and paths
(being an optimal packing problem and a traveling salesman problem).
Also BGB mentioned in another thread a while back that he was getting
what sounded like random variation of critical paths from run to run.
That suggests to me the automatic tools may not be properly recognizing
the different modules and produce some non-optimal positions or paths.
So giving it a hint that "this stuff goes together" might help.
Anyway, it should be testable. Inspect the auto placement's module wiring
and if there are any obviously crazy decisions then try the placement tool
and see if the speed improves or the critical path variation goes away.
On Tue, 25 Feb 2025 14:20:45 +0000, EricP wrote:
Michael S wrote:--------------------
No, I mean florplanning, as well as most other manual physical-level
optimization are not used at all in 99% percents of FPGA designs that
started after year 2005.
Is that because the auto place and route got good enough that it is
unnecessary? Or maybe the fpga resources grew enough that autoroute
didn't have to struggle to find optimal positions and paths
(being an optimal packing problem and a traveling salesman problem).
Athlon (1998) used hand placement with auto-routing. So, auto-routing has
been good enough since 2000 at the latest.
Also BGB mentioned in another thread a while back that he was getting
what sounded like random variation of critical paths from run to run.
That suggests to me the automatic tools may not be properly recognizing
the different modules and produce some non-optimal positions or paths.
So giving it a hint that "this stuff goes together" might help.
Consider the optimizer/place/route thingamabob; and a signal that
crosses from one module to another. The optimizer changes from
a 2-LUT delay to a 1 LUT delay, but now the fan-out of that LUT
doubles, so instead of speeding up, the signal path slows down.
On Tue, 18 Feb 2025 21:09:54 +0000, Terje Mathisen wrote:
MitchAlsup1 wrote:
On Tue, 18 Feb 2025 13:07:39 +0000, Michael S wrote:
On Tue, 18 Feb 2025 02:55:33 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
It takes Round Nearest Odd to perform Kahan-Babashuka Summation.
Are you aware of any widespread hardware that supplies Round to Nearest
with tie broken to Odd? Or of any widespread language that can request
such rounding mode?
No, No
Until both, implementing RNO on niche HW looks to me as wastage of both
HW resources and of space in your datasheet.
They way I implement it, it is only an additional 10± gates.
With discrete logic, it should be identical to RNE, except for flipping
the ulp bit when deciding upon the rounding direction, right?
Yes,
With a full 4-bit lookup table you need a few more gates, but that is
still the obvious way to implement rounding in SW. (It is only ceil()
and floor() that requires the sign bit as input, the remaining rounding
modes can make do with ulp+guard+sticky.
sign+ULP+Gard+sticky is all you ever need for any rounding mode
IEEE or beyond.
MitchAlsup1 wrote:
On Tue, 18 Feb 2025 21:09:54 +0000, Terje Mathisen wrote:
MitchAlsup1 wrote:
On Tue, 18 Feb 2025 13:07:39 +0000, Michael S wrote:
On Tue, 18 Feb 2025 02:55:33 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
It takes Round Nearest Odd to perform Kahan-Babashuka Summation.
Are you aware of any widespread hardware that supplies Round to Nearest
with tie broken to Odd? Or of any widespread language that can request
such rounding mode?
No, No
Until both, implementing RNO on niche HW looks to me as wastage of both
HW resources and of space in your datasheet.
They way I implement it, it is only an additional 10± gates.
With discrete logic, it should be identical to RNE, except for flipping
the ulp bit when deciding upon the rounding direction, right?
Yes,
With a full 4-bit lookup table you need a few more gates, but that is
still the obvious way to implement rounding in SW. (It is only ceil()
and floor() that requires the sign bit as input, the remaining rounding
modes can make do with ulp+guard+sticky.
sign+ULP+Gard+sticky is all you ever need for any rounding mode
IEEE or beyond.
That's what I believed all through the 2019 standards process and up to
a month or two ago:
In reality, the "NearestOrEven" rounding rule has an exception if/when
you need to round the largest possible fp number, with guard=1 and
sticky=0:
I.e. exactly halfway to the next possible value (which would be Inf)
In just this particular case, the OrEven part is skipped in favor of not rounding up, so leaving a maximum/odd mantissa.
In the same case but sticky=1 we do round up to Inf.
This unfortunately means that the rounding circuit needs to be combined
with an exp+mant==0b111...111 input. :-(
Terje
On 2/19/2025 11:31 AM, MitchAlsup1 wrote:------------------
On Wed, 19 Feb 2025 16:35:41 +0000, Terje Mathisen wrote:
sign+ULP+Gard+sticky is all you ever need for any rounding mode
IEEE or beyond.
That's what I believed all through the 2019 standards process and up to
a month or two ago:
In reality, the "NearestOrEven" rounding rule has an exception if/when
you need to round the largest possible fp number, with guard=1 and
sticky=0:
I.e. exactly halfway to the next possible value (which would be Inf)
In just this particular case, the OrEven part is skipped in favor of not
rounding up, so leaving a maximum/odd mantissa.
In the same case but sticky=1 we do round up to Inf.
This unfortunately means that the rounding circuit needs to be combined
with an exp+mant==0b111...111 input. :-(
You should rename that mode as "Round but stay finite"
So, does it overflow?...
Admittedly part of why I have such mixed feelings on full
compare-and-branch:
Pro: It can offer a performance advantage (in terms of per-clock);
Con: Branch is now beholden to the latency of a Subtract;
Con: It can't compare to a constant.
Where, detecting all zeroes is at least cheaper than a subtract. But, detecting all zeroes still isn't free (for 64b, ~ 10 LUTs and 3 LUTs
delay).
On Thu, 6 Feb 2025 21:36:38 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
BTW, when I timed 1000 calls to that 5-6 us program, to get around
the 100 ns timer resolution, each iteration ran in 5.23 us.
That measurement could be good enough on desktop. Or not.
It's certainly not good enough on a laptop and even less so on a server.
On a laptop I wouldn't be satisfied before I lock my program to a
particular core, then do something like 21 measurements with 100K calls
in each measurement (~10 sec total) and report median of 21.
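A Linux-specific sketch of that methodology; the core number, the run
counts, and work() are placeholders, and sched_setaffinity / clock_gettime
are the assumed primitives:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

extern void work(void);                /* stand-in for the code under test */

static int cmp_dbl(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    cpu_set_t set;                      /* pin to one core, here core 0 */
    CPU_ZERO(&set); CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);

    enum { RUNS = 21, CALLS = 100000 };
    double t[RUNS];
    for (int r = 0; r < RUNS; r++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int c = 0; c < CALLS; c++)
            work();
        clock_gettime(CLOCK_MONOTONIC, &b);
        t[r] = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    }
    qsort(t, RUNS, sizeof(double), cmp_dbl);
    printf("median: %.1f ns/call\n", t[RUNS / 2] / CALLS);
    return 0;
}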
BGB <cr88192@gmail.com> writes:
On 2/2/2025 10:45 AM, EricP wrote:
Digging deeper with performance counters reveals executing each unaligned
load instruction results in ~505 executed instructions. P550 almost
certainly doesn't have hardware support for unaligned accesses.
Rather, it's likely raising a fault and letting an operating system
handler emulate it in software."
An emulation fault, or something similarly nasty...
At that point, even turning any potentially unaligned load or store into
a runtime call is likely to be a lot cheaper.
There are lots of potentially unaligned loads and stores. There are
very few actually unaligned loads and stores: On Linux-Alpha every
unaligned access is logged by default, and the number of
unaligned-access entries in the logs of our machines was relatively
small (on average a few per day). So trapping actual unaligned
accesses was faster than replacing potential unaligned accesses with
code sequences that synthesize the unaligned access from aligned
accesses.
Of course, if the cost of unaligned accesses is that high, you will
avoid them in cases like block copies where cheap unaligned accesses
would otherwise be beneficial.
- anton
The problem arises when the programmer *deliberately* does unaligned
loads and stores in order to improve performance. Or rather, if the programmer knows that the hardware supports unaligned loads and stores, he/she can use that to write faster code in some special cases.
On Fri, 7 Feb 2025 15:23:51 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Fri, 7 Feb 2025 11:06:43 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Thu, 6 Feb 2025 21:36:38 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
BTW, when I timed 1000 calls to that 5-6 us program, to get
around teh 100 ns timer resolution, each iteration ran in 5.23
us.
That measurement could be good enough on desktop. Or not.
It certainly not good enough on laptop and even less so on server.
On laptop I wouldn't be sutisfied before I lok my program to
particualr core, then do something like 21 measurements with 100K
calls in each measurement (~10 sec total) and report median of
21.
Each measurement did 1000 calls, then I ran 100 such measurements.
The 5.23 us value was the lowest seen among the 100, with average a
bit more:
Slowest: 9205200 ns
Fastest: 5247500 ns
Average: 5672529 ns/iter
Part1 = 3338
My own (old, but somewhat kept up to date) cputype program reported
that it is a "13th Gen Intel(R) Core(TM) i7-1365U" according to
CPUID.
Is that sufficient to judge the performance?
Terje
Not really.
i7-1365U is a complicated beast. 2 "big" cores, 8 "medium" cores.
Frequency varies A LOT, 1.8 to 5.2 GHz on "big", 1.3 to 3.9 GHz on
"medium".
OK. It seems like the big cores are similar to what I've had
previously, i.e. each core supports hyperthreading, while the medium
ones don't. This results in 12 HW threads.
As I said above, on such a CPU I wouldn't believe the numbers before
the total duration of the test is 10 seconds and the test run is locked
to a particular core. As for 5 msec per measurement, that's enough, but
why not do longer measurements if you have to run for 10 sec
anyway?
The Advent of Code task required exactly 250 keys and 250 locks to be
tested; this of course fits easily in a corner of $L1 (2000 bytes).
The input file to be parsed was 43*500 = 21500 bytes long, so this
should also fit in $L1 when I run repeated tests.
Under Windows I can set thread affinity to lock a process to a given
core, but how do I know which are "Big" and "Medium"?
Trial and error?
I think big cores/threads tend to have the lower numbers, but I am not
sure that is universal.
Terje
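For what it's worth, a minimal Windows sketch (mine, not from the thread) of
that measurement discipline: pin the thread to one core, run many calls per
measurement, repeat, and report the median. bench_target() is a hypothetical
stand-in for the routine under test, and which core number corresponds to a
"big" core is machine-specific.
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

/* Hypothetical stand-in for the routine being timed. */
extern int bench_target(void);

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    /* Pin this thread to one core; adjust the mask for the core you want. */
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << 0);

    enum { RUNS = 21, CALLS = 100000 };
    double us_per_call[RUNS];
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    volatile int sink = 0;          /* keep the calls from being optimized away */
    for (int r = 0; r < RUNS; r++) {
        QueryPerformanceCounter(&t0);
        for (int c = 0; c < CALLS; c++)
            sink += bench_target();
        QueryPerformanceCounter(&t1);
        us_per_call[r] = (double)(t1.QuadPart - t0.QuadPart) * 1e6
                         / ((double)freq.QuadPart * CALLS);
    }
    qsort(us_per_call, RUNS, sizeof us_per_call[0], cmp_double);
    printf("median: %.3f us/call (sink=%d)\n", us_per_call[RUNS / 2], sink);
    return 0;
}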
In the meantime, I did a few measurements on a Xeon E3-1271 v3. That is a
rather old uarch - Haswell, the first core that supports AVX2. During the
tests it was running at 4.0 GHz.
1. Original code (rewritten in plain C), compiled with clang -O3 -march=ivybridge (no AVX2)
2. Original code (rewritten in plain C), compiled with clang -O3 -march=haswell (AVX2)
3. Manually vectorized AVX2 code, compiled with clang -O3 -march=skylake (AVX2)
Results were as follows (usec/call):
1 - 5.66
2 - 5.56
3 - 2.18
So my measurements, like yours, show that the clang autovectorized code
looks good but does not perform all that well.
Here is my manual code. The handling of the tail is too clever; I did not
have time to simplify it. Otherwise, for 250x250 it should perform about
the same as simpler code.
#include <stdint.h>
#include <immintrin.h>

int foo_tst(const uint32_t* keylocks, int len, int li)
{
    if (li >= len || li <= 0)
        return 0;

    const uint32_t* keyx = &keylocks[li];
    unsigned ni = len - li;
    __m256i res0 = _mm256_setzero_si256();
    __m256i res1 = _mm256_setzero_si256();
    __m256i res2 = _mm256_setzero_si256();
    __m256i res3 = _mm256_setzero_si256();
    const uint32_t* keyx_last = &keyx[ni & -32];
    for (; keyx != keyx_last; keyx += 32) {
        __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyx[0*8]);
        __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyx[1*8]);
        __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyx[2*8]);
        __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyx[3*8]);
        // for (int k = 0; k < li; ++k) {
        // for (int k = 0, nk = li; nk > 0; ++k, --nk) {
        for (const uint32_t* keyy = keylocks; keyy != &keylocks[li]; ++keyy) {
            // __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)&keylocks[k]));
            __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyy));
            res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
            res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
            res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
            res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
    }
    int res = 0;
    if (ni % 32) {
        uint32_t tmp[32];
        const uint32_t* keyy_last = &keylocks[li & -32];
        if (li % 32) {
            for (int k = 0; k < li % 32; ++k)
                tmp[k] = keyy_last[k];
            for (int k = li % 32; k < 32; ++k)
                tmp[k] = (uint32_t)-1;
        }
        const uint32_t* keyx_last = &keyx[ni % 32];
        int nz = 0;
        for (; keyx != keyx_last; keyx += 1) {
            if (*keyx) {
                __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyx));
                for (const uint32_t* keyy = keylocks; keyy != keyy_last; keyy += 32) {
                    __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyy[0*8]);
                    __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyy[1*8]);
                    __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyy[2*8]);
                    __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyy[3*8]);
                    res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
                    res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
                    res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
                    res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
                }
                if (li % 32) {
                    __m256i lock0 = _mm256_loadu_si256((const __m256i*)&tmp[0*8]);
                    __m256i lock1 = _mm256_loadu_si256((const __m256i*)&tmp[1*8]);
                    __m256i lock2 = _mm256_loadu_si256((const __m256i*)&tmp[2*8]);
                    __m256i lock3 = _mm256_loadu_si256((const __m256i*)&tmp[3*8]);
                    res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
                    res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
                    res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
                    res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
                }
            } else {
                nz += 1;
            }
        }
        res = nz * li;
    }
    // fold accumulators
    res0 = _mm256_add_epi32(res0, res2);
    res1 = _mm256_add_epi32(res1, res3);
    res0 = _mm256_add_epi32(res0, res1);
    res0 = _mm256_hadd_epi32(res0, res0);
    res0 = _mm256_hadd_epi32(res0, res0);
    res += _mm256_extract_epi32(res0, 0);
    res += _mm256_extract_epi32(res0, 4);
    return res;
}
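As a cross-check, a plain scalar version of what the routine above computes
(my addition, not part of the post): for every key in keylocks[li..len) and
every lock in keylocks[0..li), count the pairs whose bit patterns do not
overlap.
#include <stdint.h>
/* Scalar reference for foo_tst: count (key, lock) pairs with key & lock == 0. */
int foo_ref(const uint32_t *keylocks, int len, int li)
{
    if (li >= len || li <= 0)
        return 0;
    int res = 0;
    for (int i = li; i < len; i++)          /* keys */
        for (int k = 0; k < li; k++)        /* locks */
            res += (keylocks[i] & keylocks[k]) == 0;
    return res;
}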
On 2/13/2025 1:09 PM, Marcus wrote:
The problem arises when the programmer *deliberately* does unaligned
loads and stores in order to improve performance. Or rather, if the
programmer knows that the hardware supports unaligned loads and stores,
he/she can use that to write faster code in some special cases.
Pretty much.
This is partly why I am in favor of potentially adding explicit keywords
for some of these cases, or to reiterate (a hypothetical usage sketch
follows the list):
  __aligned:
    Inform the compiler that a pointer is aligned.
    May use a faster version if appropriate, i.e. if a faster
    aligned-only variant of an instruction exists on an otherwise
    unaligned-safe target.
  __unaligned:
    Inform the compiler that an access is unaligned.
    May use a runtime call or similar if necessary,
    on an aligned-only target.
    May do nothing on an unaligned-safe target.
  None:
    Do whatever is the default.
    Presumably, assume aligned by default,
    unless the target is known to be unaligned-safe.
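A purely hypothetical usage sketch of those proposed qualifiers (they are
not part of standard C and, with these semantics, not of mainstream
compilers; here they are stubbed out as empty macros so the fragment
compiles, with the intended effect described in comments):
#include <stdint.h>
#define __aligned    /* hypothetical: promise that the pointer is aligned  */
#define __unaligned  /* hypothetical: the access may be misaligned         */

/* Caller promises p is 8-byte aligned: the compiler could pick an
 * aligned-only (possibly faster) load form where the ISA has one. */
uint64_t read_aligned(const uint64_t *__aligned p)
{
    return *p;
}

/* Access may be misaligned: an aligned-only target would emit a byte-wise
 * or shift-and-merge sequence (or a runtime call); an unaligned-safe
 * target would emit a normal load. */
uint64_t read_unaligned(const uint64_t *__unaligned p)
{
    return *p;
}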
One can note that the latency of carry-select adders is a little weird:
  16/32/64 bit: latency goes up steadily, but still less than linearly;
  128-bit: only slightly more latency than 64-bit.
The best I could find in past testing was seemingly 16-bit chunks for
normal adds, with 16 bits being roughly the break-even point between
chained CARRY4s and carry-select (carry-select being slower below 16 bits).
But a 64-bit adder still basically needs a full clock cycle to do its
thing (and 32-bit is not particularly fast either), hence part of the
2-cycle latency on ALU ops. That is mostly down to ADD/SUB (and CMP,
which is based on SUB).
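For illustration, a behavioral sketch (my addition, in C rather than HDL) of
the carry-select structure being described: each 16-bit block computes both
possible sums up front, and a late select picks one once the incoming carry
is known, so only the block selects lie on the carry path.
#include <stdint.h>
/* Behavioral model of a 64-bit adder built from four 16-bit carry-select blocks. */
static uint64_t add64_carry_select(uint64_t a, uint64_t b)
{
    uint64_t sum = 0;
    unsigned carry = 0;
    for (int blk = 0; blk < 4; blk++) {
        uint32_t ablk = (uint32_t)(a >> (16 * blk)) & 0xFFFF;
        uint32_t bblk = (uint32_t)(b >> (16 * blk)) & 0xFFFF;

        uint32_t s0 = ablk + bblk;       /* candidate sum, carry-in = 0 */
        uint32_t s1 = ablk + bblk + 1;   /* candidate sum, carry-in = 1 */

        uint32_t s = carry ? s1 : s0;    /* late select on the real carry */
        sum |= (uint64_t)(s & 0xFFFF) << (16 * blk);
        carry = (s >> 16) & 1;           /* carry out to the next block */
    }
    return sum;
}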
Admittedly part of why I have such mixed feelings on full
compare-and-branch:
Pro: It can offer a performance advantage (in terms of per-clock);
Con: Branch is now beholden to the latency of a Subtract.