• Re: Cost of handling misaligned access

    From Anton Ertl@21:1/5 to Thomas Koenig on Tue Feb 4 10:09:09 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    [fedora-starfive:/tmp:111378] cat x.c
    #include <string.h>

    long uload(long *p)
    {
    long x;
    memcpy(&x,p,sizeof(long));
    return x;
    }
    [fedora-starfive:/tmp:111379] gcc -O -S x.c
    [fedora-starfive:/tmp:111380] cat x.s
    .file "x.c"
    .option nopic
    .text
    .align 1
    .globl uload
    .type uload, @function
    uload:
    addi sp,sp,-16
    lbu t1,0(a0)

    [...]

    With RISC-V, nobody ever knows what architecture he is compiling for...

    The compiler knew very well that it was generating code for RV64GC,
    and I knew that too.

    And in this particular case the exact variant does not matter. RISC-V
    is generally specified to support unaligned accesses, see below.

    Did you tell gcc specifically that unaligned access was supported in
    the architecture you were using?

    What specific option would that be, and why would I want to look it up
    and tell that to gcc? After all, that's specified in the RISC-V
    specification. Even a very old one <https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-62.pdf>
    says:

    |The base ISA supports misaligned accesses

    It continues

    |but these might run extremely slowly depending on the implementation.

    but that does not change the architecture (and you referred to the architecture).

    It's also true of other architectures, and not just in theory, but
    also in practice <http://al.howardknight.net/?ID=143135464800> <https://www.complang.tuwien.ac.at/anton/unaligned-stores/>; should I
    check for such an option for every architecture?

    But even assuming that I want to generate code tuned for RISC-V
    implementations where unaligned accesses are implemented so slowly
    that I would prefer that code containing only aligned accesses is
    generated, I would expect a compiler for which the memcpy workaround
    is recommended (such as gcc) to do better, much better than gcc
    actually does, e.g., something along the lines of:

    uload:
    addi a5,a0,7
    andi a4,a0,-8
    andi a5,a5,-8
    ld a2,0(a5)
    ld a3,0(a4)
    neg a4,a0
    andi a4,a4,7
    andi a0,a0,7
    slliw a4,a4,3
    slliw a5,a0,3
    sll a5,a3,a5
    sra a0,a2,a4
    or a0,a0,a5
    ret

    Fewer instructions, and also a better distribution between various
    functional units.

    IIRC it's only three instructions on MIPS and five instructions on
    Alpha, but they have special instructions for this case, because they
    were designed for it, whereas RISC-V was designed to have unaligned
    accesses.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Tue Feb 4 11:26:31 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    But even assuming that I want to generate code tuned for RISC-V implementations where unaligned accesses are implemented so slowly
    that I would prefer that code containing only aligned accesses is
    generated, I would expect a compiler for which the memcpy workaround
    is recommended (such as gcc) to do better, much better than gcc
    actually does, e.g., something along the lines of:

    http://gcc.gnu.org/bugzilla is your friend.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Tue Feb 4 18:07:38 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    http://gcc.gnu.org/bugzilla is your friend.

    In my experience it's a waste of time:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93765
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93811

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Tue Feb 4 18:58:43 2025
    On Tue, 4 Feb 2025 4:49:57 +0000, EricP wrote:

    MitchAlsup1 wrote:

    Basically, VAX taught us why we did not want to do "all that" in
    a single instruction; while Intel 432 taught us why we did not bit
    aligned decoders (and a lot of other things).

    In case people are interested...

    [paywalled]
    The Instruction Decoding Unit for the VLSI 432 General Data Processor,
    1981
    https://ieeexplore.ieee.org/abstract/document/1051633/

    The benchmarks in table 1(a) below tell it all:
    a 4 MHz 432 is 1/15 to 1/20 the speed of a 5 MHz VAX/780, and
    1/4 to 1/7 the speed of an 8 MHz 68000 or 5 MHz 8086.

    A Performance Evaluation of The Intel iAPX 432, 1982
    https://dl.acm.org/doi/pdf/10.1145/641542.641545

    And the reasons are covered here:

    Performance Effects of Architectural Complexity in the Intel 432, 1988
    https://www.princeton.edu/~rblee/ELE572Papers/Fall04Readings/I432.pdf

    From the link::
    The 432’s procedure calls are quite costly. A typical procedure call
    requires 16 read accesses to memory and 24 write accesses, and it
    consumes 982 machine cycles. In terms of machine cycles, this makes
    it about ten times as slow as a call on the MC68010 or VAX 11/780.

    almost 1000 cycles just to call a subroutine !!!

    Lots of things the architects got wrong in there.....


    Bob Colwell, one of the authors of the third paper, later joined
    Intel as a senior architect and was involved in the development of the
    P6 core used in the Pentium Pro, Pentium II, and Pentium III
    microprocessors, and designs derived from it are used in the Pentium M,
    Core Duo and Core Solo, and Core 2.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Tue Feb 4 18:16:31 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    There are lots of potentially unaligned loads and stores. There are
    very few actually unaligned loads and stores: On Linux-Alpha every
    unaligned access is logged by default, and the number of
    unaligned-access entries in the logs of our machines was relatively
    small (on average a few per day). So trapping actual unaligned
    accesses was faster than replacing potential unaligned accesses with
    code sequences that synthesize the unaligned access from aligned
    accesses.

    Of course, if the cost of unaligned accesses is that high, you will
    avoid them in cases like block copies where cheap unaligned accesses
    would otherwise be beneficial.

    - anton

    That is fine for code that is being actively maintained and backward
    data structure compatibility is not required (like those inside a kernel).

    That is the experience on Linux-Alpha, which ran user-level code which
    had, for the most part, already been ported to, e.g., SPARC with
    trapping on actual unaligned access. These days, with basically all
    available hardware of the last decade supporting unaligned accesses,
    the experience might be different.

    However, for x86 there were a few billion lines of legacy code that likely
    assumed 2-byte alignment, or followed the fp64-aligned-to-32-bits advice,

    That's not advice, that's the Intel IA-32 ABI. If you lay out your
    structures differently, they will not work with the libraries.

    and a C language that mandates structs be laid out in memory exactly as
    specified (no automatic struct optimization).

    The C language mandates that the order of the fields is as specified,
    and that the same sequence of field types leads to the same layout,
    but otherwise does not mandate a layout. In particular, competent
    ABIs (i.e., not Intel's IA-32 ABI) mandate layouts that result in
    natural alignment of basic types.
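
    As an illustration (a sketch of my own, not code from the thread): the
    same struct declaration keeps its field order everywhere, but the
    offsets and padding are the ABI's choice. The offsets in the comment
    are, as I recall, what the i386 and x86-64 System V ABIs specify.

    #include <stdio.h>
    #include <stddef.h>

    /* Same sequence of field types everywhere; the layout is ABI-defined. */
    struct s { char c; double d; };

    int main(void)
    {
        /* x86-64 System V (natural alignment): offsetof(d) = 8, sizeof = 16.
           i386 System V (double only 4-byte aligned): offsetof(d) = 4,
           sizeof = 12. */
        printf("offsetof(d) = %zu, sizeof = %zu\n",
               offsetof(struct s, d), sizeof(struct s));
        return 0;
    }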

    Also I seem to recall some
    amount of squawking about SIMD when it required naturally aligned buffers.

    SSE does not require natural alignment wrt. basic types, but the
    load-and-op instructions require 16-byte alignment. That's another
    idiocy on Intel's part. If you have

    for (i=0; i<n; i++)
    a[i] = b[i] + c[i];

    that's easy to vectorize if you have support for basic-type-aligned or unaligned accesses. But a, b, and c may all have different start
    addresses mod 16, so you cannot use Intel's 16-byte-aligned memory
    accesses for vectorizing that. Fortunately, they were not completely
    stupid and included unaligned-load and unaligned-store instructions,
    so if you use those, and forget about the load-and-operate
    instructions, SSE is useable.
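
    A minimal C sketch of that approach (the function name is mine, not
    from the thread): the loop is vectorized with the SSE unaligned
    load/store intrinsics (movups), so a, b, and c may start at any
    address mod 16, and the load-and-operate forms are simply not used.

    #include <stddef.h>
    #include <xmmintrin.h>   /* SSE */

    void add_f32(float *a, const float *b, const float *c, size_t n)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 vb = _mm_loadu_ps(b + i);           /* unaligned load  */
            __m128 vc = _mm_loadu_ps(c + i);
            _mm_storeu_ps(a + i, _mm_add_ps(vb, vc));  /* unaligned store */
        }
        for (; i < n; i++)      /* scalar tail */
            a[i] = b[i] + c[i];
    }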

    AMD has added a flag that turns off this Intel stupidity (if the flag
    is set, all SSE memory accesses support unaligned accesses), but Intel
    is stubborn and does not support this flag to this day; and they are
    the manufacturer that sells CPUs without AVX/AVX2 to this day (unlike
    AMD, which has supported AVX2 on all CPUs they sell for a long time).

    As SIMD no longer requires alignment, presumably code no longer does so.

    Yes, if you use AVX/AVX2, you don't encounter this particular Intel
    stupidity.

    Also in going from 32 to 64 bits, data structures that contain pointers
    now could find those 8-byte pointers aligned on 4-byte boundaries.

    What you write does not make sense. RAM data structures are laid out
    according to the ABI, which is different for different architectures,
    and typically requires natural alignment for basic data types; no
    unaligned accesses from backwards compatibility here. Wire or on-disk
    data structures are laid out according to the specification of the
    protocol or file system, which may include basic data types that are
    not aligned according to natural alignment (e.g., because there is a
    prefix on the wire); these do not contain pointers, and even if they
    contain some kind of reference (e.g., block numbers or inode numbers),
    the sizes are fixed across architectures.
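
    For such wire or on-disk formats, the usual portable way to read an
    unaligned field is the same memcpy idiom as in the uload() example at
    the top of the thread; a minimal sketch (names are illustrative):

    #include <string.h>
    #include <stdint.h>

    /* Read a possibly unaligned 32-bit field out of a wire/on-disk buffer.
       The compiler turns the memcpy into a plain load on targets where
       unaligned loads are fine. */
    static uint32_t read_u32(const unsigned char *buf, size_t off)
    {
        uint32_t x;
        memcpy(&x, buf + off, sizeof x);
        return x;   /* byte order still needs handling per the format spec */
    }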

    While the Linux kernel may not use many misaligned values,
    I'd guess there is a lot of application code that does.

    The reports about unaligned accesses in the logs were associated with
    user-level code (I dimly remember gs occurring in the log), not kernel
    code.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Scott Lurndal on Tue Feb 4 11:34:43 2025
    On 2/4/2025 11:25 AM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 4 Feb 2025 4:49:57 +0000, EricP wrote:

    MitchAlsup1 wrote:

    Basically, VAX taught us why we did not want to do "all that" in
    a single instruction; while Intel 432 taught us why we did not bit
    aligned decoders (and a lot of other things).

    In case people are interested...

    [paywalled]
    The Instruction Decoding Unit for the VLSI 432 General Data Processor,
    1981
    https://ieeexplore.ieee.org/abstract/document/1051633/

    The benchmarks in table 1(a) below tell it all:
    a 4 MHz 432 is 1/15 to 1/20 the speed of a 5 MHz VAX/780, and
    1/4 to 1/7 the speed of an 8 MHz 68000 or 5 MHz 8086.

    A Performance Evaluation of The Intel iAPX 432, 1982
    https://dl.acm.org/doi/pdf/10.1145/641542.641545

    And the reasons are covered here:

    Performance Effects of Architectural Complexity in the Intel 432, 1988
    https://www.princeton.edu/~rblee/ELE572Papers/Fall04Readings/I432.pdf

    From the link::
    The 432’s procedure calls are quite costly. A typical procedure call
    requires 16 read accesses to memory and 24 write accesses, and it
    consumes 982 machine cycles. In terms of machine cycles, this makes
    it about ten times as slow as a call on the MC68010 or VAX 11/780.

    almost 1000 cycles just to call a subroutine !!!

    Lots of things the architects got wrong in there.....

    While true, it's easy to say in retrospect after forty+
    years of advancements in silicon design and technology.

    Comparing to the CISC architectures of the 60s and 70s,
    it's not horrible.

    Well, of course it depends upon what exactly the 432's Call instruction
    did. For the two 1960s-70s architectures I am most familiar with,
    transferring control to another instruction and saving the return
    address took only a small single digit number of cycles.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Feb 4 19:25:15 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 4 Feb 2025 4:49:57 +0000, EricP wrote:

    MitchAlsup1 wrote:

    Basically, VAX taught us why we did not want to do "all that" in
    a single instruction; while Intel 432 taught us why we did not bit
    aligned decoders (and a lot of other things).

    In case people are interested...

    [paywalled]
    The Instruction Decoding Unit for the VLSI 432 General Data Processor,
    1981
    https://ieeexplore.ieee.org/abstract/document/1051633/

    The benchmarks in table 1(a) below tell it all:
    a 4 MHz 432 is 1/15 to 1/20 the speed of a 5 MHz VAX/780, and
    1/4 to 1/7 the speed of an 8 MHz 68000 or 5 MHz 8086.

    A Performance Evaluation of The Intel iAPX 432, 1982
    https://dl.acm.org/doi/pdf/10.1145/641542.641545

    And the reasons are covered here:

    Performance Effects of Architectural Complexity in the Intel 432, 1988
    https://www.princeton.edu/~rblee/ELE572Papers/Fall04Readings/I432.pdf

    From the link::
    The 432’s procedure calls are quite costly. A typical procedure call
    requires 16 read accesses to memory and 24 write accesses, and it
    consumes 982 machine cycles. In terms of machine cycles, this makes
    it about ten times as slow as a call on the MC68010 or VAX 11/780.

    almost 1000 cycles just to call a subroutine !!!

    Lots of things the architects got wrong in there.....

    While true, it's easy to say in retrospect after forty+
    years of advancements in silicon design and technology.

    Comparing to the CISC architectures of the 60s and 70s,
    it's not horrible.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Tue Feb 4 20:02:05 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    http://gcc.gnu.org/bugzilla is your friend.

    In my experience it's a waste of time:

    Well, if you think your time is better spent griping on USENET than
    actually trying to accomplish something useful, that's your call.

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285

    A disagreement, I've had a few of those.

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93765

    That is stuck in WAITING.

    You know that, as the submitter, you can set the bug to NEW?
    You can also ping bugs to get them back on the radar.
    Or you can put one of the Power maintainers in CC, for example
    Segher.

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93811

    Segher is in CC; you might want to ping that particular PR.

    But yes, sometimes bugs don't get fixed for a looong time.
    I can beat you for older bugs; of the 112 bugs that I submitted
    that are still open, the oldest one is from April 2005
    (PR 21046). Do I stop submitting bugs? No, the most
    recent one is PR 118743, from today.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Feb 4 20:48:55 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    mitchalsup@aol.com (MitchAlsup1) writes:

    almost 1000 cycles just to call a subroutine !!!

    Lots of things the architects got wrong in there.....

    While true, it's easy to say in retrospect after forty+
    years of advancements in silicon design and technology.

    Comparing to the CISC architectures of the 60s and 70s,
    it's not horrible.

    Was there any other CISC architecture that even came close?
    Almost 1000 cycles really sounds excessive.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to All on Tue Feb 4 20:45:00 2025
    In article <vnrrmg$2adb$1@gal.iecc.com>, johnl@taugh.com (John Levine)
    wrote:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    while Intel 432 taught us why we did not bit
    aligned decoders (and a lot of other things).
    It was certainly an interesting experiment in yet another way that
    Intel wanted programmers to use their computers and the programmers
    said, naah.

    It didn't get that far. There were no low-cost i432 systems, so the
    ingenious software developers of the early 1980s carried on using more conventional microprocessors.

    The DoD wanted Ada, but the new software companies of the period weren't especially interested in selling to them. Making money in the civilian
    business software and games markets was far easier and more fun.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Feb 4 21:58:05 2025
    On Tue, 4 Feb 2025 10:09:09 +0000, Anton Ertl wrote:

    something along the lines of:

    uload:
    // addi a5,a0,7 // unnecessary
    andi a4,a0,-8
    // andi a5,a5,-8 // unnecessary
    ld a2,8(a4) // load higher
    ld a3,0(a4) // load lower
    neg a4,a0
    andi a4,a4,7
    andi a0,a0,7
    slliw a4,a4,3
    slliw a5,a0,3
    sll a5,a3,a5
    sra a0,a2,a4
    or a0,a0,a5
    ret

    uloadD:
    and r2,[r1],#-7
    ldd r3,[r2,0]
    ldd r4,[r2,8]
    srl r5,r1,<3,0> // r1[2..0]
    sll r5,r5,#3
    carry r3,{{o}{i}}
    sll r3,r3,r5
    sll r1,r4,r5
    ret

    Fewer instructions, and also a better distribution between various
    functional units.

    IIRC it's only three instructions on MIPS and five instructions on
    Alpha, but they have special instructions for this case, because they
    were designed for it, whereas RISC-V was designed to have unaligned
    accesses.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Feb 4 22:17:33 2025
    On Tue, 4 Feb 2025 20:49:14 +0000, BGB wrote:

    On 2/4/2025 1:25 PM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    -------------------
    Comparing to the CISC architectures of the 60s and 70s,
    it's not horrible.


    Well, vs a modern RISC style ISA, say, caller side:
    MOV R20, R10 //0c (SSC with following)
    MOV R21, R11 //1c
    BSR func //2c (typically)
    Cost: 3 cycles.

    MOV R1,R30
    MOV R2,R28
    CALL func

    3 instructions, might be 1 cycle on a 3-wide machine. And when
    BSR/CALL is visible at FETCH 2 cycles before it DECODEs, the
    call overhead is 0 cycles.

    func:
    ADD SP, -32, SP //2c (1 c penalty)
    MOV.Q LR, (SP, 24) //1c
    MOV.X R18, (SP, 0) //1c
    ...
    MOV.Q (SP, 24), LR //2c (1c penalty)
    MOV.X (SP, 0), R18 //1c
    JMP LR //10c (*1)

    *1: Insufficient delay since LR reload, so branch predictor fails to
    handle this case.

    This should be call/return predicted "just fine".
    It should not be indirect predictor predicted.

    Cost: 16 cycles.

    func:
    ENTER R30,R1,#32
    ...
    EXIT R30,R1,#32

    9 instructions on your machine, 5 on mine; also note: my ISA loads
    the return address directly into IP so FETCH can begin while the
    other LDs are in progress:: So, for the same amount of work, it
    would take only 3 cycles (with a bunch of caveats).

    But in any event, these are down about as low as one can expect,
    whereas the 432 is close to 1000 cycles; we all complained about the VAX
    when it was in the 20-30 cycle range of overhead.

    as to why:: 432 changed the capabilities maps at call and return,
    and since these were not cached,... caller cannot see some of the
    capabilities called has access to, and vice versa. With a lot bet-
    ter caching of capabilities and modern bus widths, 432 might only
    be in the 40-50 cycle range of overhead.

    Moral:: Do not do way more work than required.

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to John Dallman on Wed Feb 5 16:42:44 2025
    On Tue, 4 Feb 2025 20:45 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <vnrrmg$2adb$1@gal.iecc.com>, johnl@taugh.com (John Levine)
    wrote:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    while Intel 432 taught us why we did not bit
    aligned decoders (and a lot of other things).
    It was certainly an interesting experiment in yet another way that
    Intel wanted programmers to use their computers and the programmers
    said, naah.

    It didn't get that far. There were no low-cost i432 systems, so the
    ingenious software developers of the early 1980s carried on using more conventional microprocessors.


    Do you mean that there were high-cost i432 systems?
    I can't find anything in Wikipedia, but would guess that all
    programmers/organizations that had access to i432 hardware did not pay
    money for it.
    Not dissimilar to Merced 17-18 years later, except that the number of
    systems given away in the early 80s was probably 3 orders of
    magnitude lower than in the late 90s.
    Just speculating...

    The DoD wanted Ada, but the new software companies of the period
    weren't especially interested in selling to them. Making money in the civilian business software and games markets was far easier and more
    fun.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Wed Feb 5 18:10:03 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    As SIMD no longer requires alignment, presumably code no longer does so.

    Yes, if you use AVX/AVX2, you don't encounter this particular Intel stupidity.

    Recently, on the last day (Dec 25th) of Advent of Code, I had a problem
    which lent itself to using 32-bit bitmaps: The task was to check which
    locks were compatible with which keys, so I ended up with code like this:


    let mut part1 = 0;
    for l in li..keylocks.len() {
    let lock = keylocks[l];
    for k in 0..li {
    let sum = lock & keylocks[k];
    if sum == 0 {
    part1 += 1;
    }
    }
    }

    Telling the rust compiler to target my AVX2-capable laptop CPU (an Intel
    i7), I got code that simply amazed me: The compiler unrolled the inner
    loop by 32, ANDing 4 x 8 keys by 8 copies of the current lock into 4 AVX registers (vpand), then comparing with a zeroed register (vpcmpeqd)
    (generating -1/0 results) before subtracting (vpsubd) those from 4 accumulators.

    This resulted in just 12 instructions to handle 32 tests.

    The final code, with zero unsafe/asm/intrinsics, took 5.8 microseconds
    to run all the needed parsing/setup/initialization and then test 62500 combinations, so just 93 ps per key/lock test!

    There was no attempt to check for 32-byte alignment, it all just worked. :-)

    The task is of course embarrassingly parallelizable, but I suspect the
    overhead of starting 4 or 8 threads will be higher than what I would
    save? I guess I'll have to test!

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Wed Feb 5 17:48:30 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    for k in 0..li {
    let sum = lock & keylocks[k];
    if sum == 0 {
    part1 += 1;
    }
    }

    Does Rust only have this roundabout way to express this sequentially?
    In Forth I would express that scalarly as

    ( part1 ) li 0 do
    keylocks i th @ lock and 0= - loop

    ["-" because 0= produces all-bits-set (-1) for true]

    or in C as

    for (k=0; k<li; k++)
    part1 += (lock & keylocks[k])==0;

    which I find much easier to follow. I also expected 0..li to include
    li (based on, I guess, the use of .. in Pascal and its descendants), but
    the net tells me that it does not (starting with 0 was the hint that
    made me check my expectations).

    Telling the rust compiler to target my AVX2-capable laptop CPU (an Intel
    i7)

    I find it deplorable that even knowledgeable people use marketing
    labels like "i7" which do not tell anything technical (and very little non-technical) rather than specifying the full model number (e.g, Core i7-1270P) or the design (e.g., Alder Lake). But in the present case "AVX2-capable CPU" is enough information.

    I got code that simply amazed me: The compiler unrolled the inner
    loop by 32, ANDing 4 x 8 keys by 8 copies of the current lock into 4 AVX
    registers (vpand), then comparing with a zeroed register (vpcmpeqd)
    (generating -1/0 results) before subtracting (vpsubd) those from 4
    accumulators.

    If you have ever learned about vectorization, it's easy to see that
    the inner loop can be vectorized. And obviously auto-vectorization
    has worked in this case, not particularly amazing to me.

    But if you have learned about vectorization, you will find that you
    will see ways to vectorize code, but that many programming languages
    don't offer ways to express the vectorization directly. Instead, you
    write the code as scalar code and hope that the auto-vectorizer
    actually vectorizes it. If it does not, there is no indication how
    you can get the compiler to auto-vectorize.

    Even for Fortran, where the array sublanguage has vector semantics
    within expressions (maybe somebody can show code for the example
    above), Thomas Koenig tells us that his gcc front end produces scalar
    IR code from that and then relies on auto-vectorization to undo the scalarization.

    There was no attempt to check for 32-byte alignment, it all just worked. :-)

    When I try this stuff with gcc and it actually succeeds at
    auto-vectorization, the result tends to be very long, and it's also
    the case here:

    For:

    unsigned long inner(unsigned long li, unsigned lock, unsigned keylocks[], unsigned long part1)
    {
    unsigned long k;
    for (k=0; k<li; k++)
    part1 += (lock & keylocks[k])==0;
    return part1;
    }

    gcc -Wall -O3 -mavx2 -c x.c && objdump -d x.o

    produces 109 lines of disassembly output (which I will spare you),
    with a total length of 394 bytes. When I ask for AVX-512 with

    gcc -Wall -O3 -march=x86-64-v4 -c x.c && objdump -d x.o

    it's even worse: 139 lines and 538 bytes. My impression is that gcc
    tries to align the main loop to 32-byte (for AVX2) or 64-byte
    boundaries and generates lots of code around the main loop in order to
    get there.

    Which somewhat leads us back to the topic of the thread. I wonder if
    the alignment really helps for this loop, if so, how much, and how
    many iterations are necessary to amortize the overhead. But I am too
    lazy to measure it.

    clang is somewhat better:

    For the avx2 case, 70 lines and 250 bytes.
    For the x86-64-v4 case, 111 lines and 435 bytes.

    The versions used are gcc-12.2.0 and clang-14.0.6.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Wed Feb 5 20:26:18 2025
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    for k in 0..li {
    let sum = lock & keylocks[k];
    if sum == 0 {
    part1 += 1;
    }
    }

    Does Rust only have this roundabout way to express this sequentially?
    In Forth I would express that scalarly as

    ( part1 ) li 0 do
    keylocks i th @ lock and 0= - loop

    ["-" because 0= produces all-bits-set (-1) for true]

    or in C as

    for (k=0; k<li; k++)
    part1 += (lock & keylocks[k])==0;

    I could have written it as
    part1 += ((lock & keylocks[k]) == 0) as u32;

    I.e., just like C except all casting has to be explicit, and here the
    boolean result of the '==' test needs to be expanded into a u32.


    which I find much easier to follow. I also expected 0..li to include
    li (based on, I guess, the use of .. in Pascal and its descendants), but
    the net tells me that it does not (starting with 0 was the hint that
    made me check my expectations).

    :-)

    It is similar to "for (k=0;k<li;k++) {}", so an exclusive right limit feels natural.


    Telling the rust compiler to target my AVX2-capable laptop CPU (an Intel
    i7)

    I find it deplorable that even knowledgeable people use marketing
    labels like "i7" which do not tell anything technical (and very little non-technical) rather than specifying the full model number (e.g, Core i7-1270P) or the design (e.g., Alder Lake). But in the present case "AVX2-capable CPU" is enough information.

    I got code that simply amazed me: The compiler unrolled the inner
    loop by 32, ANDing 4 x 8 keys by 8 copies of the current lock into 4 AVX
    registers (vpand), then comparing with a zeroed register (vpcmpeqd)
    (generating -1/0 results) before subtracting (vpsubd) those from 4
    accumulators.

    If you have ever learned about vectorization, it's easy to see that
    the inner loop can be vectorized. And obviously auto-vectorization
    has worked in this case, not particularly amazing to me.

    I have some (30 years?) experience with auto-vectorization; usually I've
    been (very?) disappointed. As I wrote, this was the best I have ever
    seen, and the resulting code actually performed extremely close to the
    theoretical speed of light, i.e., 3 clock cycles for every 3 AVX
    instructions.

    [snip]

    clang is somewhat better:

    For the avx2 case, 70 lines and 250 bytes.
    For the x86-64-v4 case, 111 lines and 435 bytes.

    Rustc sits on top of the clang infrastructure; even with that 32-way
    unroll it was quite compact. I did not count, but your 70 lines seem to
    be in the ballpark.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Michael S on Wed Feb 5 20:08:00 2025
    In article <20250205164244.00004d42@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    Do you mean that there were high-cost i432 systems?

    Not that I know of.

    Not dissimilar to Merced 17-18 years later, except that the number of
    systems given away in the early 80s was probably 3 orders of
    magnitude lower than in the late 90s.

    About 15,000 Merced systems got given away, so that's plausible.


    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Thu Feb 6 00:01:43 2025
    On Wed, 5 Feb 2025 18:10:03 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    As SIMD no longer requires alignment, presumably code no longer
    does so.

    Yes, if you use AVX/AVX2, you don't encounter this particular Intel stupidity.

    Recently, on the last day (Dec 25th) of Advent of Code, I had a
    problem which lent itself to using 32-bit bitmaps: The task was to
    check which locks were compatible with which keys, so I ended up with
    code like this:


    let mut part1 = 0;
    for l in li..keylocks.len() {
    let lock = keylocks[l];
    for k in 0..li {
    let sum = lock & keylocks[k];
    if sum == 0 {
    part1 += 1;
    }
    }
    }

    Telling the rust compiler to target my AVX2-capable laptop CPU (an
    Intel i7), I got code that simply amazed me: The compiler unrolled
    the inner loop by 32, ANDing 4 x 8 keys by 8 copies of the current
    lock into 4 AVX registers (vpand), then comparing with a zeroed
    register (vpcmpeqd) (generating -1/0 results) before subtracting
    (vpsubd) those from 4 accumulators.

    This resulted in just 12 instructions to handle 32 tests.


    That sounds suboptimal.
    By unrolling the outer loop by 2 or 3 you can greatly reduce the number of
    memory accesses per comparison. The speedup would depend on the specific
    microarchitecture, but I would guess that at least a 1.2x speedup is
    available here, especially when the data is not aligned.

    The final code, with zero unsafe/asm/intrinsics, took 5.8
    microseconds to run all the needed parsing/setup/initialization and
    then test 62500 combinations, so just 93 ps per key/lock test!

    There was no attempt to check for 32-byte algnment, it all just
    worked. :-)

    The task is of course embarrassingly parallelizable, but I suspect
    the overhead of starting 4 or 8 threads will be higher than what I
    would save? I guess I'll have to test!

    Terje



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to EricP on Wed Feb 5 23:36:53 2025
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    While the Linux kernel may not use many misaligned values,
    I'd guess there is a lot of application code that does.

    I guess that much of that is simply "by accident", because
    without alignment checks in hardware misalignment may happen
    and nobody notices that there is a small performance problem.

    I worked on a low-level program, and reasonably recently I got a
    bunch of alignment errors. On AMD64 they were due to SSE
    instructions used by 'memcpy', on 32-bit ARM due to the use of double
    precision floating point in 'memcpy'. It took some time to find
    them; most things simply worked even without alignment and the
    offending cases were hard to trigger.

    My personal feeling is that the best machine would have aligned
    accesses with checks by default, but also special instructions
    for unaligned access. That way code that does not need
    unaligned access gets extra error checking, while code that
    uses unaligned access pays a modest, essentially unavoidable
    penalty.

    Of course, once an architecture officially supports unaligned
    access, there will be binaries depending on it, and backward
    compatibility will prevent a change to require alignment.

    Concerning SIMD: the trouble here is increasing vector length and
    consequently increasing alignment requirements. A lot of SIMD
    code is memory-bound, and the current way of doing misaligned
    accesses leads to worse performance. So there is really no good way
    to solve this. In principle, a set of buffers of 2 cache lines
    each and appropriate shifters could give optimal throughput,
    but would probably lead to increased latency.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Thu Feb 6 08:52:42 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    For:

    unsigned long inner(unsigned long li, unsigned lock, unsigned keylocks[], unsigned long part1)
    {
    unsigned long k;
    for (k=0; k<li; k++)
    part1 += (lock & keylocks[k])==0;
    return part1;
    }

    gcc -Wall -O3 -mavx2 -c x.c && objdump -d x.o

    produces 109 lines of disassembly output (which I will spare you),
    with a total length of 394 bytes.
    ...
    clang is somewhat better:

    For the avx2 case, 70 lines and 250 bytes.

    I have now taken a closer look at the generated code. Let's first
    consider the clang code:

    The inner loop works on 16 elements of keylocks[] per iteration, 4 per
    SIMD instruction. It uses AVX-128 instructions (with 32-bit elements)
    for the (lock & keylocks[k])==0 part, then converts that to 64-bit
    elements and continues with AVX-256 instructions for the part1+= part;
    it does that by zero-extending the elements, anding them with 1, and
    adding them to the accumulators. It would have saved instructions to
    sign-extend the elements and subtract them from the accumulators.

    Before the inner loop there is just setup, with no attempt to align
    the input. After the inner loop, the results in the 4 ymm registers
    are summed up (with maximal latency), then the 4 results inside the
    ymm register are summed up, and finally a scalar loop counts the
    remaining items.

    Now gcc:

    The inner loop works on 8 elements of keylocks[] per iteration,
    initially with 8 32-bit elements per SIMD instruction (AVX-256), later
    with 4 64-bit elements per SIMD instruction. It ands, then
    zero-extends (so it needs only one and instead of two for anding after
    the extension), ands and adds; it then adds the two result ymm
    registers to one ymm accumulator (in a latency-minimizing way).

    Again, there is no alignment code before the inner loop, just setup.
    The elements of the ymm accumulator are summed up into one result.
    Then, if there are 4 elements or more remaining, it performs the setup
    and SIMD code for working on 4 elements with AVX-128 code, including
    after the width expansion; so it uses two AVX-128 instructions for
    performing the zero-extension rather than one AVX-256 instruction.
    It also sets up all the constants in the xmm registers (again, in case
    of coming from the inner loop); it finally sums up the result and adds
    it to the result of the inner loop. Finally, the last 0-3 iterations
    are performed in a scalar loop.


    The gcc-12 code is more sophisticated in several respects than the
    clang-14 code, but also has several obvious ways to improve it: it
    could use sign-extension and subtraction (clang, too); the handling of
    the final 4-7 items in gcc could work with 4-element-SIMD throughout
    and reuse the contents of the registers that hold constants, and the
    result could be added to the SIMD accumulator, summing that up only
    afterwards. Maybe the gcc maintainers have performed some of these improvements in the meantime.


    Given the description of Terje Mathisen, my guess is that part1 and li
    have 32-bit types, so the C equivalent would be:

    unsigned inner(unsigned li, unsigned lock, unsigned keylocks[], unsigned part1) {
    unsigned k;
    for (k=0; k<li; k++)
    part1 += (lock & keylocks[k])==0;
    return part1;
    }

    With that I see shorter code from both gcc and clang, unsurprising
    given the complications that fall away.
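
    For reference, the core of the AVX2 code for this 32-bit variant is
    the compare-and-subtract idiom: vpcmpeqd yields -1 in lanes where
    (lock & key)==0, and subtracting that from a 32-bit accumulator counts
    the matches without any widening. A rough intrinsics sketch of that
    idiom (mine, not the compiler output):

    #include <immintrin.h>

    /* Process 8 keys: acc gains 1 in every lane where (lock & key) == 0. */
    static __m256i count8(__m256i acc, __m256i vlock, const unsigned *keys)
    {
        __m256i v  = _mm256_and_si256(vlock,
                         _mm256_loadu_si256((const __m256i *)keys));
        __m256i eq = _mm256_cmpeq_epi32(v, _mm256_setzero_si256()); /* -1/0 */
        return _mm256_sub_epi32(acc, eq);   /* acc - (-1) == acc + 1 */
    }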


    I looked also at the code that the compilers produce with
    -march=x86-64-v4 (which includes AVX-512) for the latter function.
    clang generates code that uses neither (512-bit) zmm registers nor the
    k (predicate) registers. It is different from and longer than the
    -mavx2 code, though.

    gcc uses both zmm and k registers. The inner loop is quite small:

    40: 62 f1 65 48 db 08 vpandd (%rax),%zmm3,%zmm1
    46: 48 83 c0 40 add $0x40,%rax
    4a: 62 f2 76 48 27 c9 vptestnmd %zmm1,%zmm1,%k1
    50: 62 f1 7d c9 6f ca vmovdqa32 %zmm2,%zmm1{%k1}{z}
    56: 62 f1 7d 48 fa c1 vpsubd %zmm1,%zmm0,%zmm0
    5c: 48 39 c2 cmp %rax,%rdx
    5f: 75 df jne 40 <inner+0x40>

    It works on 16 elements per iteration. It uses vptestnmd and
    vmovdqa32 to set zmm1 to -1 (zmm2 is set to -1 in the setup) or 0
    (using the {z} option); it's unclear to me why it does not use
    vpcmpeqd, which would have produced the same result in one
    instruction. Without the intervening extension, the compiler is able
    to use a subtraction of -1.

    The inner loop is preceded by setup (again, no alignment). It is
    followed by summing the results and adding them to a 32-bit
    accumulator, then an optional use of AVX-256 for 8 elements (and
    summing the results and adding them to the accumulator), and finally by
    a completely unrolled scalar loop for the last 0-7 elements.

    What I would try is to use predication and AVX-512 for the last 1-15
    (or maybe 1-16) elements, and only then sum up the results and add it
    to part1. I wonder why the gcc-12 developers did not make use of that possibility (and looking at gcc-14.2 with godbolt, the code for that
    apparently is still pretty similar; the clang-19 code has become
    shorter, but still does not use zmm or k registers).
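
    A rough intrinsics sketch of that masked-tail idea (my own, with an
    illustrative function name; assumes AVX-512F): one predicated
    iteration covers the last 1-15 elements, and the load mask is reused
    in the test so the zeroed padding lanes are not counted as matches.

    #include <immintrin.h>

    static unsigned tail_avx512(const unsigned *keylocks, unsigned lock,
                                unsigned r /* 1..15 */, unsigned part1)
    {
        __mmask16 live  = (__mmask16)((1u << r) - 1);   /* real lanes      */
        __m512i   vlock = _mm512_set1_epi32((int)lock);
        /* Masked load: lanes beyond r are zeroed and cannot fault. */
        __m512i   v     = _mm512_maskz_loadu_epi32(live, keylocks);
        /* Lanes where (v & vlock) == 0, restricted to the live lanes. */
        __mmask16 zero  = _mm512_mask_testn_epi32_mask(live, v, vlock);
        return part1 + (unsigned)_mm_popcnt_u32((unsigned)zero);
    }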

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Thu Feb 6 10:30:49 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    If you have ever learned about vectorization, it's easy to see that
    the inner loop can be vectorized. And obviously auto-vectorization
    has worked in this case, not particularly amazing to me.

    I have some (30 years?) experience with auto-vectorization, usually I've
    been (very?) disappointed.

    I have often been disappointed, too.

    As I wrote, this was the best I have ever
    seen, and the resulting code actually performed extremely close to the
    theoretical speed of light, i.e., 3 clock cycles for every 3 AVX
    instructions.

    What theory is behind that? If we take the inner loop (from my
    version that uses unsigned, not unsigned long) by clang-14.0.6 -O3
    -mavx2:

    50: c5 f5 db 34 0a vpand (%rdx,%rcx,1),%ymm1,%ymm6
    55: c5 f5 db 7c 0a 20 vpand 0x20(%rdx,%rcx,1),%ymm1,%ymm7
    5b: c5 75 db 44 0a 40 vpand 0x40(%rdx,%rcx,1),%ymm1,%ymm8
    61: c5 75 db 4c 0a 60 vpand 0x60(%rdx,%rcx,1),%ymm1,%ymm9
    67: c5 cd 76 f2 vpcmpeqd %ymm2,%ymm6,%ymm6
    6b: c5 fd fa c6 vpsubd %ymm6,%ymm0,%ymm0
    6f: c5 c5 76 f2 vpcmpeqd %ymm2,%ymm7,%ymm6
    73: c5 e5 fa de vpsubd %ymm6,%ymm3,%ymm3
    77: c5 bd 76 f2 vpcmpeqd %ymm2,%ymm8,%ymm6
    7b: c5 dd fa e6 vpsubd %ymm6,%ymm4,%ymm4
    7f: c5 b5 76 f2 vpcmpeqd %ymm2,%ymm9,%ymm6
    83: c5 d5 fa ee vpsubd %ymm6,%ymm5,%ymm5
    87: 48 83 e9 80 sub $0xffffffffffffff80,%rcx
    8b: 48 39 c8 cmp %rcx,%rax
    8e: 75 c0 jne 50 <inner+0x50>

    I see that clang uses 4 ymm accumulators (ymm0, ymm3, ymm4, ymm5), so
    the recurrences are only 1-cycle recurrences for the 4 vpsubd
    instructions and the sub instruction. So, with enough resources, a
    CPU core could perform 1 iteration per cycle (and with hardware
    reassociation, even faster, but apart from adding constants in Alder
    Lake ff., we are not there yet). But current CPUs do not have that
    many resources. If we want to determine the maximum speed given
    resource limits, we have to look at the concrete CPU model.

    BTW, an alternative would be to do some summation already in each
    iteration, but with still only a one-cycle recurrence. This could
    have looked like this:

    vpand (%rdx,%rcx,1),%ymm1,%ymm6
    vpand 0x20(%rdx,%rcx,1),%ymm1,%ymm7
    vpand 0x40(%rdx,%rcx,1),%ymm1,%ymm8
    vpand 0x60(%rdx,%rcx,1),%ymm1,%ymm9
    vpcmpeqd %ymm2,%ymm6,%ymm6
    vpcmpeqd %ymm2,%ymm7,%ymm7
    vpaddd %ymm6,%ymm7,%ymm7
    vpcmpeqd %ymm2,%ymm8,%ymm8
    vpcmpeqd %ymm2,%ymm9,%ymm9
    vpaddd %ymm8,%ymm9,%ymm9
    vpaddd %ymm7,%ymm9,%ymm9
    vpsubd %ymm0,%ymm9,%ymm0
    sub $0xffffffffffffff80,%rcx
    cmp %rcx,%rax
    jne 50 <inner+0x50>

    Here the SIMD recurrence uses ymm0. This saves having to perform the
    summing up of SIMD accumulators after the inner loop. gcc uses
    something like that in one of the codes I looked at.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Thu Feb 6 10:59:39 2025
    Michael S <already5chosen@yahoo.com> writes:
    This resulted in just 12 instructions to handle 32 tests.


    That sounds suboptimal.
    By unrolling outer loop by 2 or 3 you can greatly reduce the number of
    memory accesses per comparison.

    Looking at the inner loop code shown in <2025Feb6.113049@mips.complang.tuwien.ac.at>, the 12 instructions do
    not include the loop overhead and are already unrolled by a factor of
    4 (32 for the scalar code). The loop overhead is 3 instructions, for
    a total of 15 instructions per iteration.

    The speed up would depend on specific
    microarchiture, but I would guess that at least 1.2x speedup is here.

    Even if you completely eliminate the loop overhead, the number of
    instructions is reduced by at most a factor 1.25, and I expect that
    the speedup from further unrolling is a factor of at most 1 on most
    CPUs (factor <1 can come from handling the remaining elements slowly,
    which does not seem unlikely for code coming out of gcc and clang).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Thu Feb 6 11:21:24 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Concerning SIMD: trouble here is increasing vector length and
    consequently increasing alignment requirements.

    That is not a necessary consequence, on the contrary: alignment
    requirements based on SIMD granularity is hardware designer lazyness,
    but means that SIMD cannot be used for many of the applications where
    SIMD without that limitation can be used.

    If you want to have alignment checks, then a SIMD instruction should
    check for element alignment, not for SIMD alignment.

    But the computer architecture trend is clear: General-purpose
    computers do not have alignment restrictions; all that had them have
    been discontinued; the last one that had them was SPARC.

    A lot of SIMD
    code is memory-bound and current way of doing misaligned
    access leads to worse performance. So really no good way
    to solve this. In principle set of buffers for 2 cache lines
    each and appropriate shifters could give optimal troughput,
    but probably would lead to increased latency.

    AFAIK that's what current microarchitectures do, and in many cases
    with small penalties for unaligned accesses; see https://www.complang.tuwien.ac.at/anton/unaligned-stores/

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Waldek Hebisch on Thu Feb 6 13:47:56 2025
    Waldek Hebisch wrote:
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    While the Linux kernel may not use many misaligned values,
    I'd guess there is a lot of application code that does.

    I guess that much of that is simply "by accident" because
    without alignment checks in hardware misalignment may happen
    and nobody notices that there is a small performance problem.

    I worked on a low-level program, and reasonably recently I got a
    bunch of alignment errors. On AMD64 they were due to SSE
    instructions used by 'memcpy', on 32-bit ARM due to use of double
    precision floating point in 'memcpy'. It took some time to find
    them, simply most things worked even without alignment and the
    offending cases were hard to trigger.

    My personal feeling is that best machine would have aligned
    access with checks by default, but also special instructions
    for unaligned access. That way code that does not need
    unaligned access gets extra error checking, while code that
    uses unaligned access pays modest, essentially unavoidable
    penalty.

    Of course, once architecture officially supports unaligned
    access, there will be binaries depending on this and backward
    compatibility will prevent change to require alignment.

    Concerning SIMD: trouble here is increasing vector length and
    consequently increasing alignment requirements. A lot of SIMD
    code is memory-bound and current way of doing misaligned
    access leads to worse performance. So really no good way
    to solve this. In principle set of buffers for 2 cache lines
    each and appropriate shifters could give optimal throughput,
    but probably would lead to increased latency.

    SIMD absolutely requires, as a minimum, the ability to handle data that
    is only aligned according to the internal elements: An array of double
    can start on any address which is 0 mod 8, and similarly for float/u32 etc.
    This way you can go from 128 via 256 to 512 bit SIMD regs with no data
    alignment change.

    From this, and the need to also handle byte arrays, you end up with
    unaligned as the default. The less overhead to handle straddling inputs
    the better.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Thu Feb 6 13:57:12 2025
    Michael S wrote:
    On Wed, 5 Feb 2025 18:10:03 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    As SIMD no longer requires alignment, presumably code no longer
    does so.

    Yes, if you use AVX/AVX2, you don't encounter this particular Intel
    stupidity.

    Recently, on the last day (Dec 25th) of Advent of Code, I had a
    problem which lent itself to using 32-bit bitmaps: The task was to
    check which locks were compatible with which keys, so I ended up with
    code like this:


    let mut part1 = 0;
    for l in li..keylocks.len() {
    let lock = keylocks[l];
    for k in 0..li {
    let sum = lock & keylocks[k];
    if sum == 0 {
    part1 += 1;
    }
    }
    }

    Telling the rust compiler to target my AVX2-capable laptop CPU (an
    Intel i7), I got code that simply amazed me: The compiler unrolled
    the inner loop by 32, ANDing 4 x 8 keys by 8 copies of the current
    lock into 4 AVX registers (vpand), then comparing with a zeroed
    register (vpcmpeqd) (generating -1/0 results) before subtracting
    (vpsubd) those from 4 accumulators.

    This resulted in just 12 instructions to handle 32 tests.


    That sounds suboptimal.
    By unrolling the outer loop by 2 or 3 you can greatly reduce the number of
    memory accesses per comparison. The speedup would depend on the specific
    microarchitecture, but I would guess that at least a 1.2x speedup is
    available here, especially when the data is not aligned.

    Anton already replied; as he wrote, the total loop overhead is just three
    instructions, all of which can (and will?) overlap with the AVX
    instructions.

    Due to the combined AVX and 4x unroll, the original scalar code is
    already unrolled 32x, so the loop overhead can mostly be ignored.

    If the cpu has enough resources to run more than one 32-byte AVX
    instruction per cycle, then the same code will allow all four copies to
    run at the same time, but the timing I see on my laptop (93 ps)
    corresponds closely to one AVX op/cycle.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Thu Feb 6 15:44:38 2025
    On Thu, 6 Feb 2025 13:57:12 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Wed, 5 Feb 2025 18:10:03 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    As SIMD no longer requires alignment, presumably code no longer
    does so.

    Yes, if you use AVX/AVX2, you don't encounter this particular
    Intel stupidity.

    Recently, on the last day (Dec 25th) of Advent of Code, I had a
    problem which lent itself to using 32-bit bitmaps: The task was to
    check which locks were compatible with which keys, so I ended up
    with code like this:


    let mut part1 = 0;
    for l in li..keylocks.len() {
    let lock = keylocks[l];
    for k in 0..li {
    let sum = lock & keylocks[k];
    if sum == 0 {
    part1 += 1;
    }
    }
    }

    Telling the rust compiler to target my AVX2-capable laptop CPU (an
    Intel i7), I got code that simply amazed me: The compiler unrolled
    the inner loop by 32, ANDing 4 x 8 keys by 8 copies of the current
    lock into 4 AVX registers (vpand), then comparing with a zeroed
    register (vpcmpeqd) (generating -1/0 results) before subtracting
    (vpsubd) those from 4 accumulators.

    This resulted in just 12 instructions to handle 32 tests.


    That sounds suboptimal.
    By unrolling the outer loop by 2 or 3 you can greatly reduce the number
    of memory accesses per comparison. The speedup would depend on the
    specific microarchitecture, but I would guess that at least a 1.2x
    speedup is available here, especially when the data is not aligned.

    Anton already replied; as he wrote, the total loop overhead is just
    three instructions, all of which can (and will?) overlap with the AVX
    instructions.


    It's not about loop overhead. See below.

    Due to the combined AVX and 4x unroll, the original scalar code is
    already unrolled 32x, so the loop overhead can mostly be ignored.

    If the cpu has enough resources to run more than one 32-byte AVX
    instruction per cycle, then the same code will allow all four copies
    to run at the same time, but the timing I see on my laptop (93 ps) corresponds closely to one AVX op/cycle.

    Terje


    What CPU?
    I am not aware of any mainline AVX2-equipped Intel CPUs that cannot
    run at least 2 256-bit integer ALU instructions simultaneously.
    For Ice Lake or later, they have 3 symmetric vector ALUs capable of
    executing any of the instructions in your mix.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Thu Feb 6 15:28:08 2025
    On Thu, 06 Feb 2025 10:59:39 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    This resulted in just 12 instructions to handle 32 tests.


    That sounds suboptimal.
    By unrolling outer loop by 2 or 3 you can greatly reduce the number
    of memory accesses per comparison.

    Looking at the inner loop code shown in <2025Feb6.113049@mips.complang.tuwien.ac.at>, the 12 instructions do
    not include the loop overhead and are already unrolled by a factor of
    4 (32 for the scalar code). The loop overhead is 3 instructions, for
    a total of 15 instructions per iteration.

    The speed up would depend on specific
    microarchiture, but I would guess that at least 1.2x speedup is
    here.

    Even if you completely eliminate the loop overhead, the number of instructions is reduced by at most a factor 1.25, and I expect that
    the speedup from further unrolling is a factor of at most 1 on most
    CPUs (factor <1 can come from handling the remaining elements slowly,
    which does not seem unlikely for code coming out of gcc and clang).

    - anton

    The point of my proposal is not reduction of loop overhead and not
    reduction of the # of x86 instructions (in fact, with my proposal the #
    of x86 instructions is increased), but reduction in # of uOps due to
    reuse of loaded values.
    The theory behind it is that, most typically, in code with very high
    IPC like the one above the main bottleneck is the # of uOps that flow
    through the rename stage.

    Not counting loop overhead, the original 1x4 inner loop consists of 12
    instructions, 16 uOps. Suppose we replace it by a 2x2 inner loop that
    does the same amount of work. The new inner loop contains only RISC-like
    instructions - 14 instructions, 14 uOps.
    With a 3x2 inner loop there are 20 instructions, 20 uOps, and 1.5x more
    work done per iteration.
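
    For concreteness, a minimal sketch of the 2x2 shape in C with AVX2
    intrinsics (hypothetical function name and signature; tails where the
    counts are not multiples of 2 and 16 are ignored). Two locks are
    broadcast once per outer iteration and reused against two 8-wide key
    vectors, so each loaded key vector serves two locks: per inner
    iteration 2 loads + 4 vpand + 4 vpcmpeqd + 4 vpsubd = 14 RISC-like
    instructions for the same 32 tests.

    #include <stdint.h>
    #include <immintrin.h>

    static int count_2x2(const uint32_t *keys, int nkeys,
                         const uint32_t *locks, int nlocks)
    {
      __m256i zero = _mm256_setzero_si256();
      __m256i acc0 = zero, acc1 = zero;
      for (int l = 0; l + 2 <= nlocks; l += 2) {        /* outer unroll by 2 */
        __m256i lockA = _mm256_set1_epi32((int)locks[l]);
        __m256i lockB = _mm256_set1_epi32((int)locks[l + 1]);
        for (int k = 0; k + 16 <= nkeys; k += 16) {     /* inner unroll by 2 */
          __m256i k0 = _mm256_loadu_si256((const __m256i *)&keys[k]);
          __m256i k1 = _mm256_loadu_si256((const __m256i *)&keys[k + 8]);
          acc0 = _mm256_sub_epi32(acc0,
                   _mm256_cmpeq_epi32(_mm256_and_si256(lockA, k0), zero));
          acc1 = _mm256_sub_epi32(acc1,
                   _mm256_cmpeq_epi32(_mm256_and_si256(lockA, k1), zero));
          acc0 = _mm256_sub_epi32(acc0,
                   _mm256_cmpeq_epi32(_mm256_and_si256(lockB, k0), zero));
          acc1 = _mm256_sub_epi32(acc1,
                   _mm256_cmpeq_epi32(_mm256_and_si256(lockB, k1), zero));
        }
      }
      uint32_t lane[16];               /* horizontal sum of the lane counters */
      _mm256_storeu_si256((__m256i *)&lane[0], acc0);
      _mm256_storeu_si256((__m256i *)&lane[8], acc1);
      int sum = 0;
      for (int i = 0; i < 16; i++)
        sum += (int)lane[i];
      return sum;
    }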

    Another factor that can contribute to a speedup is the increased number
    of iterations in the inner loop - from 1..7 iterations in the original to
    1..15 in both of my above-mentioned variants.


    Yet another possibility is to follow the "work harder not smarter"
    principle, i.e. process the whole square rather than just the relevant
    triangle. The main gain is that the loop detector would be able to predict
    the # of iterations in the inner loop, avoiding a mispredicted branch at
    the end. If we follow this path then it probably makes sense to
    not unroll the inner loop beyond the SIMD factor of 8 and instead unroll
    the outer loop by 4.
    Going by intuition, in this particular application "smarter" wins
    over "harder", but we know that intuition sucks. Including mine :(

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Thu Feb 6 13:54:55 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 06 Feb 2025 10:59:39 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    This resulted in just 12 instructions to handle 32 tests.


    That sounds suboptimal.
    By unrolling outer loop by 2 or 3 you can greatly reduce the number
    of memory accesses per comparison.
    ...
    The point of my proposal is not reduction of loop overhead and not
    reduction of the # of x86 instructions (in fact, with my proposal the #
    of x86 instructions is increased), but reduction in # of uOps due to
    reuse of loaded values.
    The theory behind it is that most typically in code with very high
    IPC like the one above the main bottleneck is the # of uOps that flows through rename stage.

    Not counting loop overhead, an original 1x4 inner loop consists of 12 instructions, 16 uops. Suppose, we replace it by 2x2 inner loop that
    does the same amount of work. New inner loop contains only RISC-like instructions - 14 instructions, 14 uOps.
    With 3x2 inner loop there are 20 instruction, 20 uOps and 1.5x more
    work done per iteration.

    I completely missed the "outer" in your response. Yes, looking at the
    original loop again:

    let mut part1 = 0;
    for l in li..keylocks.len() {
        let lock = keylocks[l];
        for k in 0..li {
            let sum = lock & keylocks[k];
            if sum == 0 {
                part1 += 1;
            }
        }
    }

    you can reuse the keylocks[k] value by unrolling the outer loop.
    E.g., if you unroll the outer loop by a factor of 4 (and the inner
    loop not beyond getting SIMD width), you can use almost the same code
    as clang produces, but you load keylocks[0..li] 4 times less often,
    and if the bottleneck for the inner-loop-only-optimized variant is
    bandwidth to the memory subsystem (it seems that Terje Mathisen worked
    with up to 62500 values, i.e., 250KB, i.e. L2), which is likely, there
    may be quite a bit of speedup.

    E.g., for Zen5 the bandwidth to L2 is reported to be 32 bytes/cycle,
    which would limit the performance to need at least 4 cycles/iteration
    (3.75 IPC), possibly less due to misalignment handling overhead, and
    using AVX-512 would not help, whereas with reusing the loaded value
    the limit would probably be resources, and AVX-512 would see quite a
    bit of speedup over AVX2.

    Another factor that can contribute to a speedup is increased number
    of iterations in the inner loop - from 1..7 iterations in original to
    1..15 in both of my above mentioned variants.

    Yes. I actually don't see a reason to unroll the inner loop more than
    needed for the SIMD instructions at hand, unless the number of
    outer-loop iterations is too small. If you want more unrolling,
    unroll the outer loop more.

    Yet another possibility is to follow "work harder not smarter"
    principle, i.e. process the whole square rather than just a relevant triangle.

    I don't see a triangle in the code above. There may be some more
    outer loop involved that varies li from 0 to keylocks.len() or
    something, but the code that is presented processes a square.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Thu Feb 6 16:49:56 2025
    On Thu, 06 Feb 2025 13:54:55 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 06 Feb 2025 10:59:39 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    This resulted in just 12 instructions to handle 32 tests.


    That sounds suboptimal.
    By unrolling outer loop by 2 or 3 you can greatly reduce the
    number of memory accesses per comparison.
    ...
    The point of my proposal is not reduction of loop overhead and not reduction of the # of x86 instructions (in fact, with my proposal
    the # of x86 instructions is increased), but reduction in # of uOps
    due to reuse of loaded values.
    The theory behind it is that most typically in code with very high
    IPC like the one above the main bottleneck is the # of uOps that
    flows through rename stage.

    Not counting loop overhead, an original 1x4 inner loop consists of 12 instructions, 16 uops. Suppose, we replace it by 2x2 inner loop that
    does the same amount of work. New inner loop contains only RISC-like instructions - 14 instructions, 14 uOps.
    With 3x2 inner loop there are 20 instruction, 20 uOps and 1.5x more
    work done per iteration.

    I completely missed the "outer" in your response. Yes, looking at the original loop again:

    let mut part1 = 0;
    for l in li..keylocks.len() {
        let lock = keylocks[l];
        for k in 0..li {
            let sum = lock & keylocks[k];
            if sum == 0 {
                part1 += 1;
            }
        }
    }

    you can reuse the keylocks[k] value by unrolling the outer loop.
    E.g., if you unroll the outer loop by a factor of 4 (and the inner
    loop not beyond getting SIMD width), you can use almost the same code
    as clang produces, but you load keylocks[0..li] 4 times less often,
    and if the bottleneck for the inner-loop-only-optimized variant is
    bandwidth to the memory subsystem (it seems that Terje Mathisen worked
    with up to 62500 values, i.e., 250KB, i.e. L2), which is likely, there
    may be quite a bit of speedup.


    My understanding is different. There are only 2x250 values = 2000 bytes.

    E.g., for Zen5 the bandwidth to L2 is reported to be 32 bytes/cycle,
    which would limit the performance to need at least 4 cycles/iteration
    (3.75 IPC), possibly less due to misalignment handling overhead, and
    using AVX-512 would not help, whereas with reusing the loaded value
    the limit would probably be resources, and AVX-512 would see quite a
    bit of speedup over AVX2.

    Another factor that can contribute to a speedup is increased number
    of iterations in the inner loop - from 1..7 iterations in original to
    1..15 in both of my above mentioned variants.

    Yes. I actually don't see a reason to unroll the inner loop more than
    needed for the SIMD instructions at hand, unless the number of
    outer-loop iterations is too small. If you want more unrolling,
    unroll the outer loop more.


    Now, when I think about it, maybe it makes sense to not unroll the inner
    loop at all, even by SIMD? Do all the unrolling on the outer side? I.e.
    process 32 iterations of the outer loop at once. The 32 "outer" locks are
    held in 4 SIMD registers, and the inner loop loads values with
    'vbroadcastss ymm, m32'.
    That would completely eliminate any alignment concerns.

    Yet another possibility is to follow "work harder not smarter"
    principle, i.e. process the whole square rather than just a relevant triangle.

    I don't see a triangle in the code above. There may be some more
    outer loop involved that varies li from 0 to keylocks.len() or
    something, but the code that is presented processes a square.

    - anton

    My mistake. Somehow I misread the code like
    for l in li..keylocks.len()
    for k in 0..l

    Please ignore "smarter vs harder" part.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Thu Feb 6 16:00:42 2025
    Michael S wrote:
    On Thu, 06 Feb 2025 10:59:39 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    This resulted in just 12 instructions to handle 32 tests.


    That sounds suboptimal.
    By unrolling outer loop by 2 or 3 you can greatly reduce the number
    of memory accesses per comparison.

    Looking at the inner loop code shown in
    <2025Feb6.113049@mips.complang.tuwien.ac.at>, the 12 instructions do
    not include the loop overhead and are already unrolled by a factor of
    4 (32 for the scalar code). The loop overhead is 3 instructions, for
    a total of 15 instructions per iteration.

    The speed up would depend on specific
    microarchitecture, but I would guess that at least 1.2x speedup is
    here.

    Even if you completely eliminate the loop overhead, the number of
    instructions is reduced by at most a factor 1.25, and I expect that
    the speedup from further unrolling is a factor of at most 1 on most
    CPUs (factor <1 can come from handling the remaining elements slowly,
    which does not seem unlikely for code coming out of gcc and clang).

    - anton

    The point of my proposal is not reduction of loop overhead and not
    reduction of the # of x86 instructions (in fact, with my proposal the #
    of x86 instructions is increased), but reduction in # of uOps due to
    reuse of loaded values.
    The theory behind it is that most typically in code with very high
    IPC like the one above the main bottleneck is the # of uOps that flows through rename stage.

    Aha! I see what you mean: Yes, this would be better if the

    VPAND reg,reg,[mem]

    instructions actually took more than one cycle each, but as the size of
    the arrays was just 1000 bytes each (250 keys + 250 locks), everything
    fits easily in $L1. (BTW, I did try to add 6 dummy keys and locks just
    to avoid any loop end overhead, but that actually ran slower.)

    Not counting loop overhead, an original 1x4 inner loop consists of 12 instructions, 16 uops. Suppose, we replace it by 2x2 inner loop that
    does the same amount of work. New inner loop contains only RISC-like instructions - 14 instructions, 14 uOps.
    With 3x2 inner loop there are 20 instruction, 20 uOps and 1.5x more
    work done per iteration.

    Another factor that can contribute to a speedup is increased number
    of iterations in the inner loop - from 1..7 iterations in original to
    1..15 in both of my above mentioned variants.


    Yet another possibility is to follow "work harder not smarter"
    principle, i.e. process the whole square rather than just a relevant triangle. The main gain is that loop detector would be able to predict
    the # of iterations in the inner loop, avoiding mispredicted branch at
    the end. If we follow this path then it probably makes sense to
    not unroll an inner loop beyond SIMD factor of 8 and instead unroll an
    outer loop by 4.
    Going by intuition, in this particular application "smarter" wins
    over "harder", but we know that intuition sucks. Including mine :(

    Yeah, it was the actual measured performance which amazed me. :-)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Thu Feb 6 15:58:07 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    Concerning SIMD: trouble here is increasing vector length and
    consequently increasing alignment requirements.

    That is not a necessary consequence, on the contrary: alignment
    requirements based on SIMD granularity is hardware designer lazyness,
    but means that SIMD cannot be used for many of the applications where
    SIMD without that limitation can be used.

    If you want to have alignment checks, then a SIMD instruction should
    check for element alignment, not for SIMD alignment.

    But the computer architecture trend is clear: General-purpose
    computers do not have alignment restrictions; all that had them have
    been discontinued; the last one that had them was SPARC.

    The trend is clear, but there is a question: is it a good trend?
    You wrote about lazy hardware designers, but there are many
    more lazy programmers. There are situations where unaligned
    access is needed, but a significant proportion of unaligned
    accesses is not needed at all. At best such unaligned
    accesses lead to a small performance loss, but they may also
    be latent bugs. There are cases where unaligned accesses
    are better than aligned ones; for those the architecture
    should have appropriate instructions.

    A lot of SIMD
    code is memory-bound and the current way of doing misaligned
    accesses leads to worse performance. So there is really no good way
    to solve this. In principle a set of buffers of 2 cache lines
    each and appropriate shifters could give optimal throughput,
    but would probably lead to increased latency.

    AFAIK that's what current microarchitectures do, and in many cases
    with small penalties for unaligned accesses; see https://www.complang.tuwien.ac.at/anton/unaligned-stores/

    You call doubling the store time a 'small penalty'. For me, in a
    performance-critical loop 10% matters and it is worth
    aligning things to avoid such a loss. And what you present
    does not look like what I wrote above: AFAICS what Intel
    does is within a single cache line and there is a penalty when
    crossing lines (with 2-cache-line buffers there would be
    no penalty for line crossing).

    For me loads are much more important. First, there are more of
    them. Second, stores can be buffered and the latency of the store itself
    is of little importance (the latency from store to load matters).
    For loads, extra things in the load path increase latency and that
    may limit program speed.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Terje Mathisen on Thu Feb 6 17:47:30 2025
    Terje Mathisen wrote:
    Michael S wrote:
    The point of my proposal is not reduction of loop overhead and not
    reduction of the # of x86 instructions (in fact, with my proposal the #
    of x86 instructions is increased), but reduction in # of uOps due to
    reuse of loaded values.
    The theory behind it is that most typically in code with very high
    IPC like the one above the main bottleneck is the # of uOps that flows
    through rename stage.

    Aha! I see what you mean: Yes, this would be better if the

      VPAND reg,reg,[mem]

    instructions actually took more than one cycle each, but as the size of
    the arrays were just 1000 bytes each (250 keys + 250 locks), everything
    fits easily in $L1. (BTW, I did try to add 6 dummy keys and locks just
    to avoid any loop end overhead, but that actually ran slower.)

    I've just tested it by running either 2 or 4 locks in parallel in the
    inner loop: The fastest time I saw actually did drop a smidgen, from
    5800 ns to 5700 ns (for both 2 and 4 wide), with 100 ns being the timing resolution I get from the Rust run_benchmark() function.

    So yes, it is slightly better to run a stripe instead of just a single
    row in each outer loop.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Thu Feb 6 18:19:09 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    But the computer architecture trend is clear: General-purpose
    computers do not have alignment restrictions; all that had them have
    been discontinued; the last one that had them was SPARC.

    The trend is clear, but there is a question: is it a good trend?
    You wrote about lazy hardware designers, but there are many
    more lazy programmers.

    Lazy programmers use high-level languages which align everything
    anyway.

    There are situations when unaligned
    access is needed, but significant proportion of unaligned
    accesses is not needed at all.

    What evidence do you have for this claim?

    At best such unaligned
    accesses lead to small performance loss,

    They may also lead to a small performance win.

    I tried to turn on alignment checks:

    First on IA-32: There I found that memcpy() etc. uses unaligned
    accesses, but I could replace these functions. But then I found that
    8-byte FP numbers are aligned at 4-byte boundaries because the ABI
    says so, but the alignment check faults in that case. So I gave up on
    turning on alignment checks.

    Later on AMD64: The ABI does not have that bug there, and I worked
    around memcpy() etc. However, I found that gcc produced unaligned
    2-byte accesses (rather than 2 1-byte accesses) for things like strcpy("w",var). I did not find a way to suppress that code
    generation feature of gcc, so I gave up on this attempt.
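
    For reference, a minimal sketch of how such an experiment can be set up
    on AMD64 (assuming GCC-style inline assembly and an OS that has set
    CR0.AM, so that user-mode alignment checking actually takes effect):
    toggling the AC bit (bit 18) in EFLAGS makes unaligned data accesses
    fault with #AC.

    #include <stdint.h>

    /* toggle EFLAGS.AC (bit 18); a sketch, not the exact code used above */
    static void set_alignment_check(int on)
    {
      uint64_t flags;
      __asm__ volatile ("pushfq\n\tpopq %0" : "=r"(flags));
      if (on)
        flags |= (uint64_t)1 << 18;
      else
        flags &= ~((uint64_t)1 << 18);
      __asm__ volatile ("pushq %0\n\tpopfq" : : "r"(flags) : "cc");
    }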

    Did I find any cases on AMD64 where I think there will be a
    performance loss? No, on the contrary, I expect that, on average, the
    2-byte accesses will be faster than two one-byte accesses. And
    unaligned accesses on memcpy() are clearly a win over accessing the
    memory byte-by-byte.

    There are cases where unaligned accesses
    are better than aligned ones; for those the architecture
    should have appropriate instructions.

    SSE has MOVDQU and MOVDQA. MOVDQA is completely pointless, because it
    checks for 16-byte alignment, rather than element alignment. If
    designed properly, it would have MOVDQ2A, MOVDQ4A, MOVDQ8A. But do we
    actually need it? The experience mentioned above indicates that we
    don't.

    You call doubling store time 'small penalty'. For me in
    performance critical loop 10% matter and it is worth
    aligning things to avoid such loss.

    The question is how much of the loop is spent in loads and stores, and
    how do you avoid the unaligned accesses: E.g., for the case mentioned
    earlier

    for (i=0; i<n; i++)
      a[i] = b[i] + c[i];

    For performance reasons, you want to use SIMD instructions for that
    and align each SIMD memory access to SIMD granularity. But what if a,
    b, c have different starting points modulo the SIMD granularity?
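
    As a concrete example of the usual resolution (a sketch, assuming float
    data; the scalar tail handles whatever does not fill a full vector):
    use unaligned SIMD loads and stores for all three arrays and let the
    hardware absorb the, usually small, misalignment penalty, instead of
    trying to peel the loop so that one of the arrays becomes aligned.

    #include <stddef.h>
    #include <immintrin.h>

    void add_arrays(float *a, const float *b, const float *c, size_t n)
    {
      size_t i = 0;
      for (; i + 8 <= n; i += 8) {       /* AVX: 8 floats per iteration */
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(vb, vc));
      }
      for (; i < n; i++)                 /* scalar tail */
        a[i] = b[i] + c[i];
    }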

    For me much more important are loads.

    My data for loads is older (and for older hardware): <http://al.howardknight.net/?ID=143135464800>. But the links to the
    benchmarks are there, you can measure it on modern hardware. Maybe I
    will find the time at some point and measure modern hardware.

    First, there is more of
    them. Second, stores can be buffered and latency of store itself
    is of little importance (latency from store to load matters).
    For loads extra things in load path increase latency and that
    may limit program speed.

    I notice that the SiFive CPUs have no proper hardware support for
    unaligned accesses and have much lower clock rate than the Intel, AMD,
    ARM, Apple, and Qualcomm cores that support unaligned accesses. So
    the evidence does not support your claim.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Thu Feb 6 21:19:32 2025
    On Thu, 6 Feb 2025 17:47:30 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Terje Mathisen wrote:
    Michael S wrote:
    The point of my proposal is not reduction of loop overhead and not
    reduction of the # of x86 instructions (in fact, with my proposal
    the # of x86 instructions is increased), but reduction in # of
    uOps due to reuse of loaded values.
    The theory behind it is that most typically in code with very high
    IPC like the one above the main bottleneck is the # of uOps that
    flows through rename stage.

    Aha! I see what you mean: Yes, this would be better if the

      VPAND reg,reg,[mem]

    instructions actually took more than one cycle each, but as the
    size of the arrays were just 1000 bytes each (250 keys + 250
    locks), everything fits easily in $L1. (BTW, I did try to add 6
    dummy keys and locks just to avoid any loop end overhead, but that
    actually ran slower.)

    I've just tested it by running either 2 or 4 locks in parallel in the
    inner loop: The fastest time I saw actually did drop a smidgen, from
    5800 ns to 5700 ns (for both 2 and 4 wide), with 100 ns being the
    timing resolution I get from the Rust run_benchmark() function.

    So yes, it is slightly better to run a stripe instead of just a
    single row in each outer loop.

    Terje


    Assuming that your CPU is new and runs at a decent frequency (4-4.5 GHz),
    the results are 2-3 times slower than expected. I would guess that it
    happens because there are too few iterations in the inner loop.
    Turning the unrolling upside down, as I suggested in the previous post,
    should fix it.
    Very easy to do in C with intrinsics. Probably not easy in Rust.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Thu Feb 6 21:36:38 2025
    Michael S wrote:
    On Thu, 6 Feb 2025 17:47:30 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Terje Mathisen wrote:
    Michael S wrote:
    The point of my proposal is not reduction of loop overhead and not
    reduction of the # of x86 instructions (in fact, with my proposal
    the # of x86 instructions is increased), but reduction in # of
    uOps due to reuse of loaded values.
    The theory behind it is that most typically in code with very high
    IPC like the one above the main bottleneck is the # of uOps that
    flows through rename stage.

    Aha! I see what you mean: Yes, this would be better if the

      VPAND reg,reg,[mem]

    instructions actually took more than one cycle each, but as the
    size of the arrays were just 1000 bytes each (250 keys + 250
    locks), everything fits easily in $L1. (BTW, I did try to add 6
    dummy keys and locks just to avoid any loop end overhead, but that
    actually ran slower.)

    I've just tested it by running either 2 or 4 locks in parallel in the
    inner loop: The fastest time I saw actually did drop a smidgen, from
    5800 ns to 5700 ns (for both 2 and 4 wide), with 100 ns being the
    timing resolution I get from the Rust run_benchmark() function.

    So yes, it is slightly better to run a stripe instead of just a
    single row in each outer loop.

    Terje


    Assuming that your CPU is new and runs at decent frequency (4-4.5 GHz),
    the results are 2-3 times slower than expected. I would guess that it
    happens because there are too few iterations in the inner loop.
    Turning unrolling upside down, as I suggested in the previous post,
    should fix it.
    Very easy to do in C with intrinsic. Probably not easy in Rust.

    I did mention that this is a (cheap) laptop? It is about 15 months old,
    and with a base frequency of 2.676 GHz. I guess that would explain most
    of the difference between what I see and what you expected?

    BTW, when I timed 1000 calls to that 5-6 us program, to get around the
    100 ns timer resolution, each iteration ran in 5.23 us.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Thu Feb 6 23:32:00 2025
    On Thu, 6 Feb 2025 21:36:38 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 6 Feb 2025 17:47:30 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Terje Mathisen wrote:
    Michael S wrote:
    The point of my proposal is not reduction of loop overhead and
    not reduction of the # of x86 instructions (in fact, with my
    proposal the # of x86 instructions is increased), but reduction
    in # of uOps due to reuse of loaded values.
    The theory behind it is that most typically in code with very
    high IPC like the one above the main bottleneck is the # of uOps
    that flows through rename stage.

    Aha! I see what you mean: Yes, this would be better if the

      VPAND reg,reg,[mem]

    instructions actually took more than one cycle each, but as the
    size of the arrays were just 1000 bytes each (250 keys + 250
    locks), everything fits easily in $L1. (BTW, I did try to add 6
    dummy keys and locks just to avoid any loop end overhead, but that
    actually ran slower.)

    I've just tested it by running either 2 or 4 locks in parallel in
    the inner loop: The fastest time I saw actually did drop a
    smidgen, from 5800 ns to 5700 ns (for both 2 and 4 wide), with 100
    ns being the timing resolution I get from the Rust run_benchmark()
    function.

    So yes, it is slightly better to run a stripe instead of just a
    single row in each outer loop.

    Terje


    Assuming that your CPU is new and runs at decent frequency (4-4.5
    GHz), the results are 2-3 times slower than expected. I would guess
    that it happens because there are too few iterations in the inner
    loop. Turning unrolling upside down, as I suggested in the previous
    post, should fix it.
    Very easy to do in C with intrinsic. Probably not easy in Rust.

    I did mention that this is a (cheap) laptop? It is about 15 months
    old, and with a base frequency of 2.676 GHz.

    You describe it in different ways but omit the only one that would give us sufficient information - the CPU model number.

    I guess that would
    explain most of the difference between what I see and what you
    expected?

    BTW, when I timed 1000 calls to that 5-6 us program, to get around
    the 100 ns timer resolution, each iteration ran in 5.23 us.

    Terje



    That measurement could be good enough on a desktop. Or not.
    It is certainly not good enough on a laptop and even less so on a server.
    On a laptop I wouldn't be satisfied before I lock my program to a
    particular core, then do something like 21 measurements with 100K calls
    in each measurement (~10 sec total) and report the median of the 21.
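
    A minimal sketch of that measurement scheme in C (hypothetical
    benchmark_once() standing in for the code under test; POSIX
    clock_gettime() for timing; pinning the thread to a core is OS-specific
    and omitted here):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MEAS  21
    #define CALLS 100000

    extern void benchmark_once(void);   /* the code under test (assumed) */

    static int cmp_double(const void *a, const void *b)
    {
      double x = *(const double *)a, y = *(const double *)b;
      return (x > y) - (x < y);
    }

    int main(void)
    {
      double t[MEAS];
      for (int m = 0; m < MEAS; m++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < CALLS; i++)
          benchmark_once();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        t[m] = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
      }
      qsort(t, MEAS, sizeof t[0], cmp_double);
      printf("median: %.3f us/call\n", t[MEAS / 2] / CALLS * 1e6);
      return 0;
    }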

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Feb 7 02:28:09 2025
    On Thu, 6 Feb 2025 23:34:27 +0000, BGB wrote:

    On 2/6/2025 2:36 PM, Terje Mathisen wrote:
    Michael S wrote:
    On Thu, 6 Feb 2025 17:47:30 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:


    FWIW: The idea of running a CPU at 4+ GHz seems a bit much (IME, CPUs
    tend to run excessively hot at these kinds of clock speeds; 3.2 to 3.6 GHz
    seems more reasonable so that it "doesn't melt", or have thermal
    throttling or stability issues).

    Does the idea of using all 500 HP of your car give you similar
    reservations ?!?

    Then again CPU heat production is between quadratic and cubic
    wrt frequency... it is k×V^2×f and we have to raise the voltage
    in order to run at higher frequencies.

    <snip>


    A smaller pagefile still exists on the SSD, but mostly because Windows
    is unhappy if there is no pagefile on 'C'. Don't generally want a
    pagefile on an SSD though as it is worse for lifespan (but, it is 8GB,
    which Windows accepts; with around 192GB each on the other drives, for ~ 400GB of swap space).

    I have not had a swap file on C since 1997!
    I have a separate SATA drive for swap.
    The only time I use the C drive for swap is during the initial bootup and
    configuration of the system. Afterwards, I install the swap drive,
    call it S, and allow the system to use 95% of it. It is unstable when
    using 100% of the space available.
    I also have OS and applications on C
    but all my files are on M or P;
    so a reload does not damage any of my work files (just my time)

    Not sure how well Windows load-balances swap, apparently not very well
    though (when it starts paging, most of the load seems to be on one
    drive; better if it could give a more even spread).

    The SSD seems to get ~ 300 MB/sec.

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Fri Feb 7 12:41:38 2025
    On Fri, 7 Feb 2025 11:06:43 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 6 Feb 2025 21:36:38 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    BTW, when I timed 1000 calls to that 5-6 us program, to get around
    the 100 ns timer resolution, each iteration ran in 5.23 us.

    That measurement could be good enough on desktop. Or not.
    It certainly not good enough on laptop and even less so on server.
    On laptop I wouldn't be satisfied before I lock my program to a
    particular core, then do something like 21 measurements with 100K calls
    calls in each measurement (~10 sec total) and report median of 21.

    Each measurement did 1000 calls, then I ran 100 such measurements.
    The 5.23 us value was the lowest seen among the 100, with average a
    bit more:


    Slowest: 9205200 ns
    Fastest: 5247500 ns
    Average: 5672529 ns/iter
    Part1 = 3338

    My own (old, but somewhat kept up to date) cputype program reported
    that it is a "13th Gen Intel(R) Core(TM) i7-1365U" according to CPUID.

    Is that sufficient to judge the performance?

    Terje


    Not really.
    The i7-1365U is a complicated beast. 2 "big" cores, 8 "medium" cores.
    Frequency varies A LOT, 1.8 to 5.2 GHz on "big", 1.3 to 3.9 GHz on
    "medium".
    As I said above, on such a CPU I wouldn't believe the numbers before
    the total duration of the test is 10 seconds and the test run is locked
    to a particular core. As to 5 msec per measurement, that's enough, but why
    not do longer measurements if you have to run for 10 sec anyway?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Fri Feb 7 15:23:51 2025
    Michael S wrote:
    On Fri, 7 Feb 2025 11:06:43 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 6 Feb 2025 21:36:38 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    BTW, when I timed 1000 calls to that 5-6 us program, to get around
    the 100 ns timer resolution, each iteration ran in 5.23 us.

    That measurement could be good enough on desktop. Or not.
    It certainly not good enough on laptop and even less so on server.
    On laptop I wouldn't be satisfied before I lock my program to a
    particular core, then do something like 21 measurements with 100K
    calls in each measurement (~10 sec total) and report median of 21.

    Each measurement did 1000 calls, then I ran 100 such measurements.
    The 5.23 us value was the lowest seen among the 100, with average a
    bit more:


    Slowest: 9205200 ns
    Fastest: 5247500 ns
    Average: 5672529 ns/iter
    Part1 = 3338

    My own (old, but somewhat kept up to date) cputype program reported
    that it is a "13th Gen Intel(R) Core(TM) i7-1365U" according to CPUID.

    Is that sufficient to judge the performance?

    Terje


    Not really.
    i7-1365U is a complicated beast. 2 "big" cores, 8 "medium" cores.
    Frequency varies ALOT, 1.8 to 5.2 GHz on "big", 1.3 to 3.9 GHz on
    "medium".

    OK. It seems like the big cores are similar to what I've had previously,
    i.e. each core supports hyperthreading, while the medium ones don't.
    This results in 12 HW threads.

    As I said above, on such CPU I wouldn't believe the numbers before
    total duration of test is 10 seconds and the test run is locked to
    particular core. As to 5 msec per measurement, that's enough, but why
    not do longer measurements if you have to run for 10 sec anyway?

    The Advent of Code task required exactly 250 keys and 250 locks to be
    tested, this of course fits easily in a corner of $L1 (2000 bytes).

    The input file to be parsed was 43*500 = 21500 bytes long, so this
    should also fit in $L1 when I run repeated tests.

    Under Windows I can set thread affinity to lock a process to a given
    core, but how do I know which are "Big" and "Medium"?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Fri Feb 7 17:04:23 2025
    On Fri, 7 Feb 2025 15:23:51 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Fri, 7 Feb 2025 11:06:43 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 6 Feb 2025 21:36:38 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    BTW, when I timed 1000 calls to that 5-6 us program, to get
    around the 100 ns timer resolution, each iteration ran in 5.23
    us.

    That measurement could be good enough on desktop. Or not.
    It certainly not good enough on laptop and even less so on server.
    On laptop I wouldn't be satisfied before I lock my program to a
    particular core, then do something like 21 measurements with 100K
    calls in each measurement (~10 sec total) and report median of
    21.

    Each measurement did 1000 calls, then I ran 100 such measurements.
    The 5.23 us value was the lowest seen among the 100, with average a
    bit more:


    Slowest: 9205200 ns
    Fastest: 5247500 ns
    Average: 5672529 ns/iter
    Part1 = 3338

    My own (old, but somewhat kept up to date) cputype program reported
    that it is a "13th Gen Intel(R) Core(TM) i7-1365U" according to
    CPUID.

    Is that sufficient to judge the performance?

    Terje


    Not really.
    i7-1365U is a complicated beast. 2 "big" cores, 8 "medium" cores.
    Frequency varies ALOT, 1.8 to 5.2 GHz on "big", 1.3 to 3.9 GHz on
    "medium".

    OK. It seems like the big cores are similar to what I've had
    previously, i.e. each core supports hyperthreading, while the medium
    ones don't. This results in 12 HW threads.

    As I said above, on such CPU I wouldn't believe the numbers before
    total duration of test is 10 seconds and the test run is locked to particular core. As to 5 msec per measurement, that's enough, but
    why not do longer measurements if you have to run for 10 sec
    anyway?

    The Advent of Code task required exactly 250 keys and 250 locks to be
    tested, this of course fits easily in a corner of $L1 (2000 bytes).

    The input file to be parsed was 43*500 = 21500 bytes long, so this
    should also fit in $L1 when I run repeated tests.

    Under Windows I can set thread affinity to lock a process to a given
    core, but how do I know which are "Big" and "Medium"?

    Trial and error?
    I think big cores/threads tend to have the lower numbers, but I am not
    sure that is universal.



    Terje


    In the meantime,
    I did a few measurements on a Xeon E3 1271 v3. That is a rather old uArch -
    Haswell, the first core that supports AVX2. During the tests it was
    running at 4.0 GHz.

    1. Original code (rewritten in plain C) compiled with clang -O3 -march=ivybridge (no AVX2)
    2. Original code (rewritten in plain C) compiled with clang -O3 -march=haswell (AVX2)
    3. Manually vectorized AVX2 code compiled with clang -O3 -march=skylake (AVX2)

    Results were as follows (usec/call)
    1 - 5.66
    2 - 5.56
    3 - 2.18

    So, my measurements, similarly to your measurements, demonstrate that
    the clang autovectorized code looks good, but does not perform too well.


    Here is my manual code. The handling of the tail is too clever; I did not
    have time to simplify it. Otherwise, for 250x250 it should perform about
    the same as simpler code.

    #include <stdint.h>
    #include <immintrin.h>

    int foo_tst(const uint32_t* keylocks, int len, int li)
    {
      if (li >= len || li <= 0)
        return 0;
      const uint32_t* keyx = &keylocks[li];
      unsigned ni = len - li;
      __m256i res0 = _mm256_setzero_si256();
      __m256i res1 = _mm256_setzero_si256();
      __m256i res2 = _mm256_setzero_si256();
      __m256i res3 = _mm256_setzero_si256();
      const uint32_t* keyx_last = &keyx[ni & -32];
      for (; keyx != keyx_last; keyx += 32) {
        __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyx[0*8]);
        __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyx[1*8]);
        __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyx[2*8]);
        __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyx[3*8]);
        // for (int k = 0; k < li; ++k) {
        // for (int k = 0, nk = li; nk > 0; ++k, --nk) {
        for (const uint32_t* keyy = keylocks; keyy != &keylocks[li]; ++keyy) {
          // __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)&keylocks[k]));
          __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyy));
          res0 = _mm256_sub_epi32(res0,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
          res1 = _mm256_sub_epi32(res1,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
          res2 = _mm256_sub_epi32(res2,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
          res3 = _mm256_sub_epi32(res3,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
      }
      int res = 0;
      if (ni % 32) {
        uint32_t tmp[32];
        const uint32_t* keyy_last = &keylocks[li & -32];
        if (li % 32) {
          for (int k = 0; k < li % 32; ++k)
            tmp[k] = keyy_last[k];
          for (int k = li % 32; k < 32; ++k)
            tmp[k] = (uint32_t)-1;
        }
        const uint32_t* keyx_last = &keyx[ni % 32];
        int nz = 0;
        for (; keyx != keyx_last; keyx += 1) {
          if (*keyx) {
            __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyx));
            for (const uint32_t* keyy = keylocks; keyy != keyy_last; keyy += 32) {
              __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyy[0*8]);
              __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyy[1*8]);
              __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyy[2*8]);
              __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyy[3*8]);
              res0 = _mm256_sub_epi32(res0,
                _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
              res1 = _mm256_sub_epi32(res1,
                _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
              res2 = _mm256_sub_epi32(res2,
                _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
              res3 = _mm256_sub_epi32(res3,
                _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
            }
            if (li % 32) {
              __m256i lock0 = _mm256_loadu_si256((const __m256i*)&tmp[0*8]);
              __m256i lock1 = _mm256_loadu_si256((const __m256i*)&tmp[1*8]);
              __m256i lock2 = _mm256_loadu_si256((const __m256i*)&tmp[2*8]);
              __m256i lock3 = _mm256_loadu_si256((const __m256i*)&tmp[3*8]);
              res0 = _mm256_sub_epi32(res0,
                _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
              res1 = _mm256_sub_epi32(res1,
                _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
              res2 = _mm256_sub_epi32(res2,
                _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
              res3 = _mm256_sub_epi32(res3,
                _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
            }
          } else {
            nz += 1;
          }
        }
        res = nz * li;
      }
      // fold accumulators
      res0 = _mm256_add_epi32(res0, res2);
      res1 = _mm256_add_epi32(res1, res3);
      res0 = _mm256_add_epi32(res0, res1);
      res0 = _mm256_hadd_epi32(res0, res0);
      res0 = _mm256_hadd_epi32(res0, res0);

      res += _mm256_extract_epi32(res0, 0);
      res += _mm256_extract_epi32(res0, 4);
      return res;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Fri Feb 7 15:24:41 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 7 Feb 2025 15:23:51 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:


    Under Windows I can set thread affinity to lock a process to a given
    core, but how do I know which are "Big" and "Medium"?

    Trial and error?

    Check the ACPI tables, which should have that information for
    each core.
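
    An alternative to parsing the ACPI tables, sketched below under the
    assumption of a hybrid Intel part: CPUID leaf 0x1A reports the core
    type of the core the calling thread is currently running on (0x20 =
    Core/"P", 0x40 = Atom/"E"; non-hybrid parts return 0), so pinning the
    thread to each logical CPU in turn and asking gives the mapping. This
    is a Windows/MSVC sketch, limited to the first 64 logical CPUs.

    #include <windows.h>
    #include <intrin.h>
    #include <stdio.h>

    int main(void)
    {
      SYSTEM_INFO si;
      GetSystemInfo(&si);
      DWORD n = si.dwNumberOfProcessors;
      for (DWORD cpu = 0; cpu < n && cpu < 64; cpu++) {
        SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << cpu);
        Sleep(0);                 /* give the scheduler a chance to move us */
        int regs[4];
        __cpuidex(regs, 0x1A, 0);
        int core_type = (regs[0] >> 24) & 0xFF;
        printf("logical CPU %lu: core type 0x%02X (%s)\n",
               (unsigned long)cpu, core_type,
               core_type == 0x20 ? "P-core" :
               core_type == 0x40 ? "E-core" : "unknown/non-hybrid");
      }
      return 0;
    }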

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Sat Feb 8 08:11:04 2025
    Michael S <already5chosen@yahoo.com> writes:
    In the mean time.
    I did few measurements on Xeon E3 1271 v3. That is rather old uArch - Haswell, the first core that supports AVX2. During the tests it was
    running at 4.0 GHz.

    1. Original code (rewritten in plain C) compiled with clang -O3 -march=ivybridge (no AVX2) 2. Original code (rewritten in plain C)
    compiled with clang -O3 -march=haswell (AVX2) 3. Manually vectorized
    AVX2 code compiled with clang -O3 -march=skylake (AVX2)

    Results were as following (usec/call)
    1 - 5.66
    2 - 5.56
    3 - 2.18

    In the meantime, I also wrote the original code in plain C
    (keylocks1.c), then implemented your idea of unrolling the outer loop
    and comparing a subarray of locks to each key (is this called strip
    mining?) in plain C (with the hope that auto-vectorization works) as keylocks2.c, and finally rewrote the latter version to use gcc vector extensions (keylocks3.c). I wrote a dummy main around that that calls
    the routine 100_000 times; given that the original routine's
    performance does not depend on the data, and I used non-0 keys (so keylocks[23].c does not skip any keys), the actual data is not
    important.

    You can find the source code and the binaries I measured at <http://www.complang.tuwien.ac.at/anton/keylock/>. The binaries were
    compiled with gcc 12.2.0 and (in the clang subdirectory) clang-14.0.6;
    the clang compilations sometimes used different UNROLL factors than
    the gcc compilations (and I am unsure which; see below).

    The original code is:

    unsigned keylocks(unsigned keys[], unsigned nkeys, unsigned locks[], unsigned nlocks)
    {
      unsigned i, j;
      unsigned part1 = 0;
      for (i=0; i<nlocks; i++) {
        unsigned lock = locks[i];
        for (j=0; j<nkeys; j++)
          part1 += (lock & keys[j])==0;
      }
      return part1;
    }

    For keylocks2.c the central loops are:

    for (i=0; i<UNROLL; i++)
      part0[i]=0;
    for (i=0; i<nlocks1; i+=UNROLL) {
      for (j=0; j<nkeys1; j++) {
        unsigned key = keys1[j];
        for (k=0; k<UNROLL; k++)
          part0[k] += (locks1[i+k] & key)==0;
      }
    }

    For UNROLL I tried 8, 16, and 32 for AVX2 and 16, 32, or 64 for
    AVX-512; the numbers below are for those factors that produce the
    lowest cycles on the Rocket Lake machine.

    The central loops are preceded by code to arrange the data such that
    this code works: the locks are copied to the longer locks1; the length of
    locks1 is a multiple of UNROLL, and the entries beyond nlocks are ~0
    (to increase the count by 0), and the keys are copied to keys1 (with 0
    removed so that the extra locks are not counted; that also may
    increase efficiency if there is a key=0). The central loops are
    followed by summing up the elements of part0.
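
    A minimal sketch of that arrangement (using the names from the text;
    the actual keylocks2.c may differ in details):

      nlocks1 = (nlocks + UNROLL - 1) / UNROLL * UNROLL;
      for (i = 0; i < nlocks; i++)
        locks1[i] = locks[i];
      for (; i < nlocks1; i++)
        locks1[i] = ~0U;          /* matches no non-0 key, so it adds 0 */
      for (i = 0, nkeys1 = 0; i < nkeys; i++)
        if (keys[i] != 0)         /* a 0 key would "match" the ~0 padding */
          keys1[nkeys1++] = keys[i];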

    keylocks3.c, which uses the gcc vector extensions, just changes
    keylocks2.c in a few places. In particular, it adds a type vu:

    typedef unsigned vu __attribute__ ((vector_size (UNROLL*sizeof(unsigned))));

    The central loops now look as follows:

    for (i=0; i<UNROLL; i++)
      part0[i]=0;
    for (i=0; i<nlocks1; i+=UNROLL) {
      vu lock = *(vu *)(locks1+i);
      for (j=0; j<nkeys1; j++) {
        part0 -= (lock & keys1[j])==0;
      }
    }

    One interesting aspect of the gcc vector extensions is that the result
    of comparing two vectors is 0 (false) or ~0 (true) (per element),
    whereas for scalars the value for true is 1. Therefore the code above
    updates part0 with -=, whereas in keylocks2.c += is used.

    While the use of ~0 is a good choice when designing a new programming
    language, I would have gone for 1 in the case of a vector extension
    for C, for consistency with the scalar case; in combination with
    hardware that produces ~0 (e.g., Intel SSE and AVX SIMD stuff), that
    means that the compiler will introduce a negation in its intermediate representation at some point; I expect that compilers will usually be
    able to optimize this negation away, but I would not be surprised at
    cases where my expectation is disappointed.
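
    A tiny illustration of that point (a sketch with a 4-element vector
    type; the names are arbitrary): the per-element comparison yields 0 or
    ~0, i.e. -1, so subtracting the comparison result increments exactly
    the matching lanes.

      typedef unsigned v4u __attribute__ ((vector_size (4*sizeof(unsigned))));

      unsigned zero_lanes_demo(void)
      {
        v4u keys  = {0, 1, 0, 2};
        v4u count = {0, 0, 0, 0};
        /* (keys == 0) is {~0, 0, ~0, 0}, i.e. {-1, 0, -1, 0} per element */
        count -= (keys == 0);
        return count[0] + count[1] + count[2] + count[3];   /* 2 */
      }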

    keylocks3.c compiles without warning on clang, but the result usually
    segfaults (but sometimes does not, e.g., in the timed run on Zen4; it
    segfaults in other runs on Zen4). I have not investigated why this
    happens, I just did not include results from runs where it segfaulted;
    and I tried additional runs for keylocks3-512 on Zen4 in order to have
    one result there.

    I would have liked to compare the performance of my code against your
    code, but your code apparently was destroyed by arbitrary line
    breaking in your news-posting software. Anyway, here are my results.
    First cycles (which eliminates worries about turbo modes) and
    instructions, then usec/call.

    The cores are:

    Haswell: Core i7-4790K (similar to Michael S.'s CPU)
    Golden Cove: Core i3-1315U (same P-core as Terje Mathisen's laptop)
    Gracemont: Core i3-1315U (same E-core as Terje Mathisen's laptop)
    Rocket Lake: Xeon W-1370P (2021 Intel CPU)
    Zen4: Ryzen 7 8700G (2024)

    I also measured Tiger Lake (Core i5-1135G7, the CPU of a 2021 laptop),
    but the results were very close to the Rocket Lake results, so because
    of the limited table width, I do not show them.

    The first three cores do not support AVX-512, the others do.

    Cycles:
          Haswell   Golden Cove     Gracemont   Rocket Lake          Zen4
      1818_241431   1433_208539   1778_675623 2_365_664_737 1_677_853_186  gcc avx2 1
      1051_191216   1189_869807   1872_856423   981_948_517   727_418_069  gcc avx2 2 8
      1596_783872   1213_400891   2076_677426 1_690_280_182   913_746_088  gcc avx2 3 8
      2195_438821   1638_006130   2577_451872 2_291_743_879 1_617_970_157  clang avx2 1
      2757_454335   2151_198125   2506_427284 3_174_899_185 1_523_870_829  clang avx2 2 8?
                                                              638_374_463  clang avx2 3 8?
                                              1_139_175_457 1_219_164_672  gcc 512 1
                                                856_818_642   900_108_135  gcc 512 2 32
                                                866_077_594 1_072_172_449  gcc 512 3 16
                                              2_479_213_408 1_479_937_930  clang 512 1
       912_273706    936_311567                 847_289_380   634_826_441  clang 512 2 16?
                                                              636_278_210  clang 512 3 16?

    avx2 means: compiled with -mavx2; 512 means: compiled with
    -march=x86-64-v4 (I usually did not measure those on machines that do
    not support AVX-512, because I expected the results to not work; I
    later measured some clang's keylocks2-512 on some of those machines).
    The number behind that ist the keylocks[123].c variant, and the number
    behind that (if present) the UNROLL parameter. I am not sure about
    the UNROLL numbers used for clang, but in any case I kept what
    performed best on Rocket Lake. The number of instructions executed is (reported on the Zen4):

    instructions
    5_779_542_242 gcc avx2 1
    3_484_942_148 gcc avx2 2 8
    5_885_742_164 gcc avx2 3 8
    7_903_138_230 clang avx2 1
    7_743_938_183 clang avx2 2 8?
    3_625_338_104 clang avx2 3 8?
    4_204_442_194 gcc 512 1
    2_564_142_161 gcc 512 2 32
    3_061_042_178 gcc 512 3 16
    7_703_938_205 clang 512 1
    3_402_238_102 clang 512 2 16?
    3_320_455_741 clang 512 3 16?

    for gcc -mavx2 on keylocks3.c on Zen 4 an IPC of 6.44 is reported,
    while microarchitecture descriptions report only a 6-wide renamer <https://chipsandcheese.com/p/amds-zen-4-part-1-frontend-and-execution-engine>. My guess is that the front end combined some instructions (maybe
    compare and branch) into a macro-op, and the renamer then processed 6
    macro-ops that represented more instructions. The inner loop is

    │190: vpbroadcastd (%rax),%ymm0
    1.90 │ add $0x4,%rax
    │ vpand %ymm2,%ymm0,%ymm0
    1.09 │ vpcmpeqd %ymm3,%ymm0,%ymm0
    0.41 │ vpsubd %ymm0,%ymm1,%ymm1
    78.30 │ cmp %rdx,%rax
    │ jne 190

    and if the cmp and jne are combined into one macro-op, that would be
    perfect for executing one iteration per cycle.

    It's interesting that gcc's keylocks2-256 results in far fewer
    instructions (and eventually, cycles). It unrolls the inner loop 8
    times to process the keys in SIMD fashion, too, loading the keys one
    ymm register at a time. In order to do that it arranges the locks in
    8 different ymm registers in the outer loop, so the inner loop
    performs 8 sequences similar to

    vpand %ymm0,%ymm15,%ymm2
    vpcmpeqd %ymm1,%ymm2,%ymm2
    vpsubd %ymm2,%ymm4,%ymm4

    surrounded by

    300: vmovdqu (%rsi),%ymm0
    add $0x20,%rsi
    [8 3-instruction sequences]
    cmp %rsi,%rdx
    jne 300

    It also uses 8 ymm accumulators, so not all of that fits into
    registers, so three of the anded values are stored on the stack. For
    Zen4 this could be improved by using only 2 accumulators. In any
    case, the gcc people did something clever here, and I do not
    understand how they got there from the source code, and why they did
    not get there from keylocks1.c.

    For clang's keylocks3-256 the inner loop and the outer loop are each
    unrolled two times, resulting in an inner loop like:

    190: vpbroadcastd (%r12,%rbx,4),%ymm5
    vpand %ymm3,%ymm5,%ymm6
    vpand %ymm4,%ymm5,%ymm5
    vpcmpeqd %ymm1,%ymm5,%ymm5
    vpsubd %ymm5,%ymm2,%ymm2
    vpcmpeqd %ymm1,%ymm6,%ymm5
    vpsubd %ymm5,%ymm0,%ymm0
    vpbroadcastd 0x4(%r12,%rbx,4),%ymm5
    vpand %ymm4,%ymm5,%ymm6
    vpand %ymm3,%ymm5,%ymm5
    vpcmpeqd %ymm1,%ymm5,%ymm5
    vpsubd %ymm5,%ymm0,%ymm0
    vpcmpeqd %ymm1,%ymm6,%ymm5
    vpsubd %ymm5,%ymm2,%ymm2
    add $0x2,%rbx
    cmp %rbx,%rsi
    jne 190

    This results in the lowest AVX2 cycles, and I expect that one can use
    that approach without the crash problems and without adding too many cycles.
    The clang -march=x86-64-v4 results have similar code (with twice as
    much inner-loop unrolling in case of keylocks3-512), but they all only
    use AVX2 instructions and there have been successful runs on a Zen2
    (which does not support AVX-512). It seems that clang does not
    support AVX-512, or it does not understand -march=x86-64-v4 to allow
    more than AVX2.

    The least executed instructions is with gcc's keylocks2-512, where the
    inner loop is:

    230: vpbroadcastd 0x4(%rax),%zmm4
    vpbroadcastd (%rax),%zmm0
    mov %edx,%r10d
    add $0x8,%rax
    add $0x2,%edx
    vpandd %zmm4,%zmm8,%zmm5
    vpandd %zmm0,%zmm8,%zmm9
    vpandd %zmm4,%zmm6,%zmm4
    vptestnmd %zmm5,%zmm5,%k1
    vpandd %zmm0,%zmm6,%zmm0
    vmovdqa32 %zmm7,%zmm5{%k1}{z}
    vptestnmd %zmm9,%zmm9,%k1
    vmovdqa32 %zmm3,%zmm9{%k1}{z}
    vptestnmd %zmm4,%zmm4,%k1
    vpsubd %zmm9,%zmm5,%zmm5
    vpaddd %zmm5,%zmm2,%zmm2
    vmovdqa32 %zmm7,%zmm4{%k1}{z}
    vptestnmd %zmm0,%zmm0,%k1
    vmovdqa32 %zmm3,%zmm0{%k1}{z}
    vpsubd %zmm0,%zmm4,%zmm0
    vpaddd %zmm0,%zmm1,%zmm1
    cmp %r10d,%r8d
    jne 230

    Due to UNROLL=32, it deals with 2 zmm registers coming from the outer
    loop at a time, and the inner loop is unrolled by a factor of 2, too.
    It uses vptestnmd and a predicated vmovdqa32 instead of using vpcmpeqd
    (why?). Anyway, the code seems to rub Zen4 the wrong way, and it
    performs only at 2.84 IPC, worse than the AVX2 code. Rocket Lake
    performs slightly better, but still, the clang code for keylocks2-512
    runs a bit faster without using AVX-512.

    I also saw one case where the compiler botched it:

    gcc -Wall -DUNROLL=16 -O3 -mavx2 -c keylocks3.c

    [/tmp/keylock:155546] LC_NUMERIC=prog perf stat -e cycles -e instructions keylocks3-256
    603800000

    Performance counter stats for 'keylocks3-256':

    17_476_700_581 cycles
    39_480_242_683 instructions # 2.26 insn per cycle

    3.506995312 seconds time elapsed

    3.507020000 seconds user
    0.000000000 seconds sys

    (cycles and timings on the 8700G). Here the compiler failed to
    vectorize the comparison, and performed them using scalar instructions
    (first extracting the data from the SIMD registers, and finally
    inserting the result into SIMD registers, with additional overhead
    from spilling registers). The result requires about 10 times more
    instructions than the UNROLL=8 variant and almost 20 times more
    cycles.

    On to timings per routine invocation:

    On a 4.4GHz Haswell (whereas Michael S. measured a 4GHz Haswell):
    5.47us clang keylocks1-256 (5.66us for Michael S.'s "original code")
    4.26us gcc keylocks1-256 (5.66us for Michael S.'s "original code")
    2.38us gcc keylocks2-256 (2.18us for Michael S.'s manually vectorized code)
    2.08us clang keylocks2-512 (2.18us for Michael S.'s manually vectorized code)

    Michael S.'s "original code" performs similar on clang to my
    keylocks1.c. clang's keylocks2-512 code is quite competetive with his
    manual code.

    On the Golden Cove of a Core i3-1315U (compared to the best result by
    Terje Mathisen on a Core i7-1365U; the latter can run up to 5.2GHz
    according to Intel, whereas the former can supposedly run up to
    4.5GHz; I only ever measured at most 3.8GHz on our NUC, and this time
    as well):

    5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
    4.93us clang keylocks1-256 on a 3.8GHz 1315U
    4.17us gcc keylocks1-256 on a 3.8GHz 1315U
    3.16us gcc keylocks2-256 on a 3.8GHz 1315U
    2.38us clang keylocks2-512 on a 3.8GHz 1315U

    I would have expected the clang keylocks1-256 to run slower, because
    the compiler back-end is the same and the 1315U is slower. Measuring
    cycles looks more relevant for this benchmark to me than measuring
    time, especially on this core where AVX-512 is disabled and there is
    no AVX slowdown.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anssi Saari@21:1/5 to BGB on Sat Feb 8 14:30:21 2025
    BGB <cr88192@gmail.com> writes:

    If one runs their CPU at 4 GHz, then under multi-threaded load, it may
    hit 70C or so, frequency starts jumping all over (as it tries to keep temperature under control), and sometimes the computer will crash.

    To me that sounds more like inadequate cooling. When I last upgraded my
    desktop, I ran the Prime95 torture test for a few hours. Temperature
    stabilized below where the CPU cores would throttle to cool down, which
    was expected. No crashes or wrong results either. This is just a basic
    game-capable AMD Ryzen 5 5600X 6-core system; I don't have logs of the
    clock speeds or temps from back then.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Sat Feb 8 13:36:29 2025
    MitchAlsup1 wrote:
    On Fri, 7 Feb 2025 15:04:23 +0000, Michael S wrote:
      res += _mm256_extract_epi32(res0, 0);
      res += _mm256_extract_epi32(res0, 4);
      return res;

    Simple question:: how would you port this code to a machine
    with a different SIMD instruction set ??

    Years ago I solved this problem for an optimized Ogg Vorbis decoder:

    I wrote a set of #defines which wrapped MMX/SSE intrinsics on the x86
    side and Motorola's more capable Altivec instructions on the Apple side.

    I had to limit myself a tiny bit in a couple of places, as well as
    expanding a Motorola operation into a pair of SSE intrinsics, but the resulting code still ran faster than all commercially available libraries
    on both platforms.

    If/when those intrinsics diverge more, then the problem would be significantly harder, but back then both Altivec and SSE used 128-bit registers.
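
    A minimal sketch of that kind of wrapper layer (hypothetical macro
    names; the real decoder's wrappers covered far more operations): one
    header maps a common set of names onto SSE2 intrinsics or onto AltiVec,
    and the hot code is written entirely in terms of the common names.

    #if defined(__SSE2__)
      #include <emmintrin.h>
      typedef __m128i           vec_s32x4;
      #define VADD32(a, b)      _mm_add_epi32((a), (b))
      #define VAND(a, b)        _mm_and_si128((a), (b))
      #define VZERO()           _mm_setzero_si128()
    #elif defined(__ALTIVEC__)
      #include <altivec.h>
      typedef vector signed int vec_s32x4;
      #define VADD32(a, b)      vec_add((a), (b))
      #define VAND(a, b)        vec_and((a), (b))
      #define VZERO()           ((vec_s32x4)vec_splat_s32(0))
    #endif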

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Sat Feb 8 13:24:25 2025
    Thanks a lot Anton and Michael S!

    With your access to multiple CPUs and compilers you got a lot more out
    of this tiny micro-benchmark than I ever expected.

    :-)


    Terje

    Anton Ertl wrote:
    Michael S <already5chosen@yahoo.com> writes:
    In the mean time.
    I did few measurements on Xeon E3 1271 v3. That is rather old uArch -
    Haswell, the first core that supports AVX2. During the tests it was
    running at 4.0 GHz.

1. Original code (rewritten in plain C) compiled with clang -O3 -march=ivybridge (no AVX2)
2. Original code (rewritten in plain C) compiled with clang -O3 -march=haswell (AVX2)
3. Manually vectorized AVX2 code compiled with clang -O3 -march=skylake (AVX2)

    Results were as following (usec/call)
    1 - 5.66
    2 - 5.56
    3 - 2.18

    In the meantime, I also wrote the original code in plain C
    (keylocks1.c), then implemented your idea of unrolling the outer loop
    and comparing a subarray of locks to each key (is this called strip
    mining?) in plain C (with the hope that auto-vectorization works) as keylocks2.c, and finally rewrote the latter version to use gcc vector extensions (keylocks3.c). I wrote a dummy main around that that calls
    the routine 100_000 times; given that the original routine's
    performance does not depend on the data, and I used non-0 keys (so keylocks[23].c does not skip any keys), the actual data is not
    important.
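
A minimal sketch of such a driver (the array sizes and contents below are
illustrative, not the ones actually used; the real files are at the URL in
the next paragraph):

#include <stdio.h>

#define NKEYS  250
#define NLOCKS 250

unsigned keylocks(unsigned keys[], unsigned nkeys, unsigned locks[], unsigned nlocks);

int main(void)
{
  static unsigned keys[NKEYS], locks[NLOCKS];
  unsigned long sum = 0;
  unsigned i;
  for (i=0; i<NKEYS; i++)  keys[i]  = i+1;   /* non-0 keys */
  for (i=0; i<NLOCKS; i++) locks[i] = i+3;
  for (i=0; i<100000; i++)
    sum += keylocks(keys, NKEYS, locks, NLOCKS);
  printf("%lu\n", sum);
  return 0;
}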

    You can find the source code and the binaries I measured at <http://www.complang.tuwien.ac.at/anton/keylock/>. The binaries were compiled with gcc 12.2.0 and (in the clang subdirectory) clang-14.0.6;
    the clang compilations sometimes used different UNROLL factors than
    the gcc compilations (and I am unsure, which, see below).

    The original code is:

unsigned keylocks(unsigned keys[], unsigned nkeys, unsigned locks[], unsigned nlocks)
{
  unsigned i, j;
  unsigned part1 = 0;
  for (i=0; i<nlocks; i++) {
    unsigned lock = locks[i];
    for (j=0; j<nkeys; j++)
      part1 += (lock & keys[j])==0;
  }
  return part1;
}

    For keylocks2.c the central loops are:

for (i=0; i<UNROLL; i++)
  part0[i]=0;
for (i=0; i<nlocks1; i+=UNROLL) {
  for (j=0; j<nkeys1; j++) {
    unsigned key = keys1[j];
    for (k=0; k<UNROLL; k++)
      part0[k] += (locks1[i+k] & key)==0;
  }
}

    For UNROLL I tried 8, 16, and 32 for AVX2 and 16, 32, or 64 for
    AVX-512; the numbers below are for those factors that produce the
    lowest cycles on the Rocket Lake machine.

The central loops are preceded by code to arrange the data such that
this code works: the locks are copied to the longer locks1; the length of
locks1 is a multiple of UNROLL, and the entries beyond nlocks are ~0
(to increase the count by 0); and the keys are copied to keys1 (with 0
removed so that the extra locks are not counted, which also may
increase efficiency if there is a key=0). The central loops are
followed by summing up the elements of part0.
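
A sketch of what that preparation might look like (reconstructed from the
description above, not the actual code; the UNROLL default here is just an
example value):

#include <stdlib.h>
#include <string.h>

#ifndef UNROLL
#define UNROLL 8   /* example value */
#endif

void arrange(const unsigned *locks, unsigned nlocks,
             const unsigned *keys, unsigned nkeys,
             unsigned **locks1p, unsigned *nlocks1p,
             unsigned **keys1p, unsigned *nkeys1p)
{
  unsigned i, j;
  unsigned nlocks1 = (nlocks+UNROLL-1)/UNROLL*UNROLL;
  unsigned *locks1 = malloc(nlocks1*sizeof(unsigned));
  unsigned *keys1  = malloc(nkeys*sizeof(unsigned));
  unsigned nkeys1  = 0;
  memcpy(locks1, locks, nlocks*sizeof(unsigned));
  for (i=nlocks; i<nlocks1; i++)
    locks1[i] = ~0U;          /* padding locks match no non-0 key: count += 0 */
  for (j=0; j<nkeys; j++)
    if (keys[j] != 0)         /* a 0 key would count the padding locks */
      keys1[nkeys1++] = keys[j];
  *locks1p = locks1; *nlocks1p = nlocks1;
  *keys1p  = keys1;  *nkeys1p  = nkeys1;
}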

    keylocks3.c, which uses the gcc vector extensions, just changes
    keylocks2.c in a few places. In particular, it adds a type vu:

    typedef unsigned vu __attribute__ ((vector_size (UNROLL*sizeof(unsigned))));

    The central loops now look as follows:

for (i=0; i<UNROLL; i++)
  part0[i]=0;
for (i=0; i<nlocks1; i+=UNROLL) {
  vu lock = *(vu *)(locks1+i);
  for (j=0; j<nkeys1; j++) {
    part0 -= (lock & keys1[j])==0;
  }
}

    One interesting aspect of the gcc vector extensions is that the result
    of comparing two vectors is 0 (false) or ~0 (true) (per element),
    whereas for scalars the value for true is 1. Therefore the code above updates part0 with -=, whereas in keylocks2.c += is used.

    While the use of ~0 is a good choice when designing a new programming language, I would have gone for 1 in the case of a vector extension
    for C, for consistency with the scalar case; in combination with
    hardware that produces ~0 (e.g., Intel SSE and AVX SIMD stuff), that
    means that the compiler will introduce a negation in its intermediate representation at some point; I expect that compilers will usually be
    able to optimize this negation away, but I would not be surprised at
    cases where my expectation is disappointed.
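
As a small standalone illustration of that semantics (this function is made
up for this example; it follows the same pattern as keylocks3.c, with an
8-element vector):

typedef unsigned vu __attribute__ ((vector_size (8*sizeof(unsigned))));

unsigned count_zero_lanes(vu lock, unsigned key)
{
  vu acc = {0};
  unsigned i, sum = 0;
  acc -= (lock & key)==0;   /* each lane of the comparison is 0 or ~0 */
  for (i=0; i<8; i++)       /* horizontal sum: one count per matching lane */
    sum += acc[i];
  return sum;
}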

keylocks3.c compiles without warning on clang, but the result usually segfaults (but sometimes does not, e.g., in the timed run on Zen4; it segfaults in other runs on Zen4). I have not investigated why this
    happens, I just did not include results from runs where it segfaulted;
    and I tried additional runs for keylocks3-512 on Zen4 in order to have
    one result there.

    I would have liked to compare the performance of my code against your
    code, but your code apparently was destroyed by arbitrary line
    breaking in your news-posting software. Anyway, here are my results.
    First cycles (which eliminates worries about turbo modes) and
    instructions, then usec/call.

    The cores are:

    Haswell: Core i7-4790K (similar to Michael S.'s CPU)
Golden Cove: Core i3-1315U (same P-core as Terje Mathisen's laptop)
Gracemont: Core i3-1315U (same E-core as Terje Mathisen's laptop)
    Rocket Lake: Xeon W-1370P (2021 Intel CPU)
    Zen4: Ryzen 7 8700G (2024)

    I also measured Tiger Lake (Core i5-1135G7, the CPU of a 2021 laptop),
    but the results were very close to the Rocket Lake results, so because
    of the limited table width, I do not show them.

    The first three cores do not support AVX-512, the others do.

    Cycles:
              Haswell   Golden Cove     Gracemont   Rocket Lake          Zen4
          1818_241431   1433_208539   1778_675623 2_365_664_737 1_677_853_186  gcc avx2 1
          1051_191216   1189_869807   1872_856423   981_948_517   727_418_069  gcc avx2 2 8
          1596_783872   1213_400891   2076_677426 1_690_280_182   913_746_088  gcc avx2 3 8
          2195_438821   1638_006130   2577_451872 2_291_743_879 1_617_970_157  clang avx2 1
          2757_454335   2151_198125   2506_427284 3_174_899_185 1_523_870_829  clang avx2 2 8?
                                                                  638_374_463  clang avx2 3 8?
                                                  1_139_175_457 1_219_164_672  gcc 512 1
                                                    856_818_642   900_108_135  gcc 512 2 32
                                                    866_077_594 1_072_172_449  gcc 512 3 16
                                                  2_479_213_408 1_479_937_930  clang 512 1
           912_273706    936_311567                 847_289_380   634_826_441  clang 512 2 16?
                                                                  636_278_210  clang 512 3 16?

    avx2 means: compiled with -mavx2; 512 means: compiled with
    -march=x86-64-v4 (I usually did not measure those on machines that do
    not support AVX-512, because I expected the results to not work; I
    later measured some clang's keylocks2-512 on some of those machines).
The number behind that is the keylocks[123].c variant, and the number
    behind that (if present) the UNROLL parameter. I am not sure about
    the UNROLL numbers used for clang, but in any case I kept what
    performed best on Rocket Lake. The number of instructions executed is (reported on the Zen4):

    instructions
    5_779_542_242 gcc avx2 1
    3_484_942_148 gcc avx2 2 8
    5_885_742_164 gcc avx2 3 8
    7_903_138_230 clang avx2 1
    7_743_938_183 clang avx2 2 8?
    3_625_338_104 clang avx2 3 8?
    4_204_442_194 gcc 512 1
    2_564_142_161 gcc 512 2 32
    3_061_042_178 gcc 512 3 16
    7_703_938_205 clang 512 1
    3_402_238_102 clang 512 2 16?
    3_320_455_741 clang 512 3 16?

    for gcc -mavx2 on keylocks3.c on Zen 4 an IPC of 6.44 is reported,
    while microarchitecture descriptions report only a 6-wide renamer <https://chipsandcheese.com/p/amds-zen-4-part-1-frontend-and-execution-engine>.
    My guess is that the front end combined some instructions (maybe
    compare and branch) into a macro-op, and the renamer then processed 6 macro-ops that represented more instructions. The inner loop is

    │190: vpbroadcastd (%rax),%ymm0
    1.90 │ add $0x4,%rax
    │ vpand %ymm2,%ymm0,%ymm0
    1.09 │ vpcmpeqd %ymm3,%ymm0,%ymm0
    0.41 │ vpsubd %ymm0,%ymm1,%ymm1
    78.30 │ cmp %rdx,%rax
    │ jne 190

    and if the cmp and jne are combined into one macro-op, that would be
    perfect for executing one iteration per cycle.

It's interesting that gcc's keylocks2-256 results in far fewer
    instructions (and eventually, cycles). It unrolls the inner loop 8
    times to process the keys in SIMD fashion, too, loading the keys one
    ymm register at a time. In order to do that it arranges the locks in
    8 different ymm registers in the outer loop, so the inner loop
    performs 8 sequences similar to

    vpand %ymm0,%ymm15,%ymm2
    vpcmpeqd %ymm1,%ymm2,%ymm2
    vpsubd %ymm2,%ymm4,%ymm4

    surrounded by

    300: vmovdqu (%rsi),%ymm0
    add $0x20,%rsi
    [8 3-instruction sequences]
    cmp %rsi,%rdx
    jne 300

    It also uses 8 ymm accumulators, so not all of that fits into
    registers, so three of the anded values are stored on the stack. For
    Zen4 this could be improved by using only 2 accumulators. In any
    case, the gcc people did something clever here, and I do not
    understand how they got there from the source code, and why they did
    not get there from keylocks1.c.

    For clang's keylocks3-256 the inner loop and the outer loop are each
unrolled two times, resulting in an inner loop like:

    190: vpbroadcastd (%r12,%rbx,4),%ymm5
    vpand %ymm3,%ymm5,%ymm6
    vpand %ymm4,%ymm5,%ymm5
    vpcmpeqd %ymm1,%ymm5,%ymm5
    vpsubd %ymm5,%ymm2,%ymm2
    vpcmpeqd %ymm1,%ymm6,%ymm5
    vpsubd %ymm5,%ymm0,%ymm0
    vpbroadcastd 0x4(%r12,%rbx,4),%ymm5
    vpand %ymm4,%ymm5,%ymm6
    vpand %ymm3,%ymm5,%ymm5
    vpcmpeqd %ymm1,%ymm5,%ymm5
    vpsubd %ymm5,%ymm0,%ymm0
    vpcmpeqd %ymm1,%ymm6,%ymm5
    vpsubd %ymm5,%ymm2,%ymm2
    add $0x2,%rbx
    cmp %rbx,%rsi
    jne 190

    This results in the lowest AVX2 cycles, and I expect that one can use
    that approach without crash problems without adding too many cycles.
    The clang -march=x86-64-v4 results have similar code (with twice as
    much inner-loop unrolling in case of keylocks3-512), but they all only
    use AVX2 instructions and there have been successful runs on a Zen2
    (which does not support AVX-512). It seems that clang does not
    support AVX-512, or it does not understand -march=x86-64-v4 to allow
    more than AVX2.

The fewest instructions are executed with gcc's keylocks2-512, where the
    inner loop is:

    230: vpbroadcastd 0x4(%rax),%zmm4
    vpbroadcastd (%rax),%zmm0
    mov %edx,%r10d
    add $0x8,%rax
    add $0x2,%edx
    vpandd %zmm4,%zmm8,%zmm5
    vpandd %zmm0,%zmm8,%zmm9
    vpandd %zmm4,%zmm6,%zmm4
    vptestnmd %zmm5,%zmm5,%k1
    vpandd %zmm0,%zmm6,%zmm0
    vmovdqa32 %zmm7,%zmm5{%k1}{z}
    vptestnmd %zmm9,%zmm9,%k1
    vmovdqa32 %zmm3,%zmm9{%k1}{z}
    vptestnmd %zmm4,%zmm4,%k1
    vpsubd %zmm9,%zmm5,%zmm5
    vpaddd %zmm5,%zmm2,%zmm2
    vmovdqa32 %zmm7,%zmm4{%k1}{z}
    vptestnmd %zmm0,%zmm0,%k1
    vmovdqa32 %zmm3,%zmm0{%k1}{z}
    vpsubd %zmm0,%zmm4,%zmm0
    vpaddd %zmm0,%zmm1,%zmm1
    cmp %r10d,%r8d
    jne 230

    Due to UNROLL=32, it deals with 2 zmm registers coming from the outer
    loop at a time, and the inner loop is unrolled by a factor of 2, too.
    It uses vptestnmd and a predicated vmovdqa32 instead of using vpcmpeqd (why?). Anyway, the code seems to rub Zen4 the wrong way, and it
    performs only at 2.84 IPC, worse than the AVX2 code. Rocket Lake
    performs slightly better, but still, the clang code for keylocks2-512
    runs a bit faster without using AVX-512.
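
For comparison, here is a hedged sketch of how one lane-count update could
be written with a compare-into-mask plus a masked subtract instead of
vptestnmd and a zero-masking vmovdqa32; this is only an illustration of the
alternative, not code that any of the compilers generated:

#include <immintrin.h>

static __m512i count_update(__m512i acc, __m512i lock, __m512i key)
{
  __m512i anded   = _mm512_and_epi32(lock, key);
  __mmask16 zeros = _mm512_cmpeq_epi32_mask(anded, _mm512_setzero_si512());
  /* in the selected lanes: acc - (-1) == acc + 1; other lanes unchanged */
  return _mm512_mask_sub_epi32(acc, zeros, acc, _mm512_set1_epi32(-1));
}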

    I also saw one case where the compiler botched it:

    gcc -Wall -DUNROLL=16 -O3 -mavx2 -c keylocks3.c

    [/tmp/keylock:155546] LC_NUMERIC=prog perf stat -e cycles -e instructions keylocks3-256
    603800000

    Performance counter stats for 'keylocks3-256':

    17_476_700_581 cycles
    39_480_242_683 instructions # 2.26 insn per cycle

    3.506995312 seconds time elapsed

    3.507020000 seconds user
    0.000000000 seconds sys

    (cycles and timings on the 8700G). Here the compiler failed to
    vectorize the comparison, and performed them using scalar instructions
    (first extracting the data from the SIMD registers, and finally
    inserting the result into SIMD registers, with additional overhead
    from spilling registers). The result requires about 10 times more instructions than the UNROLL=8 variant and almost 20 times more
    cycles.

    On to timings per routine invocation:

On a 4.4GHz Haswell (whereas Michael S. measured a 4GHz Haswell):
5.47us clang keylocks1-256 (5.66us for Michael S.'s "original code")
4.26us gcc keylocks1-256 (5.66us for Michael S.'s "original code")
2.38us gcc keylocks2-256 (2.18us for Michael S.'s manual vectorized code)
2.08us clang keylocks2-512 (2.18us for Michael S.'s manual vectorized code)

Michael S.'s "original code" performs similarly on clang to my
keylocks1.c. clang's keylocks2-512 code is quite competitive with his
manual code.

    On the Golden Cove of a Core i3-1315U (compared to the best result by
    Terje Mathisen on a Core i7-1365U; the latter can run up to 5.2GHz
    according to Intel, whereas the former can supposedly run up to
    4.5GHz; I only ever measured at most 3.8GHz on our NUC, and this time
    as well):

5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
4.93us clang keylocks1-256 on a 3.8GHz 1315U
    4.17us gcc keylocks1-256 on a 3.8GHz 1315U
    3.16us gcc keylocks2-256 on a 3.8GHz 1315U
    2.38us clang keylocks2-512 on a 3.8GHz 1315U

    I would have expected the clang keylocks1-256 to run slower, because
    the compiler back-end is the same and the 1315U is slower. Measuring
    cycles looks more relevant for this benchmark to me than measuring
    time, especially on this core where AVX-512 is disabled and there is
    no AVX slowdown.

    - anton



    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Sat Feb 8 20:02:49 2025
    On Fri, 7 Feb 2025 22:27:03 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:



    Simple question:: how would you port this code to a machine
    with a different SIMD instruction set ??


    In this case, one possibility is to write it differently in the first
    place.

Maybe I'd use a completely different programming model, like implicit
SPMD. It was promoted by the Intel ispc compiler. I tried it once; it
works and it's cross-platform. I am not sure that this specific
compiler (ispc) is still supported, but there should be others of the
same kind.

    Or I can use some sort of SIMD portable toolkit. There are several,
    with somewhat different levels of abstraction. The developer of one of
them used to post on the RWT forum.
The gcc vector extensions, mentioned in the post above by Anton Ertl, are
another ideologically similar possibility.

For me, personally, all those options are harder to use than intrinsic
functions. So, there is the third and most likely way - to learn the
new set of intrinsic functions and to rewrite manually. The routines
tend to be small, so it's likely faster than fighting with tools. And
it's not like I would ever have to port to a dozen different SIMD
architectures in 3 months.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Sat Feb 8 19:21:19 2025
    On Sat, 08 Feb 2025 08:11:04 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    In the mean time.
I did few measurements on Xeon E3 1271 v3. That is rather old uArch -
Haswell, the first core that supports AVX2. During the tests it was
running at 4.0 GHz.

1. Original code (rewritten in plain C) compiled with clang -O3 -march=ivybridge (no AVX2)
2. Original code (rewritten in plain C) compiled with clang -O3 -march=haswell (AVX2)
3. Manually vectorized AVX2 code compiled with clang -O3 -march=skylake (AVX2)

    Results were as following (usec/call)
    1 - 5.66
    2 - 5.56
    3 - 2.18

    In the meantime, I also wrote the original code in plain C
    (keylocks1.c), then implemented your idea of unrolling the outer loop
    and comparing a subarray of locks to each key (is this called strip
    mining?) in plain C (with the hope that auto-vectorization works) as keylocks2.c, and finally rewrote the latter version to use gcc vector extensions (keylocks3.c). I wrote a dummy main around that that calls
    the routine 100_000 times; given that the original routine's
    performance does not depend on the data, and I used non-0 keys (so keylocks[23].c does not skip any keys), the actual data is not
    important.


    I used keys filled with pseudo-random bits with probability of 0 =
    0.705. The probability was chosen to get final results similar to
    Terje's.


    You can find the source code and the binaries I measured at <http://www.complang.tuwien.ac.at/anton/keylock/>. The binaries were compiled with gcc 12.2.0 and (in the clang subdirectory) clang-14.0.6;
    the clang compilations sometimes used different UNROLL factors than
    the gcc compilations (and I am unsure, which, see below).

    The original code is:

    unsigned keylocks(unsigned keys[], unsigned nkeys, unsigned locks[],
    unsigned nlocks) {
    unsigned i, j;
    unsigned part1 = 0;
    for (i=0; i<nlocks; i++) {
    unsigned lock = locks[i];
    for (j=0; j<nkeys; j++)
    part1 += (lock & keys[j])==0;
    }
    return part1;
    }

    For keylocks2.c the central loops are:

    for (i=0; i<UNROLL; i++)
    part0[i]=0;
    for (i=0; i<nlocks1; i+=UNROLL) {
    for (j=0; j<nkeys1; j++) {
    unsigned key = keys1[j];
    for (k=0; k<UNROLL; k++)
    part0[k] += (locks1[i+k] & key)==0;
    }
    }

    For UNROLL I tried 8, 16, and 32 for AVX2 and 16, 32, or 64 for
    AVX-512; the numbers below are for those factors that produce the
    lowest cycles on the Rocket Lake machine.

    The central loops are preceded by code to arrange the data such that
    this code works: locks are copied to the longer locks1; the length of
    locks1 is a multiple of UNROLL, and the entries beyond nlocks are ~0
    to increase the count by 0) and the keys are copies to keys1 (with 0
    removed so that the extra locks are not counted, and that also may
    increase efficiency if there is a key=0). The central loops are
    followed by summing up the elements of part0.

    keylocks3.c, which uses the gcc vector extensions, just changes
    keylocks2.c in a few places. In particular, it adds a type vu:

    typedef unsigned vu __attribute__ ((vector_size
    (UNROLL*sizeof(unsigned))));

    The central loops now look as follows:

    for (i=0; i<UNROLL; i++)
    part0[i]=0;
    for (i=0; i<nlocks1; i+=UNROLL) {
    vu lock = *(vu *)(locks1+i);
    for (j=0; j<nkeys1; j++) {
    part0 -= (lock & keys1[j])==0;
    }

    One interesting aspect of the gcc vector extensions is that the result
    of comparing two vectors is 0 (false) or ~0 (true) (per element),
    whereas for scalars the value for true is 1. Therefore the code above updates part0 with -=, whereas in keylocks2.c += is used.

    While the use of ~0 is a good choice when designing a new programming language, I would have gone for 1 in the case of a vector extension
    for C, for consistency with the scalar case; in combination with
    hardware that produces ~0 (e.g., Intel SSE and AVX SIMD stuff), that
    means that the compiler will introduce a negation in its intermediate representation at some point; I expect that compilers will usually be
    able to optimize this negation away, but I would not be surprised at
    cases where my expectation is disappointed.

    keylocks3.c compiles without warning on clang, but the result usually segfaults (but sometime does not, e.g., in the timed run on Zen4; it segfaults in other runs on Zen4). I have not investigated why this
    happens, I just did not include results from runs where it segfaulted;
    and I tried additional runs for keylocks3-512 on Zen4 in order to have
    one result there.

    I would have liked to compare the performance of my code against your
    code, but your code apparently was destroyed by arbitrary line
    breaking in your news-posting software.

    Or by my own pasting mistake. I am still not sure whom to blame.
The mistake was tiny - absence of // at the beginning of one line, but
    enough to not compile. Trying it for a second time:

    #include <stdint.h>
    #include <immintrin.h>

int foo_tst(const uint32_t* keylocks, int len, int li)
{
  if (li >= len || li <= 0)
    return 0;
  const uint32_t* keyx = &keylocks[li];
  unsigned ni = len - li;
  __m256i res0 = _mm256_setzero_si256();
  __m256i res1 = _mm256_setzero_si256();
  __m256i res2 = _mm256_setzero_si256();
  __m256i res3 = _mm256_setzero_si256();
  const uint32_t* keyx_last = &keyx[ni & -32];
  for (; keyx != keyx_last; keyx += 32) {
    __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyx[0*8]);
    __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyx[1*8]);
    __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyx[2*8]);
    __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyx[3*8]);
    for (const uint32_t* keyy = keylocks; keyy != &keylocks[li]; ++keyy) {
      __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyy));
      res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
      res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
      res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
      res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
    }
  }
  int res = 0;
  if (ni % 32) {
    uint32_t tmp[32];
    const uint32_t* keyy_last = &keylocks[li & -32];
    if (li % 32) {
      for (int k = 0; k < li % 32; ++k)
        tmp[k] = keyy_last[k];
      for (int k = li % 32; k < 32; ++k)
        tmp[k] = (uint32_t)-1;
    }
    const uint32_t* keyx_last = &keyx[ni % 32];
    int nz = 0;
    for (; keyx != keyx_last; keyx += 1) {
      if (*keyx) {
        __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyx));
        for (const uint32_t* keyy = keylocks; keyy != keyy_last; keyy += 32) {
          __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyy[0*8]);
          __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyy[1*8]);
          __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyy[2*8]);
          __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyy[3*8]);
          res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
          res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
          res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
          res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
        if (li % 32) {
          __m256i lock0 = _mm256_loadu_si256((const __m256i*)&tmp[0*8]);
          __m256i lock1 = _mm256_loadu_si256((const __m256i*)&tmp[1*8]);
          __m256i lock2 = _mm256_loadu_si256((const __m256i*)&tmp[2*8]);
          __m256i lock3 = _mm256_loadu_si256((const __m256i*)&tmp[3*8]);
          res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
          res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
          res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
          res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
      } else {
        nz += 1;
      }
    }
    res = nz * li;
  }
  // fold accumulators
  res0 = _mm256_add_epi32(res0, res2);
  res1 = _mm256_add_epi32(res1, res3);
  res0 = _mm256_add_epi32(res0, res1);
  res0 = _mm256_hadd_epi32(res0, res0);
  res0 = _mm256_hadd_epi32(res0, res0);

  res += _mm256_extract_epi32(res0, 0);
  res += _mm256_extract_epi32(res0, 4);
  return res;
}


    Anyway, here are my results.
    First cycles (which eliminates worries about turbo modes) and
    instructions, then usec/call.


    I don't understand that.
    For original code optimized by clang I'd expect 22,000 cycles and 5.15
usec per call on Haswell. Your numbers don't even resemble anything like
    that.

    The cores are:


    <snip>

    The number of instructions executed is
    (reported on the Zen4):

    instructions
    5_779_542_242 gcc avx2 1
    3_484_942_148 gcc avx2 2 8
    5_885_742_164 gcc avx2 3 8
    7_903_138_230 clang avx2 1
    7_743_938_183 clang avx2 2 8?
    3_625_338_104 clang avx2 3 8?
    4_204_442_194 gcc 512 1
    2_564_142_161 gcc 512 2 32
    3_061_042_178 gcc 512 3 16
    7_703_938_205 clang 512 1
    3_402_238_102 clang 512 2 16?
    3_320_455_741 clang 512 3 16?


    I don't understand these numbers either. For original clang, I'd expect
    25,000 instructions per call. Or 33,000 if, unlike in Terje's Rust
    case, your clang generates RISC-style sequences. Your number is somehow 240,000 times bigger.

<snip>

    On to timings per routine invocation:

    On a 4.4Ghz Haswell (whereas Michael S. measured a 4GHz Haswell):
    5.47us clang keylocks1-256 (5.66us for Michael S.'s "original code")
    4.26us gcc keylocks1-256 (5.66us for Michael S.'s "original code")
    2.38us gcc keylocks2-256 (2.18us for Michael S.'s manual vectorized
    code) 2.08us clang keylocks2-512 (2.18us for Michael S.'s manual
    vectorized code)

    Michael S.'s "original code" performs similar on clang to my
    keylocks1.c. clang's keylocks2-512 code is quite competetive with his
    manual code.


    Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
Which could be due to differences in measurement methodology - I
reported the median of 11 runs, you seem to report the average.

    On the Golden Cove of a Core i3-1315U (compared to the best result by
    Terje Mathisen on a Core i7-1365U; the latter can run up to 5.2GHz
    according to Intel, whereas the former can supposedly run up to
    4.5GHz; I only ever measured at most 3.8GHz on our NUC, and this time
    as well):


I always thought that NUCs have better cooling than all but high-end
laptops. Was I wrong? Such slowness is disappointing.

    5.25us Terje Mathisen's Rust code compiled by clang (best on the
    1365U) 4.93us clang keylocks1-256 on a 3.8GHz 1315U
    4.17us gcc keylocks1-256 on a 3.8GHz 1315U
    3.16us gcc keylocks2-256 on a 3.8GHz 1315U
    2.38us clang keylocks2-512 on a 3.8GHz 1315U


So, for the best-performing variant the IPC of Golden Cove is identical to
ancient Haswell's? That's very disappointing. Haswell has a 4-wide front
end and the majority of AVX2 integer instructions are limited to a throughput
of two per clock. Golden Cove has a 5+ wide front end and nearly all AVX2
integer instructions have a throughput of three per clock.
Could it be that clang introduced some sort of latency bottleneck? Maybe
a single accumulator? If that is the case, my code should run
about the same as clang's on the resource-starved Haswell, but measurably
faster on Golden Cove.

    I would have expected the clang keylocks1-256 to run slower, because
    the compiler back-end is the same and the 1315U is slower. Measuring
    cycles looks more relevant for this benchmark to me than measuring
    time, especially on this core where AVX-512 is disabled and there is
    no AVX slowdown.


I prefer time, because in the end it's the only thing that matters.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Sat Feb 8 17:46:32 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Sat, 08 Feb 2025 08:11:04 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Or by my own pasting mistake. I am still not sure whom to blame.
The mistake was tiny - absence of // at the beginning of one line, but
    enough to not compile. Trying it for a second time:

    Now it's worse, it's quoted-printable. E.g.:

    if (li >=3D len || li <=3D 0)

    Some newsreaders can decode this, mine does not.

    First cycles (which eliminates worries about turbo modes) and
    instructions, then usec/call.
    =20

    I don't understand that.
    For original code optimized by clang I'd expect 22,000 cycles and 5.15
    usec per call on Haswell. You numbers don't even resamble anything like
    that.

    My cycle numbers are for the whole program that calls keylocks()
    100_000 times.

    If you divide the cycles by 100000, you get 21954 for clang
    keylocks1-256, which is what you expect.

    instructions
    5_779_542_242 gcc avx2 1 =20
    3_484_942_148 gcc avx2 2 8=20
    5_885_742_164 gcc avx2 3 8=20
    7_903_138_230 clang avx2 1 =20
    7_743_938_183 clang avx2 2 8?
    3_625_338_104 clang avx2 3 8?=20
    4_204_442_194 gcc 512 1 =20
    2_564_142_161 gcc 512 2 32
    3_061_042_178 gcc 512 3 16
    7_703_938_205 clang 512 1 =20
    3_402_238_102 clang 512 2 16?
    3_320_455_741 clang 512 3 16?
    =20

I don't understand these numbers either. For original clang, I'd expect 25,000 instructions per call.

    clang keylocks1-256 performs 79031 instructions per call (divide the
    number given by 100000 calls). If you want to see why that is, you
    need to analyse the code produced by clang, which I did only for
    select cases.

    Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
    Which could be due to differences in measurements methodology - I
    reported median of 11 runs, you seems to report average.

    I just report one run with 100_000 calls, and just hope that the
    variation is small:-) In my last refereed paper I use 30 runs and
    median, but I don't go to these lengths here; the cycles seem pretty repeatable.

    On the Golden Cove of a Core i3-1315U (compared to the best result by
    Terje Mathisen on a Core i7-1365U; the latter can run up to 5.2GHz
    according to Intel, whereas the former can supposedly run up to
    4.5GHz; I only ever measured at most 3.8GHz on our NUC, and this time
    as well):
    =20

I always thought that NUCs have better cooling than all but high-end laptops. Was I wrong? Such slowness is disappointing.

    The cooling may be better or not, that does not come into play here,
    as it never reaches higher clocks, even when it's cold; E-cores also
    stay 700MHz below their rated turbo speed, even when it's the only
    loaded core. One theory I have is that one option we set up in the
    BIOS has the effect of limiting turbo speed, but it has not been
    important enough to test.

    5.25us Terje Mathisen's Rust code compiled by clang (best on the
    1365U) 4.93us clang keylocks1-256 on a 3.8GHz 1315U
    4.17us gcc keylocks1-256 on a 3.8GHz 1315U
    3.16us gcc keylocks2-256 on a 3.8GHz 1315U
    2.38us clang keylocks2-512 on a 3.8GHz 1315U
    =20

So, for the best-performing variant IPC of Golden Cove is identical to ancient Haswell?

    Actually worse:

    For clang keylocks2-512 Haswell has 3.73 IPC, Golden Cove 3.63.

    That's very disappointing. Haswell has 4-wide front
    end and majority of AVX2 integer instruction is limited to throughput
of two per clock. Golden Cove has 5+ wide front end and nearly all AVX2 integer instructions have throughput of three per clock.
    Could it be that clang introduced some sort of latency bottleneck?

    As far as I looked into the code, I did not see such a bottleneck.
    Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
    clang keylocks2-256), and I expect that it would suffer from a general
    latency bottleneck, too. Rocket Lake is also faster on this program
    than Haswell and Golden Cove. It seems to be just that this program
    rubs Golden Cove the wrong way.

    I would have expected the clang keylocks1-256 to run slower, because
    the compiler back-end is the same and the 1315U is slower. Measuring
    cycles looks more relevant for this benchmark to me than measuring
    time, especially on this core where AVX-512 is disabled and there is
    no AVX slowdown.
    =20

    I prefer time, because at the end it's the only thing that matter.

    True, and certainly, when stuff like AVX-512 license-based
    downclocking or thermal or power limits come into play (and are
    relevant for the measurement at hand), one has to go there. But then
    you can only compare code running on the same kind of machine,
    configured the same way. Or maybe just running on the same
    machine:-). But then, the generality of the results is questionable.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Sat Feb 8 20:42:50 2025
    On Sat, 08 Feb 2025 17:46:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sat, 08 Feb 2025 08:11:04 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Or by my own pasting mistake. I am still not sure whom to blame.
The mistake was tiny - absence of // at the beginning of one line, but enough to not compile. Trying it for a second time:

    Now it's worse, it's quoted-printable. E.g.:

    if (li >=3D len || li <=3D 0)

    Some newsreaders can decode this, mine does not.


    There is always novabbs to rescue: https://www.novabbs.com/devel/article-flat.php?id=44334&group=comp.arch#44334

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Sun Feb 9 02:57:45 2025
    On Sat, 08 Feb 2025 17:46:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

<snip>

    Back to original question of the cost of misalignment.
    I modified original code to force alignment in the inner loop:

#include <stdint.h>
#include <string.h>

int foo_tst(const uint32_t* keylocks, int len, int li)
{
  if (li <= 0 || len <= li)
    return 0;

  int lix = (li + 31) & -32;
  _Alignas(32) uint32_t tmp[lix];
  memcpy(tmp, keylocks, li*sizeof(*keylocks));
  if (lix > li)
    memset(&tmp[li], 0, (lix-li)*sizeof(*keylocks));

  int res = 0;
  for (int i = li; i < len; ++i) {
    uint32_t lock = keylocks[i];
    for (int k = 0; k < lix; ++k)
      res += (lock & tmp[k])==0;
  }
  return res - (lix-li)*(len-li);
}

    Compiled with 'clang -O3 -march=haswell'
On the same Haswell Xeon it runs at 2.841 usec/call, i.e. almost
twice as fast as the original and only 1.3x slower than the horizontally
unrolled variants.

    So, at least on Haswell, unaligned AVX256 loads are slow.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to BGB on Sat Feb 22 10:16:18 2025
    BGB wrote:
    On 2/21/2025 1:51 PM, EricP wrote:
    BGB wrote:

    Can note that the latency of carry-select adders is a little weird:
    16/32/64: Latency goes up steadily;
    But, still less than linear;
    128-bit: Only slightly more latency than 64-bit.

    The best I could find in past testing was seemingly 16-bit chunks for
    normal adding. Where, 16-bits seemed to be around the break-even
    between the chained CARRY4's and the Carry-Select (CS being slower
    below 16 bits).

    But, for a 64-bit adder, still basically need to give it a
    clock-cycle to do its thing. Though, not like 32 is particularly fast
    either; hence part of the whole 2 cycle latency on ALU ops thing.
    Mostly has to do with ADD/SUB (and CMP, which is based on SUB).


    Admittedly part of why I have such mixed feelings on full
    compare-and- branch:
    Pro: It can offer a performance advantage (in terms of per-clock);
    Con: Branch is now beholden to the latency of a Subtract.

    IIRC your cpu clock speed is about 75 MHz (13.3 ns)
    and you are saying it takes 2 clocks for a 64-bit ADD.


    The 75MHz was mostly experimental, mostly I am running at 50MHz because
    it is easier (a whole lot of corners need to be cut for 75MHz, so often overall performance ended up being worse).


    Via the main ALU, which also shares the logic for SUB and CMP and
    similar...

    Generally, I give more or less a full cycle for the ADD to do its thing,
    with the result presented to the outside world on the second cycle,
    where it can go through the register forwarding chains and similar.

    This gives it a 2 cycle latency.

    Operations with a 1 cycle latency need to feed their output directly
    into the register forwarding logic.


    In a pseudocode sense, something like:
    tValB = IsSUB ? ~valB : valB;
    tAddA0={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 0;
    tAddA1={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 1;
    tAddB0={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 0;
    tAddB1={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 1;
    tAddC0=...
    ...
    tAddSbA = tCarryIn;
    tAddSbB = tAddSbA ? tAddA1[16] : tAddA0[16];
    tAddSbC = tAddSbB ? tAddB1[16] : tAddB0[16];
    ...
    tAddRes = {
    tAddSbD ? tAddD1[15:0] : tAddD0[15:0],
    tAddSbC ? tAddC1[15:0] : tAddC0[15:0],
    tAddSbB ? tAddB1[15:0] : tAddB0[15:0],
    tAddSbA ? tAddA1[15:0] : tAddA0[15:0]
    };


    This works, but still need to ideally give it a full clock-cycle to do
    its work.
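
A small C model of that carry-select scheme, purely to illustrate the
selection logic (it is not meant as Verilog and is not the actual design):

#include <stdint.h>

uint64_t csel_add64(uint64_t a, uint64_t b, int carry_in)
{
  uint64_t result = 0;
  unsigned carry = carry_in & 1;
  int chunk;
  for (chunk = 0; chunk < 4; chunk++) {          /* four 16-bit chunks */
    uint32_t a16  = (a >> (16*chunk)) & 0xFFFF;
    uint32_t b16  = (b >> (16*chunk)) & 0xFFFF;
    uint32_t sum0 = a16 + b16;                   /* candidate, carry-in = 0 */
    uint32_t sum1 = a16 + b16 + 1;               /* candidate, carry-in = 1 */
    uint32_t sel  = carry ? sum1 : sum0;         /* mux on the incoming carry */
    result |= (uint64_t)(sel & 0xFFFF) << (16*chunk);
    carry   = (sel >> 16) & 1;                   /* carry out of this chunk */
  }
  return result;
}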



    Note that one has to be careful with logic coupling, as if too many
    things are tied together, one may get a "routing congestion" warning
    message, and generally timing fails in this case...

    Also, "inferring latch" warning is one of those "you really gotta go fix this" issues (both generally indicates Verilog bugs, and also negatively effects timing).


    I don't remember what Xilinx chip you are using but this paper describes
    how to do a 64-bit ADD at between 350 Mhz (2.8 ns) to 400 MHz (2.5 ns)
    on a Virtex-5:

    A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010
    https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/
    wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/
    project_1_resources/Adders_MELECON_2010.pdf


    As for Virtex: I am not made of money...

    Virtex tends to be absurdly expensive high-end FPGAs.
    Even the older Virtex chips are still absurdly expensive.


    Kintex is considered mid range, but still too expensive, and mostly not usable in the free versions of Vivado (and there are no real viable FOSS alternatives to Vivado). When I tried looking at some of the "open
    source" tools for targeting Xilinx chips, they were doing the hacky
    thing of basically invoking Xilinx's tools in the background (which, if
    used to target a Kintex, is essentially piracy).

    I don't think that it is copyright infringement to have a script or code generator output drive a compiler or tool instead of your hands.

    Where, a valid FOSS tool would need to be able to do everything and
    generate the bitstream itself.



    Mostly I am using Spartan-7 and Artix-7.
    Generally at the -1 speed grade (slowest, but cheapest).

The second paper also covers Spartan-6 and says it has the same
LUT architecture as Virtex-5 and -6. Their speed testing was done on
Virtex-6 but the design should apply.

    Anyway it was the concepts of how to optimize the carry that were important.
    I would expect to have to write code to port the ideas.

    These are mostly considered low-end and consumer-electronics oriented
    FPGAs by Xilinx.

    <snip>

    I have a QMTech board with an XC7A200T at -1, but generally, it seems to actually have a slightly harder time passing timing constraints than the XC7A100T in the Nexys A7 (possibly some sort of Vivado magic here).


    and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:

    Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
    http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf


Errm, on a skim, this doesn't really look like something you can pull off in normal Verilog.

Well that's what I'm trying to figure out, because it's not just this paper
but a lot, like many hundreds, of papers I've read from commercial or
academic sources that seem to be able to control the FPGA results
to a fine degree.

Generally, one doesn't have control over how the components hook together;
one can only influence what happens based on how one writes the Verilog.

    That paper mentions in section III
    "In order to reduce uncontrollable routing delays in the comparisons, everything was manually placed, according to the floorplan in Fig. 7."

    Is that the key - manually place things adjacent and hope the
    wire router does the right thing?

    That sounds too flaky. You need to be able to reliably construct optimized modules and then attach to them.

    You can just write:
    reg[63:0] tValA;
    reg[63:0] tValB;
    reg[63:0] tValC;
    tValC=tValA+tValB;


    But, then it spits out something with a chain of 16 CARRY4's, so there
    is a fairly high latency on the high order bits of the result.


    Generally, Vivado synthesis seems to mostly be happy (at 50 MHz), if the total logic path length stays under around 12 or so. Paths with 15 or
    more are often near the edge of failing timing.

    At 75MHz, one has to battle with pretty much anything much over 8.


And, at 200MHz, you have path lengths of 2 that are failing...
    Like, it seemingly can't do much more than "FF -> LUT -> FF" at these
    speeds.

    This can't just be left to the random luck of the wire router.
    There must be something else that these commercial and academic users
    are able to do to reliably optimize their design.
    Maybe its a tool only available to big bucks customers.

    This has me curious. I'm going to keep looking around.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Sat Feb 22 17:18:30 2025
    Robert Finch wrote:
    On 2025-02-22 10:16 a.m., EricP wrote:
    BGB wrote:

    Generally, Vivado synthesis seems to mostly be happy (at 50 MHz), if
    the total logic path length stays under around 12 or so. Paths with
    15 or more are often near the edge of failing timing.

    At 75MHz, one has to battle with pretty much anything much over 8.


And, at 200MHz, you have path lengths of 2 that are failing...
    Like, it seemingly can't do much more than "FF -> LUT -> FF" at these
    speeds.

    This can't just be left to the random luck of the wire router.
    There must be something else that these commercial and academic users
    are able to do to reliably optimize their design.
    Maybe its a tool only available to big bucks customers.

    This has me curious. I'm going to keep looking around.


    I am sure it can be done as I have seen a lot of papers too with results
    in the hundreds of megahertz. It has got to be the manual placement and routing that helps. The routing in my design typically takes up about
    80% of the delay. One can build circuits up out of individual primitive
    gates in Verilog (or(), and(), etc) but for behavioral purposes I do not
    do that, instead relying on the tools to generate the best combinations
    of gates. It is a ton of work to do everything manually. I am happy to
    have things work at 40 MHz even though 200 MHz may be possible with 10x
    the work put into it. Typically running behavioural code. Doing things
    mostly for my own edification. ( I have got my memory controller working
    at 200 MHz, so it is possible).
    One thing that I have found that helps is to use smaller modules and
    tasks for repetitive code where possible. The tools seem to put together
    a faster design if everything is smaller modules. I ponder it may have
    to do with making place and route easier.

    I downloaded a bunch of Vivado User Guides PDF's from AMD/Xilinx.
    They say it can be done. It seems to be done with "constraints" files, assigning properties to the devices and netlists,
    defining relative placement macros, etc.

    It sounds like one can optimize a module independently, what they call
    Out Of Context (OOC), then checkpoint that module design and reload it.

    For BGB it might be sufficient to just optimize the ALU from 2 to 1 clock, checkpoint that module design, and that might lower his synthesis time
    and double his ALU performance.


    From:
    UG892 Vivado Design Suite User Guide Design Flows Overview 2024-11-13

    "Hierarchical Design
    Hierarchical Design (HD) flows enable you to partition a design into smaller, more manageable modules to be processed independently. The hierarchical
    design flow involves proper module interface design, constraint definition, floorplanning, and some special commands and design techniques.

    Using a modular approach to the hierarchical design lets you analyze modules independent of the rest of the design, and reuse modules in the top-down design. A team of users can iterate on specific sections of a design,
    achieving timing closure and other design goals, and reuse the results.

    There are several Vivado features that enable a hierarchical design approach, such as the synthesis of a logic module outside of the context (OOC) of the top-level design. You can select specific modules, or levels of the design hierarchy, and synthesize them OOC. Module-level constraints can be applied
    to optimize and validate module performance. The module design checkpoint
    (DCP) will then be applied during implementation to build the top-level netlist. This method can help reduce top-level synthesis runtime, and
    eliminate re-synthesis of completed modules."

    I haven't found just how to do it yet as the info appears to be spread
    across multiple documents. Some relevant ones may be:

    UG903 Vivado Design Suite User Guide Using Constraints 2024-12-20
    UG904 Vivado Design Suite User Guide Implementation 2024-11-14
    UG905 Vivado Design Suite User Guide Hierarchical Design OBSOLETE 2023-10-18
    UG906 Vivado Design Suite User Guide Design Analysis and Closure Techniques 2024-12-19
    UG912 Vivado Design Suite User Guide Properties Reference Guide 2024-12-18

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Sun Feb 2 11:45:19 2025
    As you can see in the article below, the cost of NOT handling misaligned accesses in hardware is quite high in cpu clocks.

    To my eye, the incremental cost of adding hardware support for misaligned
    to the AGU and cache data path should be quite low. The alignment shifter
    is basically the same: assuming a 64-byte cache line, LD still has to
    shift any of the 64 bytes into position 0, and reverse for ST.

    The incremental cost is in a sequencer in the AGU for handling cache
    line and possibly virtual page straddles, and a small byte shifter to
    left shift the high order bytes. The AGU sequencer needs to know if the
    line straddles a page boundary, if not then increment the 6-bit physical
    line number within the 4 kB physical frame number, if yes then increment virtual page number and TLB lookup again and access the first line.
    (Slightly more if multiple page sizes are supported, but same idea.)
    For a load AGU merges the low and high fragments and forwards.
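
    As a minimal illustration of that sequencing (my sketch, not from the
    post; the 64-byte line, 4 kB page, and all names are assumptions), the
    decision reduces to: split the access into two aligned line accesses,
    and redo the TLB lookup only when the second line falls in the next page:

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_SIZE 64u     /* assumed cache-line size */
    #define PAGE_SIZE 4096u   /* assumed base page size  */

    struct split {
      bool     straddles_line;  /* a second line access is needed           */
      bool     straddles_page;  /* the second line needs its own TLB lookup */
      uint64_t line0_va;        /* aligned VA of the first line             */
      uint64_t line1_va;        /* aligned VA of the second line (if any)   */
    };

    /* Model of the AGU sequencer's straddle handling for an access of
       'size' bytes at virtual address 'va'. */
    static struct split agu_split(uint64_t va, unsigned size)
    {
      struct split s;
      uint64_t off     = va & (LINE_SIZE - 1);
      s.straddles_line = off + size > LINE_SIZE;
      s.line0_va       = va & ~(uint64_t)(LINE_SIZE - 1);
      s.line1_va       = s.line0_va + LINE_SIZE;
      /* Same page: just increment the line index within the frame.
         Next page: a second TLB lookup is required before the access. */
      s.straddles_page = s.straddles_line &&
                         (s.line1_va & (PAGE_SIZE - 1)) == 0;
      return s;
    }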

    I don't think there are line straddle consequences for coherence because
    there are no ordering guarantees for misaligned accesses.

    The hardware cost appears trivial, especially within an OoO core.
    So there doesn't appear to be any reason to not handle this.
    Am I missing something?

    https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/

    [about half way down]

    "Before accessing cache, load addresses have to be checked against
    older stores (and vice versa) to ensure proper ordering. If there is a dependency, P550 can only do fast store forwarding if the load and store addresses match exactly and both accesses are naturally aligned.
    Any unaligned access, dependent or not, confuses P550 for hundreds of
    cycles. Worse, the unaligned loads and stores don't proceed in parallel.
    An unaligned load takes 1062 cycles, an unaligned store takes
    741 cycles, and the two together take over 1800 cycles.

    This terrible unaligned access behavior is atypical even for low power
    cores. Arm's Cortex A75 only takes 15 cycles in the worst case of
    dependent accesses that are both misaligned.

    Digging deeper with performance counters reveals executing each unaligned
    load instruction results in ~505 executed instructions. P550 almost
    certainly doesn't have hardware support for unaligned accesses.
    Rather, it's likely raising a fault and letting an operating system
    handler emulate it in software."

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Sun Feb 2 17:44:58 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    The incremental cost is in a sequencer in the AGU for handling cache
    line and possibly virtual page straddles, and a small byte shifter to
    left shift the high order bytes. The AGU sequencer needs to know if the
    line straddles a page boundary, if not then increment the 6-bit physical
    line number within the 4 kB physical frame number, if yes then increment virtual page number and TLB lookup again and access the first line.
    (Slightly more if multiple page sizes are supported, but same idea.)
    For a load AGU merges the low and high fragments and forwards.
    ...
    The hardware cost appears trivial, especially within an OoO core.
    So there doesn't appear to be any reason to not handle this.
    Am I missing something?

    The OS must also be able to keep both pages in physical memory until
    the access is complete, or there will be no progress. Should not be a
    problem these days, but the 48 pages or so potentially needed by VAX complicated the OS.

    Yes, hardware is not hard, there is software that benefits, and as a
    result, modern architectures (including RISC-V) now support unaligned
    accesses (except for atomic accesses).

    https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
    ...
    This terrible unaligned access behavior is atypical even for low power
    cores. Arm's Cortex A75 only takes 15 cycles in the worst case of
    dependent accesses that are both misaligned.

    Digging deeper with performance counters reveals executing each unaligned load instruction results in ~505 executed instructions.

    This is similar to what I measured on an U74 core from SiFive <2024May14.073553@mips.complang.tuwien.ac.at>, so they probably use
    the same solution.

    P550 almost
    certainly doesn't have hardware support for unaligned accesses.
    Rather, it's likely raising a fault and letting an operating system
    handler emulate it in software."

    The architecture guarantees that unaligned accesses work, so the OS
    might not have support for such emulation. Another option would be to
    trap into some kind of firmware-supplied fixup code, along the lines
    of Alpha's PALcode.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sun Feb 2 18:10:35 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The OS must also be able to keep both pages in physical memory until
    the access is complete, or there will be no progress. Should not be a problem these days, but the 48 pages or so potentially needed by VAX complicated the OS.

    48 pages? What instruction would need that?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sun Feb 2 18:51:33 2025
    On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:

    As you can see in the article below, the cost of NOT handling misaligned accesses in hardware is quite high in cpu clocks.

    To my eye, the incremental cost of adding hardware support for
    misaligned
    to the AGU and cache data path should be quite low. The alignment
    shifter
    is basically the same: assuming a 64-byte cache line, LD still has to
    shift any of the 64 bytes into position 0, and reverse for ST.

    A handful of gates to detect misalignedness and recognize the line and
    page crossing misalignments.

    The alignment shifters are twice as big.

    Now, while I accept these costs, I accept that others may not. I accept
    these costs because of the performance issues when I don't.

    The incremental cost is in a sequencer in the AGU for handling cache
    line and possibly virtual page straddles, and a small byte shifter to
    left shift the high order bytes. The AGU sequencer needs to know if the
    line straddles a page boundary, if not then increment the 6-bit physical
    line number within the 4 kB physical frame number, if yes then increment virtual page number and TLB lookup again and access the first line.
    (Slightly more if multiple page sizes are supported, but same idea.)
    For a load AGU merges the low and high fragments and forwards.

    I don't think there are line straddle consequences for coherence because there is no ordering guarantees for misaligned accesses.

    Generally stated as:: Misaligned accesses cannot be considered ATOMIC.

    The hardware cost appears trivial, especially within an OoO core.
    So there doesn't appear to be any reason to not handle this.
    Am I missing something?

    https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/

    [about half way down]

    "Before accessing cache, load addresses have to be checked against
    older stores (and vice versa) to ensure proper ordering. If there is a dependency, P550 can only do fast store forwarding if the load and store addresses match exactly and both accesses are naturally aligned.
    Any unaligned access, dependent or not, confuses P550 for hundreds of
    cycles. Worse, the unaligned loads and stores don’t proceed in parallel.
    An unaligned load takes 1062 cycles, an unaligned store takes
    741 cycles, and the two together take over 1800 cycles.

    This terrible unaligned access behavior is atypical even for low power
    cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
    dependent accesses that are both misaligned.

    Digging deeper with performance counters reveals executing each
    unaligned
    load instruction results in ~505 executed instructions. P550 almost
    certainly doesn’t have hardware support for unaligned accesses.
    Rather, it’s likely raising a fault and letting an operating system
    handler emulate it in software."

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sun Feb 2 18:55:01 2025
    On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:

    https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/

    [about half way down]

    "Before accessing cache, load addresses have to be checked against
    older stores (and vice versa) to ensure proper ordering. If there is a dependency, P550 can only do fast store forwarding if the load and store addresses match exactly and both accesses are naturally aligned.
    Any unaligned access, dependent or not, confuses P550 for hundreds of
    cycles. Worse, the unaligned loads and stores don’t proceed in parallel.
    An unaligned load takes 1062 cycles, an unaligned store takes
    741 cycles, and the two together take over 1800 cycles.

    This terrible unaligned access behavior is atypical even for low power
    cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
    dependent accesses that are both misaligned.

    Digging deeper with performance counters reveals executing each
    unaligned
    load instruction results in ~505 executed instructions. P550 almost
    certainly doesn’t have hardware support for unaligned accesses.
    Rather, it’s likely raising a fault and letting an operating system
    handler emulate it in software."

    1800 cycles divided by 505 instructions is 3.6 cycles per instruction
    or 0.277 instructions per cycle--compared to an extra cycle or two
    when HW does it all by itself.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Mon Feb 3 01:43:55 2025
    On Sun, 2 Feb 2025 22:44:13 +0000, Chris M. Thomasson wrote:

    On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
    On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
    -------------
    I don't think there are line straddle consequences for coherence because >>> there is no ordering guarantees for misaligned accesses.

    Generally stated as:: Misaligned accesses cannot be considered ATOMIC.

    Try it on an x86/x64. Straddle a l2 cache line and use it with a LOCK'ed
    RMW. It should assert the BUS lock.

    Consider this approach when you have a cabinet of slid in servers,
    each server having 128 cores, the cabinet being cache coherent,
    and the cabinet having 4096 cores.

    Can you say "it donna scale" ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Mon Feb 3 01:51:09 2025
    On Sun, 2 Feb 2025 22:45:53 +0000, BGB wrote:

    On 2/2/2025 10:45 AM, EricP wrote:
    As you can see in the article below, the cost of NOT handling misaligned
    accesses in hardware is quite high in cpu clocks.

    To my eye, the incremental cost of adding hardware support for
    misaligned
    to the AGU and cache data path should be quite low. The alignment
    shifter
    is basically the same: assuming a 64-byte cache line, LD still has to
    shift any of the 64 bytes into position 0, and reverse for ST.

    The incremental cost is in a sequencer in the AGU for handling cache
    line and possibly virtual page straddles, and a small byte shifter to
    left shift the high order bytes. The AGU sequencer needs to know if the
    line straddles a page boundary, if not then increment the 6-bit physical
    line number within the 4 kB physical frame number, if yes then increment
    virtual page number and TLB lookup again and access the first line.
    (Slightly more if multiple page sizes are supported, but same idea.)
    For a load AGU merges the low and high fragments and forwards.

    I don't think there are line straddle consequences for coherence because
    there is no ordering guarantees for misaligned accesses.


    IMO, the main costs of unaligned access in hardware:
    Cache may need two banks of cache lines
    lets call them "even" and "odd".
    an access crossing a line boundary may need both an even and odd
    line;
    slightly more expensive extract and insert logic.

    The main costs of not having unaligned access in hardware:
    Code either faults or performs like dog crap;
    Some pieces of code need convoluted workarounds;
    Some algorithms have no choice other than to perform like crap.


    Even if most of the code doesn't need unaligned access, the parts that
    do need it, significantly need it to perform well.

    Well, at least excluding wonk in the ISA, say:
    A load/store pair that discards the low-order bits;
    An extract/insert instruction that operates on a register pair using the LOB's of the pointer.

    In effect, something vaguely akin (AFAIK) to what existed on the DEC
    Alpha.


    The hardware cost appears trivial, especially within an OoO core.
    So there doesn't appear to be any reason to not handle this.
    Am I missing something?


    For an OoO core, any cost difference in the L1 cache here is likely to
    be negligible.


    For anything much bigger than a small microcontroller, I would assume designing a core that handles unaligned access effectively.


    https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-
    microarchitecture/

    [about half way down]

    "Before accessing cache, load addresses have to be checked against
    older stores (and vice versa) to ensure proper ordering. If there is a
    dependency, P550 can only do fast store forwarding if the load and store
    addresses match exactly and both accesses are naturally aligned.
    Any unaligned access, dependent or not, confuses P550 for hundreds of
    cycles. Worse, the unaligned loads and stores don’t proceed in parallel. >> An unaligned load takes 1062 cycles, an unaligned store takes
    741 cycles, and the two together take over 1800 cycles.

    This terrible unaligned access behavior is atypical even for low power
    cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
    dependent accesses that are both misaligned.

    Digging deeper with performance counters reveals executing each
    unaligned
    load instruction results in ~505 executed instructions. P550 almost
    certainly doesn’t have hardware support for unaligned accesses.
    Rather, it’s likely raising a fault and letting an operating system
    handler emulate it in software."


    An emulation fault, or something similarly nasty...


    At that point, even turning any potentially unaligned load or store into
    a runtime call is likely to be a lot cheaper.

    Say:
    __mem_ld_unaligned:
    ANDI X15, X10, 7
    BEQ .aligned, X15, X0
    SUB X14, X10, X15
    LW X12, 0(X14)
    LW X13, 8(X14)
    SLLI X14, X15, 3
    LI X17, 64
    SUB X16, X17, X14
    SRL X12, X12, X14
    SLL X13, X13, X16
    OR X10, X12, X13
    RET
    .aligned:
    LW X10, 0(X10)
    RET

    The separate aligned case is needed because a shift count of 64 (which is
    what the SLL would see for an aligned address) simply returns the input
    unchanged, since (64&63)==0, and the OR would then produce a wrong result.
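
    For illustration, a minimal C sketch of the same idea (the function name
    and the little-endian, safe-to-read-the-containing-words assumptions are
    mine, not from the post); the aligned path is split out precisely so that
    no shift count ever reaches 64:

    #include <stdint.h>

    /* Load 8 bytes from a possibly misaligned address using only aligned
       64-bit loads.  Sketch only: assumes little-endian, and that reading
       the aligned words containing the value stays within mapped memory. */
    uint64_t mem_ld_unaligned(const void *p)
    {
      uintptr_t a   = (uintptr_t)p;
      unsigned  off = (unsigned)(a & 7);
      const uint64_t *base = (const uint64_t *)(a - off);

      if (off == 0)   /* aligned: the shifts below would be by 0 and 64 */
        return base[0];

      uint64_t lo = base[0];
      uint64_t hi = base[1];
      return (lo >> (8 * off)) | (hi << (64 - 8 * off));
    }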


    Though not supported by GCC or similar, dedicated __aligned and
    __unaligned keywords could help here, to specify which pointers are
    aligned (no function call), unaligned (needs function call) and default (probably aligned).

    All of which vanish when the HW does misaligned accesses.
    {{It makes the job of the programmer easier}}

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to BGB on Sun Feb 2 21:03:29 2025
    BGB wrote:
    On 2/2/2025 12:10 PM, Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The OS must also be able to keep both pages in physical memory until
    the access is complete, or there will be no progress. Should not be a
    problem these days, but the 48 pages or so potentially needed by VAX
    complicated the OS.

    48 pages? What instruction would need that?

    Hmm...


    I ended up with a 4-way set associative TLB as it ended up being needed
    to avoid the CPU getting stuck in a TLB-miss loop in the worst-case
    scenario:
    An instruction fetch where the line-pair crosses a page boundary (and L1
    I$ misses) for an instruction accessing a memory address where the
    line-pair also crosses a page boundary (and the L1 D$ misses).

    One can almost get away with two-way, except that almost inevitably the
    CPU would encounter and get stuck in an infinite TLB miss loop (despite
    the seeming rarity, happens roughly once every few seconds or so).

    ....


    That is because you have a software managed TLB so all PTE's
    referenced by an instruction must be resident in TLB for success.
    If three PTE are required by an instruction and they map to
    the same 2-way row and conflict evict then bzzzzt livelock loop.

    So you need at least as many set assoc TLB ways as the worst case VA's referenced by any instruction.

    With a HW table walker you can just let it evict and reload.
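
    As a worked example (mine, not from the post; the 4 kB page size is an
    assumption), counting the PTEs one instruction may need live at once
    shows why two ways can be too few:

    #include <stdint.h>

    #define PAGE_SIZE 4096u   /* assumed base page size */

    /* Pages touched by an access of 'len' bytes at 'va': 1, or 2 on a straddle. */
    static unsigned pages_touched(uint64_t va, unsigned len)
    {
      return ((va & (PAGE_SIZE - 1)) + len > PAGE_SIZE) ? 2u : 1u;
    }

    /* Worst case for one instruction: a fetch whose line pair straddles a page
       plus a data access whose line pair straddles a page -- up to 4 PTEs that
       must all be TLB-resident at once.  With a software-refilled TLB, if more
       of them index the same set than there are ways, the refill of one evicts
       another that is still needed: the livelock loop described above. */
    static unsigned ptes_needed(uint64_t fetch_va, unsigned fetch_len,
                                uint64_t data_va,  unsigned data_len)
    {
      return pages_touched(fetch_va, fetch_len) + pages_touched(data_va, data_len);
    }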

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Sun Feb 2 21:24:21 2025
    EricP wrote:
    BGB wrote:
    On 2/2/2025 12:10 PM, Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The OS must also be able to keep both pages in physical memory until
    the access is complete, or there will be no progress. Should not be a >>>> problem these days, but the 48 pages or so potentially needed by VAX
    complicated the OS.

    48 pages? What instruction would need that?

    Hmm...


    I ended up with a 4-way set associative TLB as it ended up being
    needed to avoid the CPU getting stuck in a TLB-miss loop in the
    worst-case scenario:
    An instruction fetch where the line-pair crosses a page boundary (and
    L1 I$ misses) for an instruction accessing a memory address where the
    line-pair also crosses a page boundary (and the L1 D$ misses).

    One can almost get away with two-way, except that almost inevitably
    the CPU would encounter and get stuck in an infinite TLB miss loop
    (despite the seeming rarity, happens roughly once every few seconds or
    so).

    ....


    That is because you have a software managed TLB so all PTE's
    referenced by an instruction must be resident in TLB for success.
    If three PTE are required by an instruction and they map to
    the same 2-way row and conflict evict then bzzzzt livelock loop.

    So you need at least as many set assoc TLB ways as the worst case VA's referenced by any instruction.

    And this just accounts for the instruction that TLB-miss'ed.
    If the TLB-miss handler code or data itself can possibly conflict
    on the same TLB row then you have to add 2, 3 or 4 more ways for it.

    Also assumes FIFO or LRU reuse of ways in a row. If the victim way is
    randomly selected then you need extra ways to add some spare padding,
    and the odds of succeeding become statistical.

    With a HW table walker you can just let it evict and reload.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Mon Feb 3 06:55:50 2025
    BGB <cr88192@gmail.com> writes:
    On 2/2/2025 10:45 AM, EricP wrote:
    Digging deeper with performance counters reveals executing each unaligned
    load instruction results in ~505 executed instructions. P550 almost
    certainly doesn’t have hardware support for unaligned accesses.
    Rather, it’s likely raising a fault and letting an operating system
    handler emulate it in software."


    An emulation fault, or something similarly nasty...


    At that point, even turning any potentially unaligned load or store into
    a runtime call is likely to be a lot cheaper.

    There are lots of potentially unaligned loads and stores. There are
    very few actually unaligned loads and stores: On Linux-Alpha every
    unaligned access is logged by default, and the number of
    unaligned-access entries in the logs of our machines was relatively
    small (on average a few per day). So trapping actual unaligned
    accesses was faster than replacing potential unaligned accesses with
    code sequences that synthesize the unaligned access from aligned
    accesses.

    Of course, if the cost of unaligned accesses is that high, you will
    avoid them in cases like block copies where cheap unaligned accesses
    would otherwise be beneficial.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Sun Feb 9 14:17:44 2025
    On Sat, 08 Feb 2025 17:46:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sat, 08 Feb 2025 08:11:04 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    That's very disappointing. Haswell has 4-wide front
    end and majority of AVX2 integer instruction is limited to throughput
    of two per clock. Golden Cove has 5+ wide front end and nearly all
    AVX2 integer instruction have throughput of three per clock.
    Could it be that clang introduced some sort of latency bottleneck?

    As far as I looked into the code, I did not see such a bottleneck.
    Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
    clang keylocks2-256), and I expect that it would suffer from a general latency bottleneck, too. Rocket Lake is also faster on this program
    than Haswell and Golden Cove. It seems to be just that this program
    rubs Golden Cove the wrong way.


    Did you look at the code in the outer loop as well?
    The number of iterations in the inner loop is not huge, so excessive
    folding of accumulators in the outer loop could be a problem too.
    It shouldn't, theoretically, but somehow it could.

    And if you still didn't manage to get my source compiled, here is
    another version, slightly less clever, but more importantly, formatted
    with shorter lines:

    #include <stdint.h>
    #include <immintrin.h>

    #define BROADCAST_u32(p) \
    _mm256_castps_si256(_mm256_broadcast_ss((const float*)(p)))

    #define ADD_NZ(acc, x, y) _mm256_sub_epi32(acc, _mm256_cmpeq_epi32 \
    (_mm256_and_si256(x, y), _mm256_setzero_si256()))

    int foo_tst(const uint32_t* keylocks, int len, int li)
    {
      if (li >= len || li <= 0)
        return 0;
      const uint32_t* px = &keylocks[li];
      unsigned nx = len - li;
      __m256i res0 = _mm256_setzero_si256();
      __m256i res1 = _mm256_setzero_si256();
      __m256i res2 = _mm256_setzero_si256();
      __m256i res3 = _mm256_setzero_si256();

      int nx1 = nx & 31;
      if (nx1) {
        const uint32_t* px_last = &px[nx1];
        // process head, 8 x values per loop
        static const int32_t masks[15] = {
          -1, -1, -1, -1, -1, -1, -1, -1,
           0,  0,  0,  0,  0,  0,  0,
        };
        int rem0 = (-nx) & 7;
        __m256i mask = _mm256_loadu_si256((const __m256i*)&masks[rem0]);
        __m256i x = _mm256_maskload_epi32((const int*)px, mask);
        px += 8 - rem0;
        const uint32_t* py1 = &keylocks[li & -4];
        const uint32_t* py2 = &keylocks[li];
        for (;;) {
          const uint32_t* py;
          for (py = keylocks; py != py1; py += 4) {
            res0 = ADD_NZ(res0, x, BROADCAST_u32(&py[0]));
            res1 = ADD_NZ(res1, x, BROADCAST_u32(&py[1]));
            res2 = ADD_NZ(res2, x, BROADCAST_u32(&py[2]));
            res3 = ADD_NZ(res3, x, BROADCAST_u32(&py[3]));
          }
          for (; py != py2; py += 1)
            res0 = ADD_NZ(res0, x, BROADCAST_u32(py));
          if (px == px_last)
            break;
          x = _mm256_loadu_si256((const __m256i*)px);
          px += 8;
        }
      }

      int nx2 = nx & -32;
      const uint32_t* px_last = &px[nx2];
      for (; px != px_last; px += 32) {
        __m256i x0 = _mm256_loadu_si256((const __m256i*)&px[0*8]);
        __m256i x1 = _mm256_loadu_si256((const __m256i*)&px[1*8]);
        __m256i x2 = _mm256_loadu_si256((const __m256i*)&px[2*8]);
        __m256i x3 = _mm256_loadu_si256((const __m256i*)&px[3*8]);
        for (const uint32_t* py = keylocks; py != &keylocks[li]; ++py) {
          __m256i y = BROADCAST_u32(py);
          res0 = ADD_NZ(res0, y, x0);
          res1 = ADD_NZ(res1, y, x1);
          res2 = ADD_NZ(res2, y, x2);
          res3 = ADD_NZ(res3, y, x3);
        }
      }
      // fold accumulators
      res0 = _mm256_add_epi32(res0, res2);
      res1 = _mm256_add_epi32(res1, res3);
      res0 = _mm256_add_epi32(res0, res1);
      res0 = _mm256_hadd_epi32(res0, res0);
      res0 = _mm256_hadd_epi32(res0, res0);
      int res = _mm256_extract_epi32(res0, 0)
              + _mm256_extract_epi32(res0, 4);
      return res - (-nx & 7) * li;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Sun Feb 9 22:37:57 2025
    On Sat, 08 Feb 2025 08:11:04 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    You can find the source code and the binaries I measured at <http://www.complang.tuwien.ac.at/anton/keylock/>.

    The server is so slow that it is unusable.
    Transfer rate is on the order of 1-2 bytes/s.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Mon Feb 17 10:00:20 2025
    On 2025-02-17, Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    No, the real problem is when a compiler want to auto-vectorize any code working with 1/2/4/8 byte items: All of a sudden the alignment
    requirement went from the item stride to the vector register stride
    (16/32/64 bytes).

    The only way this can work is to have the compiler control _all_
    allocations to make sure they are properly aligned, including code in libraries, or the compiler will be forced to use vector load/store
    operations which do allow unaligned access.

    Not necessarily the compiler's choice - compiler-generated code
    has to deal with everything that conforms to the ABI, and if that
    specifies 8-byte aligned pointers to doubles, the compiler cannot
    assume otherwise unless directed.

    Loop peeling might help, but becomes difficult when more than
    one pointer is involved. Consider a dot product calculation
    which you want to vectorize with 256-bit SIMD instructions,
    with pointers a and b.

    You then have to deal with the case (uintptr_t) a % 32 == 1
    and (uintptr_t) a % 32 == 3, for example.

    Or you can use an extension, like __attribute__ ((aligned(32))).
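
    As a concrete sketch of that last option (the function name and the
    32-byte figure are illustrative; __builtin_assume_aligned and the aligned
    attribute are GCC/Clang extensions), telling the compiler about the
    alignment lets it vectorize without peeling or unaligned vector accesses:

    #include <stddef.h>

    /* Dot product where the caller promises 32-byte alignment of both
       arrays; the compiler can then use aligned 256-bit loads directly. */
    double dot_aligned32(const double *restrict a, const double *restrict b,
                         size_t n)
    {
      const double *pa = __builtin_assume_aligned(a, 32);
      const double *pb = __builtin_assume_aligned(b, 32);
      double s = 0.0;
      for (size_t i = 0; i < n; i++)
        s += pa[i] * pb[i];
      return s;
    }

    The allocation side then has to actually deliver that alignment, e.g.
    with __attribute__ ((aligned(32))) on the definition or
    aligned_alloc(32, ...).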

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Marcus on Mon Feb 17 10:37:57 2025
    Marcus wrote:
    On 2025-02-03, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 2/2/2025 10:45 AM, EricP wrote:
    Digging deeper with performance counters reveals executing each
    unaligned
    load instruction results in ~505 executed instructions. P550 almost
    certainly doesn’t have hardware support for unaligned accesses. >>>> Rather, it’s likely raising a fault and letting an operating system >>>> handler emulate it in software."


    An emulation fault, or something similarly nasty...


    At that point, even turning any potentially unaligned load or store into >>> a runtime call is likely to be a lot cheaper.

    There are lots of potentially unaligned loads and stores.  There are
    very few actually unaligned loads and stores: On Linux-Alpha every
    unaligned access is logged by default, and the number of
    unaligned-access entries in the logs of our machines was relatively
    small (on average a few per day).  So trapping actual unaligned
    accesses was faster than replacing potential unaligned accesses with
    code sequences that synthesize the unaligned access from aligned
    accesses.

    If you compile regular C/C++ code that does not intentionally do any
    nasty stuff, you will typically have zero unaligned loads stores.

    My machine still does not support unaligned accesses in hardware (it's
    on the todo list), and it can run an awful lot of software without
    problems.

    The problem arises when the programmer *deliberately* does unaligned
    loads and stores in order to improve performance. Or rather, if the programmer knows that the hardware supports unaligned loads and stores, he/she can use that to write faster code in some special cases.

    No, the real problem is when a compiler want to auto-vectorize any code working with 1/2/4/8 byte items: All of a sudden the alignment
    requirement went from the item stride to the vector register stride
    (16/32/64 bytes).

    The only way this can work is to have the compiler control _all_
    allocations to make sure they are properly aligned, including code in libraries, or the compiler will be forced to use vector load/store
    operations which do allow unaligned access.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to BGB on Sun Feb 23 11:13:53 2025
    BGB wrote:
    On 2/22/2025 1:25 PM, Robert Finch wrote:
    On 2025-02-22 10:16 a.m., EricP wrote:
    BGB wrote:
    On 2/21/2025 1:51 PM, EricP wrote:

    and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:

    Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs,
    2016
    http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf


    Errm, skim, this doesn't really look like something you can pull off
    in normal Verilog.

    Well that's what I'm trying to figure out because its not just this
    paper
    but a lot, like many hundreds, of papers I've read from commercial or
    academic source that seem to be able to control the FPGA results
    to a fine degree.


    You could invoke some of the LE's directly as primitives in Verilog, but
    then one has an ugly mess that will only work on a specific class of FPGA.

    Generally though, one has access in terms of said primitives, rather
    than control over the logic block.


    Vs, say, code that will work with Verilator, Vivado, and Quartus,
    without needing to be entirely rewritten for each.


    Though, that said, my design might still need some reworking to be "effective" with Quartus or Altera hardware; or to use the available hardware.

    Ok but this "portability" appears to be costing you dearly.

    Say, rather than like on a Spartan or Artix (pure FPGA), the Cyclone
    FPGA's tend to include ARM hard processors, with the FPGA and ARM cores
    able to communicate over a bus. The FPGA part of the DE10 apparently has
    its own RAM chip, but it is SDRAM (rather than DDR2 or DDR3 like in a
    lot of the Xilinx based boards).

    Well, apart from some low-end boards which use QSPI SRAMs (though,
    having looked, a lot of these RAMs are DRAM internally, but the RAM
    module has its own RAM refresh logic).



    This can't just be left to the random luck of the wire router.
    There must be something else that these commercial and academic users
    are able to do to reliably optimize their design.
    Maybe it's a tool only available to big bucks customers.

    This has me curious. I'm going to keep looking around.


    I am sure it can be done as I have seen a lot of papers too with
    results in the hundreds of megahertz. It has got to be the manual
    placement and routing that helps. The routing in my design typically
    takes up about 80% of the delay. One can build circuits up out of
    individual primitive gates in Verilog (or(), and(), etc) but for
    behavioral purposes I do not do that, instead relying on the tools to
    generate the best combinations of gates. It is a ton of work to do
    everything manually. I am happy to have things work at 40 MHz even
    though 200 MHz may be possible with 10x the work put into it.
    Typically running behavioural code. Doing things mostly for my own
    edification. ( I have got my memory controller working at 200 MHz, so
    it is possible).
    One thing that I have found that helps is to use smaller modules and
    tasks for repetitive code where possible. The tools seem to put
    together a faster design if everything is smaller modules. I ponder it
    may have to do with making place and route easier.


    It is also possible to get higher speeds with smaller/simple designs.

    But, yeah, also I can note in Vivado, that the timing does tend to be dominated more by "net delay" rather than "logic delay".



    This is why my thoughts for a possible 75 MHz focused core would be to
    drop down to 2-wide superscalar. It is more a question of what could be
    done to try to leverage the higher clock-speed to an advantage (and not
    lose too much performance in other areas).

    You are missing my point. You are trying to work around a problem with
    low-level module design by rearranging high-level architecture components.

    It sounds like your ALU stage is taking about 20 ns to do an ADD
    and that is having consequences that ripple through the design,
    like taking an extra clock for result forwarding,
    which causes performance issues when considering Compare And Branch,
    and would cause a stall with back-to-back operations.

    This goes back to module optimization where you said:

    BGB wrote:
    On 2/21/2025 1:51 PM, EricP wrote:

    and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:

    Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
    http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf


    Errm, skim, this doesn't really look like something you can pull off in normal Verilog.

    Generally, one doesn't control over how the components hook together,
    only one can influence what happens based on how they write their Verilog.

    You can just write:
    reg[63:0] tValA;
    reg[63:0] tValB;
    reg[63:0] tValC;
    tValC=tValA+tValB;


    But, then it spits out something with a chain of 16 CARRY4's, so there
    is a fairly high latency on the high order bits of the result.

    It looks to me that Vivado intends that after you get your basic design working, this module optimization is *exactly* what one is supposed to do.

    In this case the prototype design establishes that you need multiple
    64-bit adders and the generic ones synthesis spits out are slow.
    So you isolate that module off, use Verilog to drive the basic LE
    selections, then iterate doing relative LE placement specifiers,
    route the module, and when you get the fastest 64-bit adder you can
    then lock down the netlist and save the module design.

    Now you have a plug-in 64-bit adder module that runs at (I don't know
    the speed difference between Virtex and your Spartan-7 so wild guess)
    oh, say, 4 ns, to use multiple places... fetch, decode, alu, agu.

    Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
    isolate that module, optimize placement, route, lock down netlist,
    and now you have a 5 ns plug-in ALU module.

    Doing this you build up your own IP library of optimized hardware modules.

    As more and more modules are optimized the system synthesis gets faster
    because much of the fine grain work and routing is already done.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to EricP on Mon Feb 24 00:08:24 2025
    On Sun, 23 Feb 2025 11:13:53 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    BGB wrote:
    On 2/22/2025 1:25 PM, Robert Finch wrote:
    On 2025-02-22 10:16 a.m., EricP wrote:
    BGB wrote:
    On 2/21/2025 1:51 PM, EricP wrote:

    and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:

    Fast and Area Efficient Adder for Wide Data in Recent Xilinx
    FPGAs, 2016
    http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf


    Errm, skim, this doesn't really look like something you can pull
    off in normal Verilog.

    Well that's what I'm trying to figure out because its not just
    this paper
    but a lot, like many hundreds, of papers I've read from
    commercial or academic source that seem to be able to control the
    FPGA results to a fine degree.


    You could invoke some of the LE's directly as primitives in
    Verilog, but then one has an ugly mess that will only work on a
    specific class of FPGA.

    Generally though, one has access in terms of said primitives,
    rather than control over the logic block.


    Vs, say, code that will work with Verilator, Vivado, and Quartus,
    without needing to be entirely rewritten for each.


    Though, that said, my design might still need some reworking to be "effective" with Quartus or Altera hardware; or to use the
    available hardware.

    Ok but this "portability" appears to be costing you dearly.

    Say, rather than like on a Spartan or Artix (pure FPGA), the
    Cyclone FPGA's tend to include ARM hard processors, with the FPGA
    and ARM cores able to communicate over a bus. The FPGA part of the
    DE10 apparently has its own RAM chip, but it is SDRAM (rather than
    DDR2 or DDR3 like in a lot of the Xilinx based boards).

    Well, apart from some low-end boards which use QSPI SRAMs (though,
    having looked, a lot of these RAMs are DRAM internally, but the RAM
    module has its own RAM refresh logic).



    This can't just be left to the random luck of the wire router.
    There must be something else that these commercial and academic
    users are able to do to reliably optimize their design.
    Maybe its a tool only available to big bucks customers.

    This has me curious. I'm going to keep looking around.


    I am sure it can be done as I have seen a lot of papers too with
    results in the hundreds of megahertz. It has got to be the manual
    placement and routing that helps. The routing in my design
    typically takes up about 80% of the delay. One can build circuits
    up out of individual primitive gates in Verilog (or(), and(), etc)
    but for behavioral purposes I do not do that, instead relying on
    the tools to generate the best combinations of gates. It is a ton
    of work to do everything manually. I am happy to have things work
    at 40 MHz even though 200 MHz may be possible with 10x the work
    put into it. Typically running behavioural code. Doing things
    mostly for my own edification. ( I have got my memory controller
    working at 200 MHz, so it is possible).
    One thing that I have found that helps is to use smaller modules
    and tasks for repetitive code where possible. The tools seem to
    put together a faster design if everything is smaller modules. I
    ponder it may have to do with making place and route easier.


    It is also possible to get higher speeds with smaller/simple
    designs.

    But, yeah, also I can note in Vivado, that the timing does tend to
    be dominated more by "net delay" rather than "logic delay".



    This is why my thoughts for a possible 75 MHz focused core would be
    to drop down to 2-wide superscalar. It is more a question of what
    could be done to try to leverage the higher clock-speed to an
    advantage (and not lose too much performance in other areas).

    You are missing my point. You are trying work around a problem with
    low level module design by rearranging high level architecture
    components.

    It sounds like your ALU stage is taking about 20 ns to do an ADD
    and that is having consequences that ripple through the design,
    like taking an extra clock for result forwarding,
    which causes performance issues when considering Compare And Branch,
    and would cause a stall with back-to-back operations.

    This goes back to module optimization where you said:

    BGB wrote:
    On 2/21/2025 1:51 PM, EricP wrote:

    and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:

    Fast and Area Efficient Adder for Wide Data in Recent Xilinx
    FPGAs, 2016
    http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf

    Errm, skim, this doesn't really look like something you can pull
    off in normal Verilog.

    Generally, one doesn't control over how the components hook
    together, only one can influence what happens based on how they
    write their Verilog.

    You can just write:
    reg[63:0] tValA;
    reg[63:0] tValB;
    reg[63:0] tValC;
    tValC=tValA+tValB;


    But, then it spits out something with a chain of 16 CARRY4's, so
    there is a fairly high latency on the high order bits of the
    result.

    It looks to me that Vivado intends that after you get your basic
    design working, this module optimization is *exactly* what one is
    supposed to do.

    In this case the prototype design establishes that you need multiple
    64-bit adders and the generic ones synthesis spits out are slow.
    So you isolate that module off, use Verilog to drive the basic LE
    selections, then iterate doing relative LE placement specifiers,
    route the module, and when you get the fastest 64-bit adder you can
    then lock down the netlist and save the module design.

    Now you have a plug-in 64-bit adder module that runs at (I don't know
    the speed difference between Virtex and your Spartan-7 so wild guess)
    oh, say, 4 ns, to use multiple places... fetch, decode, alu, agu.

    Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
    isolate that module, optimize placement, route, lock down netlist,
    and now you have a 5 ns plug-in ALU module.

    Doing this you build up your own IP library of optimized hardware
    modules.

    As more and more modules are optimized the system synthesis gets
    faster because much of the fine grain work and routing is already
    done.



    It sounds like your 1st hand FPGA design experience is VERY outdated.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Mon Feb 3 08:34:13 2025
    BGB <cr88192@gmail.com> writes:
    On 2/3/2025 12:55 AM, Anton Ertl wrote:
    Rather, have something like an explicit "__unaligned" keyword or
    similar, and then use the runtime call for these pointers.

    There are people who think that it is ok to compile *p to anything if
    p is not aligned, even on architectures that support unaligned
    accesses. At least one of those people recommended the use of
    memcpy(..., ..., sizeof(...)). Let's see what gcc produces on
    rv64gc (where unaligned accesses are guaranteed to work):

    [fedora-starfive:/tmp:111378] cat x.c
    #include <string.h>

    long uload(long *p)
    {
    long x;
    memcpy(&x,p,sizeof(long));
    return x;
    }
    [fedora-starfive:/tmp:111379] gcc -O -S x.c
    [fedora-starfive:/tmp:111380] cat x.s
    .file "x.c"
    .option nopic
    .text
    .align 1
    .globl uload
    .type uload, @function
    uload:
    addi sp,sp,-16
    lbu t1,0(a0)
    lbu a7,1(a0)
    lbu a6,2(a0)
    lbu a1,3(a0)
    lbu a2,4(a0)
    lbu a3,5(a0)
    lbu a4,6(a0)
    lbu a5,7(a0)
    sb t1,8(sp)
    sb a7,9(sp)
    sb a6,10(sp)
    sb a1,11(sp)
    sb a2,12(sp)
    sb a3,13(sp)
    sb a4,14(sp)
    sb a5,15(sp)
    ld a0,8(sp)
    addi sp,sp,16
    jr ra
    .size uload, .-uload
    .ident "GCC: (GNU) 10.3.1 20210422 (Red Hat 10.3.1-1)"
    .section .note.GNU-stack,"",@progbits

    Oh boy. Godbolt tells me that gcc-14.2.0 still does it the same way,
    whereas clang 9.0.0 and following produce

    [fedora-starfive:/tmp:111383] clang -O -S x.c
    [fedora-starfive:/tmp:111384] cat x.s
    .text
    .attribute 4, 16
    .attribute 5, "rv64i2p0_m2p0_a2p0_f2p0_d2p0_c2p0"
    .file "x.c"
    .globl uload # -- Begin function uload
    .p2align 1
    .type uload,@function
    uload: # @uload
    .cfi_startproc
    # %bb.0:
    ld a0, 0(a0)
    ret
    .Lfunc_end0:
    .size uload, .Lfunc_end0-uload
    .cfi_endproc
    # -- End function
    .ident "clang version 11.0.0 (Fedora 11.0.0-2.0.riscv64.fc33)"
    .section ".note.GNU-stack","",@progbits
    .addrsig

    If that is frequently used for unaligned p, this will be slow on the
    U74 and P550. Maybe SiFive should get around to implementing
    unaligned accesses more efficiently.

    Though "memcpy()" is usually a "simple to fix up" scenario.

    General memcpy where both operands may be unaligned in different ways
    is not particularly simple. This also shows up in the fact that Intel
    and AMD have failed to make REP MOVSB faster than software approaches
    for many cases when I last looked. Supposedly Intel has had another
    go at it, I should measure it again.
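
    For illustration, one common software strategy when the two operands are
    misaligned relative to each other (a sketch, not any particular library's
    memcpy; the 8-byte word size is an assumption) is to align the
    destination and let the loads be the unaligned side:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    void *copy_align_dst(void *dst, const void *src, size_t n)
    {
      unsigned char *d = dst;
      const unsigned char *s = src;

      /* head: byte copies until the destination is 8-byte aligned */
      while (n && ((uintptr_t)d & 7)) {
        *d++ = *s++;
        n--;
      }
      /* body: aligned 8-byte stores, possibly unaligned 8-byte loads */
      for (; n >= 8; n -= 8, d += 8, s += 8) {
        uint64_t w;
        memcpy(&w, s, 8);   /* compiler emits an unaligned load where legal */
        memcpy(d, &w, 8);   /* aligned store */
      }
      /* tail */
      while (n--)
        *d++ = *s++;
      return dst;
    }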

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Feb 3 13:49:46 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 2 Feb 2025 22:44:13 +0000, Chris M. Thomasson wrote:

    On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
    On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
    -------------
    I don't think there are line straddle consequences for coherence because >>>> there is no ordering guarantees for misaligned accesses.

    Generally stated as:: Misaligned accesses cannot be considered ATOMIC.

    Try it on an x86/x64. Straddle a l2 cache line and use it with a LOCK'ed
    RMW. It should assert the BUS lock.

    Consider this approach when you have a cabinet of slid in servers,
    each server having 128 cores, the cabinet being cache coherent,
    and the cabinet having 4096 cores.

    Can you say "it donna scale" ??

    We (3Leaf Systems) learned that the hard way 20 years ago. AMD and Intel processors
    will sometimes assert the BUS lock under high contention for a target cache line,
    even in cases where the access is aligned and doesn't straddle a page boundary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Mon Feb 3 16:23:00 2025
    On Mon, 03 Feb 2025 13:49:46 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 2 Feb 2025 22:44:13 +0000, Chris M. Thomasson wrote:

    On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
    On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
    -------------
    I don't think there are line straddle consequences for coherence
    because there is no ordering guarantees for misaligned accesses.


    Generally stated as:: Misaligned accesses cannot be considered
    ATOMIC.

    Try it on an x86/x64. Straddle a l2 cache line and use it with a
    LOCK'ed RMW. It should assert the BUS lock.

    Consider this approach when you have a cabinet of slid in servers,
    each server having 128 cores, the cabinet being cache coherent,
    and the cabinet having 4096 cores.

    Can you say "it donna scale" ??

    We (3Leaf Systems) learned that the hard way 20 years ago. AMD and
    Intel processors will sometimes assert the BUS lock under high
    contention for a target cache line, even in cases where the access is
    aligned and doesn't straddle a page boundary.


    According to my understanding, the last Intel or AMD processor that had a
    physical bus lock signal was released in Sep 2008. Likely not many are
    still operating, and even fewer are used in production.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Mon Feb 3 10:15:54 2025
    MitchAlsup1 wrote:
    On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:

    As you can see in the article below, the cost of NOT handling misaligned
    accesses in hardware is quite high in cpu clocks.

    To my eye, the incremental cost of adding hardware support for
    misaligned
    to the AGU and cache data path should be quite low. The alignment
    shifter
    is basically the same: assuming a 64-byte cache line, LD still has to
    shift any of the 64 bytes into position 0, and reverse for ST.

    A handful of gates to detect misalignedness and recognize the line and
    page crossing misalignments.

    The alignment shifters are twice as big.

    Oh, right, twice the muxes and wires but the critical path length
    should be the same - whatever a 64:1 mux is (3 gate delays?).
    So the larger aligner for misaligned shouldn't slow down the whole cache
    and penalize the normal aligned case.

    Now, while I accept these costs, I accept that others may not. I accept
    these costs because of the performance issues when I don't.

    The incremental cost is in a sequencer in the AGU for handling cache
    line and possibly virtual page straddles, and a small byte shifter to
    left shift the high order bytes. The AGU sequencer needs to know if the
    line straddles a page boundary, if not then increment the 6-bit physical
    line number within the 4 kB physical frame number, if yes then increment
    virtual page number and TLB lookup again and access the first line.
    (Slightly more if multiple page sizes are supported, but same idea.)
    For a load AGU merges the low and high fragments and forwards.

    I don't think there are line straddle consequences for coherence because
    there is no ordering guarantees for misaligned accesses.

    Generally stated as:: Misaligned accesses cannot be considered ATOMIC.

    That too (I thought of that after hitting send).
    What I was thinking of was: are there any coherence ordering issues if
    in order to take advantage of the cache's access pipeline,
    the AGU issues both accesses at once, low fragment first, high second,
    and the cache has hit-under-miss, and the low fragment misses while
    the high fragment hits, as the effect would be the equivalent of a
    LD-LD or ST-ST bypass.

    I don't immediately see a problem, but if there were then AGU would have
    to do each fragment synchronously which would double the access latency
    for misaligned loads.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Michael S on Mon Feb 3 11:31:37 2025
    Michael S wrote:
    On Mon, 03 Feb 2025 13:49:46 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 2 Feb 2025 22:44:13 +0000, Chris M. Thomasson wrote:

    On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
    On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
    -------------
    I don't think there are line straddle consequences for coherence
    because there is no ordering guarantees for misaligned accesses.

    Generally stated as:: Misaligned accesses cannot be considered
    ATOMIC.
    Try it on an x86/x64. Straddle a l2 cache line and use it with a
    LOCK'ed RMW. It should assert the BUS lock.
    Consider this approach when you have a cabinet of slid in servers,
    each server having 128 cores, the cabinet being cache coherent,
    and the cabinet having 4096 cores.

    Can you say "it donna scale" ??
    We (3Leaf Systems) learned that the hard way 20 years ago. AMD and
    Intel processors will sometimes assert the BUS lock under high
    contention for a target cache line, even in cases where the access is
    aligned and doesn't straddle a page boundary.


    According to my understanding, last Intel or AMD processor that had
    physical bus lock signal was released in Sep 2008. Likely not many
    still left operating and even fewer used in production.

    Both Intel and AMD current manuals refer to system wide bus locks under
    certain conditions, such as a LOCK RMW operation that straddles cache
    lines in order to guarantee backwards compatible system wide atomicity.
    Though the actual "bus locking" is likely done by broadcasting messages
    on the coherence network rather than a LOCK# wire that runs to all cores.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Mon Feb 3 16:43:25 2025
    On Mon, 3 Feb 2025 15:15:54 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:

    As you can see in the article below, the cost of NOT handling misaligned >>> accesses in hardware is quite high in cpu clocks.

    To my eye, the incremental cost of adding hardware support for
    misaligned
    to the AGU and cache data path should be quite low. The alignment
    shifter
    is basically the same: assuming a 64-byte cache line, LD still has to
    shift any of the 64 bytes into position 0, and reverse for ST.

    A handful of gates to detect misalignedness and recognize the line and
    page crossing misalignments.

    The alignment shifters are twice as big.

    Oh, right, twice the muxes and wires but the critical path length
    should be the same - whatever a 64:1 mux is (3 gate delays?).

    1 more gate of delay to double the shifter width.

    So the larger aligner for misaligned shouldn't slow down the whole cache
    and penalize the normal aligned case.

    Tag :: TLB comparison takes longer than shifting.

    Now, while I accept these costs, I accept that others may not. I accept
    these costs because of the performance issues when I don't.

    The incremental cost is in a sequencer in the AGU for handling cache
    line and possibly virtual page straddles, and a small byte shifter to
    left shift the high order bytes. The AGU sequencer needs to know if the
    line straddles a page boundary, if not then increment the 6-bit physical >>> line number within the 4 kB physical frame number, if yes then increment >>> virtual page number and TLB lookup again and access the first line.
    (Slightly more if multiple page sizes are supported, but same idea.)
    For a load AGU merges the low and high fragments and forwards.

    I don't think there are line straddle consequences for coherence because >>> there is no ordering guarantees for misaligned accesses.

    Generally stated as:: Misaligned accesses cannot be considered ATOMIC.

    That too (I thought of that after hitting send).
    What I was thinking of was: are there any coherence ordering issues if
    in order to take advantage of the cache's access pipeline,

    When you don't cross a cache line, you CAN make them ATOMIC.
    When you cross a page boundary, you realistically cannot always*.
    It all depends on where you want to draw the line.

    Supporting misaligned access serves everyone.
    Supporting misaligned ATOMICs serves no one.

    (*) Consider the case where the second LD takes a miss in the TLB
    and we have a 100+ cycle table walk. You could do something like
    take a microfault and rerun after reloading the TLB--but this, then,
    opens up a side channel because the TLB was updated before the causing instruction retires. And since ATOMIC falls into the category of "do
    it right is better than do it fast" you should not.

    Consider another case where ATOMIC crosses a line boundary, and the
    second line ends up with an ECC error ?!?

    There are so many side cases to consider that taking the whole lot of
    them and saying "no" is simply best. There is a good case for mis-
    aligned support, there is not such a case for misaligned ATOMICs.

    the AGU issues both accesses at once, low fragment first, high second,
    and the cache has hit-under-miss, and the low fragment misses while
    the high fragment hits, as the effect would be the equivalent of a
    LD-LD or ST-ST bypass.

    I don't immediately see a problem, but if there were then AGU would have
    to do each fragment synchronously which would double the access latency
    for misaligned loads.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Mon Feb 3 17:15:26 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    BGB <cr88192@gmail.com> writes:
    On 2/3/2025 12:55 AM, Anton Ertl wrote:
    Rather, have something like an explicit "__unaligned" keyword or
    similar, and then use the runtime call for these pointers.

    There are people who think that it is ok to compile *p to anything if
    p is not aligned, even on architectures that support unaligned
    accesses. At least one of those people recommended the use of
    memcpy(..., ..., sizeof(...)). Let's see what gcc produces on
    rv64gc (where unaligned accesses are guaranteed to work):

    [fedora-starfive:/tmp:111378] cat x.c
    #include <string.h>

    long uload(long *p)
    {
    long x;
    memcpy(&x,p,sizeof(long));
    return x;
    }
    [fedora-starfive:/tmp:111379] gcc -O -S x.c
    [fedora-starfive:/tmp:111380] cat x.s
    .file "x.c"
    .option nopic
    .text
    .align 1
    .globl uload
    .type uload, @function
    uload:
    addi sp,sp,-16
    lbu t1,0(a0)

    [...]

    With RISC-V, nobody ever knows what architecture he is compiling for...

    Did you tell gcc specifically that unaligned access was supported in
    the architecture you were using?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Mon Feb 3 12:46:14 2025
    Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 2/2/2025 10:45 AM, EricP wrote:
    Digging deeper with performance counters reveals executing each unaligned load instruction results in ~505 executed instructions. P550 almost
    certainly doesn’t have hardware support for unaligned accesses.
    Rather, it’s likely raising a fault and letting an operating system
    handler emulate it in software."

    An emulation fault, or something similarly nasty...


    At that point, even turning any potentially unaligned load or store into
    a runtime call is likely to be a lot cheaper.

    There are lots of potentially unaligned loads and stores. There are
    very few actually unaligned loads and stores: On Linux-Alpha every
    unaligned access is logged by default, and the number of
    unaligned-access entries in the logs of our machines was relatively
    small (on average a few per day). So trapping actual unaligned
    accesses was faster than replacing potential unaligned accesses with
    code sequences that synthesize the unaligned access from aligned
    accesses.

    Of course, if the cost of unaligned accesses is that high, you will
    avoid them in cases like block copies where cheap unaligned accesses
    would otherwise be beneficial.

    - anton

    That is fine for code that is being actively maintained and backward
    data structure compatibility is not required (like those inside a kernel).

    However for x86 there were a few billion lines of legacy code that likely assumed 2-byte alignment, or followed the fp64 aligned to 32-bits advice,
    and a C language that mandates structs be laid out in memory exactly as specified (no automatic struct optimization). Also I seem to recall some
    amount of squawking about SIMD when it required naturally aligned buffers.
    As SIMD no longer requires alignment, presumably code no longer does so.

    Also in going from 32 to 64 bits, data structures that contain pointers
    now could find those 8-byte pointers aligned on 4-byte boundaries.

    While the Linux kernel may not use many misaligned values,
    I'd guess there is a lot of application code that does.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Mon Feb 3 17:54:16 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Michael S wrote:
    On Mon, 03 Feb 2025 13:49:46 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 2 Feb 2025 22:44:13 +0000, Chris M. Thomasson wrote:

    On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
    On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
    -------------
    I don't think there are line straddle consequences for coherence because there are no ordering guarantees for misaligned accesses.
    Generally stated as:: Misaligned accesses cannot be considered
    ATOMIC.
    Try it on an x86/x64. Straddle a l2 cache line and use it with a
    LOCK'ed RMW. It should assert the BUS lock.
    Consider this approach when you have a cabinet of slid in servers,
    each server having 128 cores, the cabinet being cache coherent,
    and the cabinet having 4096 cores.

    Can you say "it donna scale" ??
    We (3Leaf Systems) learned that the hard way 20 years ago. AMD and
    Intel processors will sometimes assert the BUS lock under high
    contention for a target cache line, even in cases where the access is
    aligned and doesn't straddle a page boundary.


    According to my understanding, last Intel or AMD processor that had
    physical bus lock signal was released in Sep 2008. Likely not many
    still left operating and even fewer used in production.

    Both Intel and AMD current manuals refer to system wide bus locks under certain conditions, such as a LOCK RMW operation that straddles cache
    lines in order to guarantee backwards compatible system wide atomicity. Though the actual "bus locking" is likely done by broadcasting messages
    on the coherence network rather than a LOCK# wire that runs to all cores.


    Indeed. Our (3Leaf) ASIC was connected to an HT port on the AMD
    processors and the QPI port on the Intel processors. As we were
    extending the coherency domain across QDR infiniband, a system wide
    bus lock seriously degraded performance. 800ns r/t for the HT
    version (IB DDR), about 400ns r/t for the QPI version (IB QDR
    with better cut-through latency in the IB switch), with a large
    line cache on the ASIC (and the ability to cache entire pages
    in local DRAM).

    This was in the 2004-2010 timeframe.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Mon Feb 3 19:41:10 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    That is fine for code that is being actively maintained and backward
    data structure compatibility is not required (like those inside a kernel).

    However for x86 there was a few billion lines of legacy code that likely assumed 2-byte alignment, or followed the fp64 aligned to 32-bits advice,
    and a C language that mandates structs be laid out in memory exactly as specified (no automatic struct optimization). Also I seem to recall some amount of squawking about SIMD when it required naturally aligned buffers.
    As SIMD no longer requires alignment, presumably code no longer does so.

    Looking at Intel's optimization manual, they state in
    "15.6 DATA ALIGNMENT FOR INTEL® AVX"

    "Assembly/Compiler Coding Rule 65. (H impact, M generality) Align
    data to 32-byte boundary when possible. Prefer store alignment
    over load alignment."

    and further down, about AVX-512,

    "18.23.1 Align Data to 64 Bytes"

    "Aligning data to vector length is recommended. For best results,
    when using Intel AVX-512 instructions, align data to 64 bytes.

    When doing a 64-byte Intel AVX-512 unaligned load/store, every
    load/store is a cache-line split, since the cache-line is 64
    bytes. This is double the cache line split rate of Intel AVX2
    code that uses 32-byte registers. A high cache-line split rate in memory-intensive code can cause poor performance."

    This sounds reasonable, and good advice if you want to go
    down SIMD lane.
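
    For concreteness, a small C illustration (mine, not from the manual) of
    following that advice: give hot buffers cache-line (64-byte) alignment so
    vector-width accesses never split a line. The names are made up for the
    example; alignas and aligned_alloc are standard C11.

    #include <stdalign.h>
    #include <stdlib.h>

    static alignas(64) float coeffs[1024];   /* static buffer aligned to a cache line */

    float *make_buffer(size_t n)
    {
        /* C11 aligned_alloc requires the size to be a multiple of the alignment */
        size_t bytes = ((n * sizeof(float)) + 63) & ~(size_t)63;
        return aligned_alloc(64, bytes);
    }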

    Also in going from 32 to 64 bits, data structures that contain pointers
    now could find those 8-byte pointers aligned on 4-byte boundaries.

    This is mandated by the relevant ABI, and ABIs usually mandate
    alignment on natural boundaries.


    While the Linux kernel may not use many misaligned values,
    I'd guess there is a lot of application code that does.

    Unless it is generating external binary data (a _very_ bad idea,
    XDR was developed for a reason), there is no big reason to use
    unaligned data, unless somebody is playing fast and loose
    with C pointer types, and that is a bad idea anyway.

    Alternatively, a compiler could use it to implement something like
    memcpy or memmove when it knows that unaligned accesses are safe.
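
    A sketch of that idea, under the assumption that the target makes unaligned
    accesses cheap: a word-at-a-time copy loop whose per-word memcpy calls
    compile down to single 8-byte loads and stores regardless of the alignment
    of src and dst. It is memcpy-like only; overlap is not handled.

    #include <stdint.h>
    #include <string.h>

    void copy_bytes(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        while (n >= 8) {            /* 8 bytes at a time, alignment ignored */
            uint64_t w;
            memcpy(&w, s, 8);
            memcpy(d, &w, 8);
            s += 8; d += 8; n -= 8;
        }
        while (n--)                 /* byte tail */
            *d++ = *s++;
    }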

    But it would be really interesting to have access to a system
    where unaligned accesses trap, in order to find (and fix) ABI
    issues and some undefined behavior on the C side.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Mon Feb 3 21:03:45 2025
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The OS must also be able to keep both pages in physical memory until
    the access is complete, or there will be no progress. Should not be a
    problem these days, but the 48 pages or so potentially needed by VAX
    complicated the OS.

    48 pages? What instruction would need that?

    I've seen it somewhere but don't remember where:

    One candidate would be the POLY (spelling?) polynomial evaluator with
    all the arguments (indirectly?) loaded from misaligned addresses, all straddling page boundaries?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Feb 3 21:04:37 2025
    According to Terje Mathisen <terje.mathisen@tmsw.no>:
    48 pages? What instruction would need that?

    I've seen it somewhere but don't remember where:

    One candidate would be the POLY (spelling?) polynomial evaluator with
    all the arguments (indirectly?) loaded from misaligned addresses, all straddling page boundaries?

    No, POLY only had three arguments, the argument, the degree, and the
    table of multipliers. The table could be arbitrarily long but the
    instruction was restartable, saving the partial result on the stack
    and setting the FPD (first part done) flag for when it resumes so it
    only had to be able to load one table entry at a time.

    MOVTC or MOVTUC were the worst, with six arguments, all of which could
    have an indirect address and five of which could cross page
    boundaries.

    But it occurs to me that those instructions are also restartable, so
    that only a single byte of the source and destination arguments need
    to be addressable at a time. There are six possible indirect addresses
    which can cross page boundaries for 12 pages, two lengths and a table
    that can cross a page boundary for six more, and the source and
    destination and fill, three more, and the instruction, two more.
    That's a total of 23 pages, double it for the P0 or P1 page tables,
    and it's only 46 pages.

    That's still kind of a lot.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Mon Feb 3 21:25:58 2025
    On Mon, 3 Feb 2025 21:04:37 +0000, John Levine wrote:

    According to Terje Mathisen <terje.mathisen@tmsw.no>:
    48 pages? What instruction would need that?

    I've seen it somewhere but don't remember where:

    One candidate would be the POLY (spelling?) polynomial evaluator with
    all the arguments (indirectly?) loaded from misaligned addresses, all straddling page boundaries?

    No, POLY only had three arguments, the argument, the degree, and the
    table of multipliers. The table could be arbitrarily long but the instruction was restartable, saving the partial result on the stack
    and setting the FPD (first part done) flag for when it resumes so it
    only had to be able to load one table entry at a time.

    MOVTC or MOVTUC were the worst, with six arguments, all of which could
    have an indirect address and five of which could cross page
    boundaries.

    But it occurs to me that those instructions are also restartable, so
    that only a single byte of the source and destination arguments need
    to be addressable at a time. There's six possible indirect adddresses
    which can cross page boundaries for 12 pages, two lengths and a table
    that can cross a page boundary for six more, and the source and
    destination and fill, three more, and the instruction, two more.
    That's a total of 23 pages, double it for the P0 or P1 page tables,
    and it's only 46 pages.

    That's still kind of a lot.

    Basically, VAX taught us why we did not want to do "all that" in
    a single instruction; while Intel 432 taught us why we did not want
    bit-aligned decoders (and a lot of other things).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Mon Feb 3 21:57:37 2025
    On Mon, 3 Feb 2025 21:40:21 +0000, BGB wrote:

    On 2/3/2025 1:41 PM, Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    That is fine for code that is being actively maintained and backward
    data structure compatibility is not required (like those inside a
    kernel).

    However for x86 there was a few billion lines of legacy code that likely assumed 2-byte alignment, or followed the fp64 aligned to 32-bits
    advice,
    and a C language that mandates structs be laid out in memory exactly as
    specified (no automatic struct optimization). Also I seem to recall some amount of squawking about SIMD when it required naturally aligned
    buffers.
    As SIMD no longer requires alignment, presumably code no longer does so.

    Looking at Intel's optimization manual, they state in
    "15.6 DATA ALIGNMENT FOR INTEL® AVX"

    "Assembly/Compiler Coding Rule 65. (H impact, M generality) Align
    data to 32-byte boundary when possible. Prefer store alignment
    over load alignment."

    and further down, about AVX-512,

    "18.23.1 Align Data to 64 Bytes"

    "Aligning data to vector length is recommended. For best results,
    when using Intel AVX-512 instructions, align data to 64 bytes.

    When doing a 64-byte Intel AVX-512 unaligned load/store, every
    load/store is a cache-line split, since the cache-line is 64
    bytes. This is double the cache line split rate of Intel AVX2
    code that uses 32-byte registers. A high cache-line split rate in
    memory-intensive code can cause poor performance."

    This sounds reasonable, and good advice if you want to go
    down SIMD lane.


    This is, ironically, a place where SIMD via ganged registers has an
    advantage over SIMD via large monolithic registers.

    Ironically^2 is that vVM allows each implementation to decide on how
    many and how wide the SIMD registers are. LBI/O might have 8 128-bit
    flip-flops, while GBOoO might have 32 512-bit flip-flops. All running
    the same binary, and all running that same binary as fast as any binary
    that that machine could run; in addition, HW looks at the loop index
    (and possibly predication) to create masks on the lanes of execution.

    SW ASCII should describe the calculations to be performed,
    The compiler should produce a vVM loop for those calculations
    in loops.
    Any implementation should run that vVM loop as fast as it can.

    See, no change to ISA and you still get 98% of the SIMD you know and love,
    and the number and width of the registers is an implementation variable!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Feb 4 01:49:04 2025
    According to MitchAlsup1 <mitchalsup@aol.com>:
    That's a total of 23 pages, double it for the P0 or P1 page tables,
    and it's only 46 pages.

    That's still kind of a lot.

    Basically, VAX taught us why we did not want to do "all that" in
    a single instruction;

    Yes, VAX was brilliantly optimized for hand-coded assembler on a very
    memory constrained system where microcode was much faster than main
    memory. Too bad that was obsolete by the time they shipped it.

    while Intel 432 taught us why we did not bit
    aligned decoders (and a lot of other things).

    It was certainly an interesting experiment in yet another way that
    Intel wanted programmers to use their computers and the programmers
    said, naah.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Mon Feb 3 23:49:57 2025
    MitchAlsup1 wrote:

    Basically, VAX taught us why we did not want to do "all that" in
    a single instruction; while Intel 432 taught us why we did not bit
    aligned decoders (and a lot of other things).

    In case people are interested...

    [paywalled]
    The Instruction Decoding Unit for the VLSI 432 General Data Processor, 1981 https://ieeexplore.ieee.org/abstract/document/1051633/

    The benchmarks in table 1(a) below tell it all:
    a 4 MHz 432 runs at 1/15 to 1/20 the speed of a 5 MHz VAX/780,
    and at 1/4 to 1/7 the speed of an 8 MHz 68000 or 5 MHz 8086

    A Performance Evaluation of The Intel iAPX 432, 1982 https://dl.acm.org/doi/pdf/10.1145/641542.641545

    And the reasons are covered here:

    Performance Effects of Architectural Complexity in the Intel 432, 1988 https://www.princeton.edu/~rblee/ELE572Papers/Fall04Readings/I432.pdf

    Bob Colwell, one of the authors of the third paper, later joined
    Intel as a senior architect and was involved in the development of the
    P6 core used in the Pentium Pro, Pentium II, and Pentium III microprocessors, and designs derived from it are used in the Pentium M, Core Duo and
    Core Solo, and Core 2.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Mon Feb 10 12:23:52 2025
    On Sat, 08 Feb 2025 08:11:04 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    keylocks3.c compiles without warning on clang, but the result usually segfaults (but sometime does not, e.g., in the timed run on Zen4; it segfaults in other runs on Zen4). I have not investigated why this
    happens, I just did not include results from runs where it segfaulted;
    and I tried additional runs for keylocks3-512 on Zen4 in order to have
    one result there.


    I got random crashes both on keylocks3.c and on keylocks4.c.
    My test bench includes testing of 1024 combinations of key and lock
    lengths, so it likely catches some problematic cases in keylocks4.c
    that you overlooked.
    The fix appears to be addition of alignment requirement for locks1[].
    _Alignas(UNROLL*4) int locks1[nlocks1];
    Cleaner solution is to declare locks1[] as
    vu locks1[nlocks1/UNROLL];
    But I wanted to modify your code as little as possible.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Stump@21:1/5 to Anton Ertl on Mon Feb 10 18:47:24 2025
    In article <2025Feb4.190738@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    http://gcc.gnu.org/bugzilla is your friend.

    In my experience it's a waste of time:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25285

    Customer still unhappy. :-(

    Yeah, I never close bugs I don't want to work on. I always just left
    them there, maybe when I'm bored, maybe someone else wants to do the
    work, maybe someone does a theoretic fixup and handles the whole
    problem instead and magically, things get better.

    Yeah, bug reporting can be hit or miss at times.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike Stump@21:1/5 to tkoenig@netcologne.de on Mon Feb 10 19:06:58 2025
    In article <vntrnt$205ld$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93765

    That is stuck in WAITING.

    Pushed it along.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Mon Feb 17 16:19:18 2025
    On Mon, 17 Feb 2025 9:37:57 +0000, Terje Mathisen wrote:

    Marcus wrote:
    On 2025-02-03, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 2/2/2025 10:45 AM, EricP wrote:
    Digging deeper with performance counters reveals executing each
    unaligned
    load instruction results in ~505 executed instructions. P550 almost
    certainly doesn’t have hardware support for unaligned accesses. Rather, it’s likely raising a fault and letting an operating system
    handler emulate it in software."


    An emulation fault, or something similarly nasty...


    At that point, even turning any potentially unaligned load or store into >>>> a runtime call is likely to be a lot cheaper.

    There are lots of potentially unaligned loads and stores.  There are
    very few actually unaligned loads and stores: On Linux-Alpha every
    unaligned access is logged by default, and the number of
    unaligned-access entries in the logs of our machines was relatively
    small (on average a few per day).  So trapping actual unaligned
    accesses was faster than replacing potential unaligned accesses with
    code sequences that synthesize the unaligned access from aligned
    accesses.

    If you compile regular C/C++ code that does not intentionally do any
    nasty stuff, you will typically have zero unaligned loads or stores.

    My machine still does not support unaligned accesses in hardware (it's
    on the todo list), and it can run an awful lot of software without
    problems.

    The problem arises when the programmer *deliberately* does unaligned
    loads and stores in order to improve performance. Or rather, if the
    programmer knows that the hardware supports unaligned loads and stores,
    he/she can use that to write faster code in some special cases.

    No, the real problem is when a compiler wants to auto-vectorize any code working with 1/2/4/8 byte items: all of a sudden the alignment
    requirement went from the item stride to the vector register stride
    (16/32/64 bytes).

    If you provide misaligned access to SIMD registers, why not provide
    misaligned access to all memory references !?!

    I made this argument several times in my career.

    The only way this can work is to have the compiler control _all_
    allocations to make sure they are properly aligned, including code in libraries, or the compiler will be forced to use vector load/store
    operations which do allow unaligned access.

    Either the entire environment has to be "air tight" or the HW
    provides misaligned access at low cost. {{Good luck on the air
    tight thing...}}

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Mon Feb 17 18:34:03 2025
    MitchAlsup1 wrote:
    On Mon, 17 Feb 2025 9:37:57 +0000, Terje Mathisen wrote:

    Marcus wrote:
    On 2025-02-03, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 2/2/2025 10:45 AM, EricP wrote:
    Digging deeper with performance counters reveals executing each
    unaligned
    load instruction results in ~505 executed instructions. P550 almost certainly doesn’t have hardware support for unaligned accesses.
    Rather, it’s likely raising a fault and letting an operating system
    handler emulate it in software."


    An emulation fault, or something similarly nasty...


    At that point, even turning any potentially unaligned load or store >>>>> into
    a runtime call is likely to be a lot cheaper.

    There are lots of potentially unaligned loads and stores.  There are >>>> very few actually unaligned loads and stores: On Linux-Alpha every
    unaligned access is logged by default, and the number of
    unaligned-access entries in the logs of our machines was relatively
    small (on average a few per day).  So trapping actual unaligned
    accesses was faster than replacing potential unaligned accesses with
    code sequences that synthesize the unaligned access from aligned
    accesses.

    If you compile regular C/C++ code that does not intentionally do any
    nasty stuff, you will typically have zero unaligned loads stores.

    My machine still does not support unaligned accesses in hardware (it's
    on the todo list), and it can run an awful lot of software without
    problems.

    The problem arises when the programmer *deliberately* does unaligned
    loads and stores in order to improve performance. Or rather, if the
    programmer knows that the hardware supports unaligned loads and stores,
    he/she can use that to write faster code in some special cases.

    No, the real problem is when a compiler want to auto-vectorize any code
    working with 1/2/4/8 byte items: All of a sudden the alignment
    requirement went from the item stride to the vector register stride
    (16/32/64 bytes).

    If you provide misaligned access to SIMD registers, why not provide misaligned access to all memory references !?!

    I made this argument several times in my career.

    The only way this can work is to have the compiler control _all_
    allocations to make sure they are properly aligned, including code in
    libraries, or the compiler will be forced to use vector load/store
    operations which do allow unaligned access.

    Either the entire environment has to be "air tight" or the HW
    provides misaligned access at low cost. {{Good luck on the air
    tight thing...}}

    This is just one of many details where we've agreed for a decade or two (three?). Some of them you persuaded me you were right, I don't remember
    any obvious examples of the opposite, but most we figured out
    independently. :-)

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Mon Feb 17 20:01:35 2025
    On Mon, 17 Feb 2025 17:34:03 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Mon, 17 Feb 2025 9:37:57 +0000, Terje Mathisen wrote:

    Marcus wrote:
    On 2025-02-03, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 2/2/2025 10:45 AM, EricP wrote:
    Digging deeper with performance counters reveals executing each
    unaligned
    load instruction results in ~505 executed instructions. P550 almost certainly doesn’t have hardware support for unaligned accesses.
    Rather, it’s likely raising a fault and letting an operating system
    handler emulate it in software."


    An emulation fault, or something similarly nasty...


    At that point, even turning any potentially unaligned load or store >>>>>> into
    a runtime call is likely to be a lot cheaper.

    There are lots of potentially unaligned loads and stores.  There are >>>>> very few actually unaligned loads and stores: On Linux-Alpha every
    unaligned access is logged by default, and the number of
    unaligned-access entries in the logs of our machines was relatively
    small (on average a few per day).  So trapping actual unaligned
    accesses was faster than replacing potential unaligned accesses with >>>>> code sequences that synthesize the unaligned access from aligned
    accesses.

    If you compile regular C/C++ code that does not intentionally do any
    nasty stuff, you will typically have zero unaligned loads stores.

    My machine still does not support unaligned accesses in hardware (it's >>>> on the todo list), and it can run an awful lot of software without
    problems.

    The problem arises when the programmer *deliberately* does unaligned
    loads and stores in order to improve performance. Or rather, if the
    programmer knows that the hardware supports unaligned loads and stores, >>>> he/she can use that to write faster code in some special cases.

    No, the real problem is when a compiler want to auto-vectorize any code
    working with 1/2/4/8 byte items: All of a sudden the alignment
    requirement went from the item stride to the vector register stride
    (16/32/64 bytes).

    If you provide misaligned access to SIMD registers, why not provide
    misaligned access to all memory references !?!

    I made this argument several times in my career.

    The only way this can work is to have the compiler control _all_
    allocations to make sure they are properly aligned, including code in
    libraries, or the compiler will be forced to use vector load/store
    operations which do allow unaligned access.

    Either the entire environment has to be "air tight" or the HW
    provides misaligned access at low cost. {{Good luck on the air
    tight thing...}}

    This is just one of many details where we've agreed for a decade or two (three?). Some of them you persuaded me you were right, I don't remember
    any obvious examples of the opposite, but most we figured out
    independently. :-)

    Although I cannot name a given thing where your argument permanently
    swayed my opinion, I am sure that there are many cases where you have!

    Terje


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Feb 18 02:55:33 2025
    On Tue, 18 Feb 2025 1:00:18 +0000, BGB wrote:

    On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
    ------------
    It would take LESS total man-power world-wide and over-time to
    simply make HW perform misaligned accesses.


    I think the usual issue is that on low-end hardware, it is seen as
    "better" to skip out on misaligned access in order to save some cost in
    the L1 cache.

    Though I'm not sure how this mixes with 16/32-bit ISAs: if one allows misaligned 32-bit instructions, and a misaligned 32-bit instruction can
    cross a cache-line boundary, one still has to deal with essentially the
    same issues.

    Strategy for low end processors::
    a) detect misalignment in AGEN
    b) when misaligned, AGEN takes 2 cycles for the two addresses
    c) when misaligned, DC is accessed twice
    d) When misaligned, LD align is performed twice to merge data
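
    For completeness, a software model of the same split-into-two-aligned-
    accesses approach applied to the store side (an illustrative sketch only;
    real hardware would use byte write enables rather than a read-modify-write).
    Little-endian byte order and 8-byte containers are assumed.

    #include <stdint.h>
    #include <string.h>

    static void misaligned_store64(unsigned char *p, uint64_t v)
    {
        unsigned off = (unsigned)((uintptr_t)p & 7);
        unsigned char *base = p - off;

        if (off == 0) {                       /* aligned: a single access suffices */
            memcpy(base, &v, 8);
            return;
        }

        uint64_t dlo, dhi;                    /* the two containers touched */
        memcpy(&dlo, base, 8);
        memcpy(&dhi, base + 8, 8);

        uint64_t keep_lo = (~(uint64_t)0) >> (8 * (8 - off));  /* bytes below the store */
        uint64_t keep_hi = (~(uint64_t)0) << (8 * off);        /* bytes above the store */

        dlo = (dlo & keep_lo) | (v << (8 * off));
        dhi = (dhi & keep_hi) | (v >> (8 * (8 - off)));

        memcpy(base, &dlo, 8);                /* two aligned write-backs */
        memcpy(base + 8, &dhi, 8);
    }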

    Another related thing I can note is internal store-forwarding within the
    L1 D$ to avoid RAW and WAW penalties for multiple accesses to the same
    cache line.

    IMHO:: Low end processors should not be doing ST->LD forwarding.

    ---------------------

    Say, it is less convoluted to do, say:
    MOV.X R24, (SP, 0)
    MOV.X R26, (SP, 16)
    MOV.X R28, (SP, 32)
    MOV.X R30, (SP, 48)

    These still look like LDs to me.

    -----------------
    Then again, I have heard that apparently there are libraries that rely
    on the global-rounding-mode behavior, but I have also heard of such
    libraries having issues or non-determinism when mixed with other
    libraries which try to set a custom rounding mode when these modes
    disagree.


    I prefer my strategy instead:
    FADD/FSUB/FMUL:
    Hard-wired Round-Nearest / RNE.
    Does not modify FPU flags.

    It takes Round Nearest Odd to perform Kahan-Babuška Summation.
    That is:: comply with IEEE 754-2019

    FADDG/FSUBG/FMULG:
    Dynamic Rounding;
    May modify FPU flags.
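
    As an aside on what "dynamic rounding" means to portable software: in C the
    rounding mode is global floating-point environment state, changed through
    <fenv.h>. A hedged sketch (FE_UPWARD and the FENV_ACCESS pragma are only
    honoured where the platform supports them):

    #include <fenv.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    int main(void)
    {
        volatile double x = 1.0, y = 3.0;

        fesetround(FE_TONEAREST);   /* the usual default: round to nearest, ties to even */
        printf("%.20f\n", x / y);

        fesetround(FE_UPWARD);      /* dynamic change; affects subsequent operations */
        printf("%.20f\n", x / y);

        fesetround(FE_TONEAREST);   /* restore */
        return 0;
    }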

    Can note that RISC-V burns 3 bits for FPU instructions always encoding a rounding mode (whereas in my ISA, encoding a rounding mode other than
    RNE or DYN requires a 64-bit encoding).

    Oh what fun, another RISC-V encoding mistake...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Michael S on Mon Feb 24 11:52:38 2025
    Michael S wrote:
    On Sun, 23 Feb 2025 11:13:53 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    It looks to me that Vivado intends that after you get your basic
    design working, this module optimization is *exactly* what one is
    supposed to do.

    In this case the prototype design establishes that you need multiple
    64-bit adders and the generic ones synthesis spits out are slow.
    So you isolate that module off, use Verilog to drive the basic LE
    selections, then iterate doing relative LE placement specifiers,
    route the module, and when you get the fastest 64-bit adder you can
    then lock down the netlist and save the module design.

    Now you have a plug-in 64-bit adder module that runs at (I don't know
    the speed difference between Virtex and your Spartan-7 so wild guess)
    oh, say, 4 ns, to use multiple places... fetch, decode, alu, agu.

    Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
    isolate that module, optimize placement, route, lock down netlist,
    and now you have a 5 ns plug-in ALU module.

    Doing this you build up your own IP library of optimized hardware
    modules.

    As more and more modules are optimized the system synthesis gets
    faster because much of the fine grain work and routing is already
    done.



    It sounds like your 1st hand FPGA design experience is VERY outdated.

    Never have, likely never will.
    Nothing against them - looks easier than wire-wrapping TTL and 4000 CMOS. Though people do seem to spend an awful lot of time working around
    certain deficiencies like the lack of >1 write ports on register files,
    and the lack of CAM's. One would think market forces would induce
    at least one supplier to add these and take the fpga market by storm.

    Also fpga's do seem prone to monopolistic locked-in pricing
    (though not really different from any relational database vendor).
    At least with TTL one could do an RFQ to 5 or 10 different suppliers.

    I'm just trying to figure out what these other folks are doing to get
    bleeding edge performance from essentially the same tools and similar chips.

    I assume you are referring to the gui IDE interface for things like
    floor planning where you click on a LE cells and set some attributes.
    I also think I saw reference to locking down parts of the net list.
    But there are a lot of documents to go through.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to EricP on Mon Feb 24 19:28:13 2025
    On Mon, 24 Feb 2025 11:52:38 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Sun, 23 Feb 2025 11:13:53 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    It looks to me that Vivado intends that after you get your basic
    design working, this module optimization is *exactly* what one is
    supposed to do.

    In this case the prototype design establishes that you need
    multiple 64-bit adders and the generic ones synthesis spits out
    are slow. So you isolate that module off, use Verilog to drive the
    basic LE selections, then iterate doing relative LE placement
    specifiers, route the module, and when you get the fastest 64-bit
    adder you can then lock down the netlist and save the module
    design.

    Now you have a plug-in 64-bit adder module that runs at (I don't
    know the speed difference between Virtex and your Spartan-7 so
    wild guess) oh, say, 4 ns, to use multiple places... fetch,
    decode, alu, agu.

    Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
    isolate that module, optimize placement, route, lock down netlist,
    and now you have a 5 ns plug-in ALU module.

    Doing this you build up your own IP library of optimized hardware
    modules.

    As more and more modules are optimized the system synthesis gets
    faster because much of the fine grain work and routing is already
    done.



    It sounds like your 1st hand FPGA design experience is VERY
    outdated.

    Never have, likely never will.
    Nothing against them - looks easier than wire-wrapping TTL and 4000
    CMOS. Though people do seem to spend an awful lot of time working
    around certain deficiencies like the lack of >1 write ports on
    register files, and the lack of CAM's. One would think market forces
    would induce at least one supplier to add these and take the fpga
    market by storm.


    Your view is probably skewed by talking to soft core hobbyists.
    Please realize that most professionals do not care about
    high-performance soft core. Soft core is for control plane functions
    rather than for data plane. Important features are ease of use,
    reliability, esp. of software tools and small size. Performance is
    rated low. Performance per clock is rated even lower. So, professionals
    do not develop soft cores by themselves. And the OTS cores that they use
    are not superscalar, quite often not even fully pipelined.
    It means, no, small SRAM banks with two independent write ports are not
    a feature that FPGA pros would be excited about.

    Also fpga's do seem prone to monopolistic locked-in pricing
    (though not really different from any relational database vendor).

    Cheap Chinese clones of X&A FPGAs from late 2000s and very early 2010s certainly exist. I didn't encounter Chinese clones of slightly newer
    devices, like Xilinx 7-series. But I didn't look hard for them. So,
    wouldn't be surprised if they exist, too.
    Right now, and for almost a full decade back, neither X nor A cares about the low
    end. They just continue to ship old chips, mostly charging the old price or
    raising it a little.

    At least with TTL one could do an RFQ to 5 or 10 different suppliers.

    I'm just trying to figure out what these other folks are doing to get bleeding edge performance from essentially the same tools and similar
    chips.

    I assume you are referring to the gui IDE interface for things like
    floor planning where you click on a LE cells and set some attributes.
    I also think I saw reference to locking down parts of the net list.
    But there are a lot of documents to go through.


    No, I mean floorplanning, as well as most other manual physical-level optimizations, are not used at all in 99% of FPGA designs that
    started after year 2005.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Mon Feb 24 22:33:03 2025
    On Mon, 24 Feb 2025 19:52:49 +0000, Robert Finch wrote:

    On 2025-02-24 12:28 p.m., Michael S wrote:
    On Mon, 24 Feb 2025 11:52:38 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Sun, 23 Feb 2025 11:13:53 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    It looks to me that Vivado intends that after you get your basic
    design working, this module optimization is *exactly* what one is
    supposed to do.

    In this case the prototype design establishes that you need
    multiple 64-bit adders and the generic ones synthesis spits out
    are slow. So you isolate that module off, use Verilog to drive the
    basic LE selections, then iterate doing relative LE placement
    specifiers, route the module, and when you get the fastest 64-bit
    adder you can then lock down the netlist and save the module
    design.

    Now you have a plug-in 64-bit adder module that runs at (I don't
    know the speed difference between Virtex and your Spartan-7 so
    wild guess) oh, say, 4 ns, to use multiple places... fetch,
    decode, alu, agu.

    Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
    isolate that module, optimize placement, route, lock down netlist,
    and now you have a 5 ns plug-in ALU module.

    Doing this you build up your own IP library of optimized hardware
    modules.

    As more and more modules are optimized the system synthesis gets
    faster because much of the fine grain work and routing is already
    done.



    It sounds like your 1st hand FPGA design experience is VERY
    outdated.

    Never have, likely never will.
    Nothing against them - looks easier than wire-wrapping TTL and 4000
    CMOS. Though people do seem to spend an awful lot of time working
    around certain deficiencies like the lack of >1 write ports on
    register files, and the lack of CAM's. One would think market forces
    would induce at least one supplier to add these and take the fpga
    market by storm.


    Your view is probably skewed by talking to soft core hobbyists.
    Please realize that most professionals do not care about
    high-performance soft core. Soft core is for control plane functions
    rather than for data plane. Important features are ease of use,
    reliability, esp. of software tools and small size. Performance is
    rated low. Performance per clock is rated even lower. So, professional
    do not develop soft cores by themselves. And OTS cores that they use
    are not superscalar. Quite often not even fully pipelined.
    It means, no, small SRAM banks with two independent write ports is not
    a feature that FPGA pros would be excited about.

    Also fpga's do seem prone to monopolistic locked-in pricing
    (though not really different from any relational database vendor).

    Cheap Chinese clones of X&A FPGAs from late 2000s and very early 2010s
    certainly exist. I didn't encounter Chinese clones of slightly newer
    devices, like Xilinx 7-series. But I didn't look hard for them. So,
    wouldn't be surprised if they exist, too.
    Right now, and almost full decade back, neither X nor A cares about low
    end. They just continue to ship old chips, mostly charging old price or
    rising a little.

    At least with TTL one could do an RFQ to 5 or 10 different suppliers.

    I'm just trying to figure out what these other folks are doing to get
    bleeding edge performance from essentially the same tools and similar
    chips.

    I assume you are referring to the gui IDE interface for things like
    floor planning where you click on a LE cells and set some attributes.
    I also think I saw reference to locking down parts of the net list.
    But there are a lot of documents to go through.


    No, I mean florplanning, as well as most other manual physical-level
    optimization are not used at all in 99% percents of FPGA designs that
    started after year 2005.

    Respecting that I do not know that much about the work environment of FPGA developers:
    I have thought of FPGAs as more of a prototyping tool, or to be used in one-off designs, proof-of-concept type things. In those cases one
    probably does not care too much about manual operations; as was said, one would be more interested in the productivity of developers that comes from reliable tools and being able to deal with things at a high level.

    The vendors have a number of pre-made components that can be plugged
    into a design, making it possible to sketch out a design very quickly,
    with a couple of caveats. One being that one might be stuck with a particular vendor.

    CAMs can easily be implemented in FPGAs although they may have
    multi-cycle latency.

    A CAM is a vector of XOR gate inputs that feed an AND gate.

    A 5-bit CAM with valid bit is 3-gates in CMOS and 2-gates of delay.
    It is only when there are lots of bits being CAMed does the latency
    increase markedly -- OR when there are lots of entries being CAMed
    but this is a FAN-IN buffering problem not a gate delay or gate logic
    problem.

    One has only to research CAM implementation in
    FPGAs. Register files with multiple ports are easily implemented with replication.

    Read ports can be added by replication, write ports cannot.

    It may be nice to see a CAM component in a vendor library. Register files sometimes have bypassing requirements that might make it challenging to develop a generic component.

    Address CAMs are generally done in the "can't be the same as" or
    "is exactly" sense, consuming significant amounts of logic as tag-compare.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Michael S on Tue Feb 11 19:26:24 2025
    On Sun, 9 Feb 2025 02:57:45 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Sat, 08 Feb 2025 17:46:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sat, 08 Feb 2025 08:11:04 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Or by my own pasting mistake. I am still not sure whom to blame.
    The mistake was tiny - absence of // at the begining of one line,
    but enough to not compile. Trying it for a second time:

    Now it's worse, it's quoted-printable. E.g.:

    if (li >=3D len || li <=3D 0)

    Some newsreaders can decode this, mine does not.

    First cycles (which eliminates worries about turbo modes) and
    instructions, then usec/call.

    I don't understand that.
    For original code optimized by clang I'd expect 22,000 cycles and
    5.15 usec per call on Haswell. Your numbers don't even resemble
    anything like that.

    My cycle numbers are for the whole program that calls keylocks()
    100_000 times.

    If you divide the cycles by 100000, you get 21954 for clang
    keylocks1-256, which is what you expect.

    instructions
    5_779_542_242 gcc avx2 1
    3_484_942_148 gcc avx2 2 8
    5_885_742_164 gcc avx2 3 8
    7_903_138_230 clang avx2 1
    7_743_938_183 clang avx2 2 8?
    3_625_338_104 clang avx2 3 8?
    4_204_442_194 gcc 512 1
    2_564_142_161 gcc 512 2 32
    3_061_042_178 gcc 512 3 16
    7_703_938_205 clang 512 1
    3_402_238_102 clang 512 2 16?
    3_320_455_741 clang 512 3 16?

    I don't understand these numbers either. For original clang, I'd
    expect 25,000 instructions per call.

    clang keylocks1-256 performs 79031 instructions per call (divide the
    number given by 100000 calls). If you want to see why that is, you
    need to analyse the code produced by clang, which I did only for
    select cases.

    Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
    Which could be due to differences in measurement methodology - I reported the median of 11 runs, you seem to report the average.

    I just report one run with 100_000 calls, and just hope that the
    variation is small:-) In my last refereed paper I use 30 runs and
    median, but I don't go to these lengths here; the cycles seem pretty repeatable.

    On the Golden Cove of a Core i3-1315U (compared to the best
    result by Terje Mathisen on a Core i7-1365U; the latter can run
    up to 5.2GHz according to Intel, whereas the former can
    supposedly run up to 4.5GHz; I only ever measured at most 3.8GHz
    on our NUC, and this time as well):

    I always thought that NUCs have better cooling than all but
    high-end laptops. Was I wrong? Such slowness is disappointing.

    The cooling may be better or not, that does not come into play here,
    as it never reaches higher clocks, even when it's cold; E-cores also
    stay 700MHz below their rated turbo speed, even when it's the only
    loaded core. One theory I have is that one option we set up in the
    BIOS has the effect of limiting turbo speed, but it has not been
    important enough to test.

    5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
    4.93us clang keylocks1-256 on a 3.8GHz 1315U
    4.17us gcc keylocks1-256 on a 3.8GHz 1315U
    3.16us gcc keylocks2-256 on a 3.8GHz 1315U
    2.38us clang keylocks2-512 on a 3.8GHz 1315U

    So, for the best-performing variant the IPC of Golden Cove is
    identical to ancient Haswell?

    Actually worse:

    For clang keylocks2-512 Haswell has 3.73 IPC, Golden Cove 3.63.

    That's very disappointing. Haswell has a 4-wide front
    end and the majority of AVX2 integer instructions are limited to
    a throughput of two per clock. Golden Cove has a 5+ wide front end and
    nearly all AVX2 integer instructions have a throughput of three per
    clock. Could it be that clang introduced some sort of latency
    bottleneck?

    As far as I looked into the code, I did not see such a bottleneck.
    Also, Zen4 has significantly higher IPC on this variant (5.36 IPC
    for clang keylocks2-256), and I expect that it would suffer from a
    general latency bottleneck, too. Rocket Lake is also faster on
    this program than Haswell and Golden Cove. It seems to be just
    that this program rubs Golden Cove the wrong way.

    I would have expected the clang keylocks1-256 to run slower,
    because the compiler back-end is the same and the 1315U is
    slower. Measuring cycles looks more relevant for this benchmark
    to me than measuring time, especially on this core where AVX-512
    is disabled and there is no AVX slowdown.

    I prefer time, because at the end it's the only thing that matters.


    True, and certainly, when stuff like AVX-512 license-based
    downclocking or thermal or power limits come into play (and are
    relevant for the measurement at hand), one has to go there. But
    then you can only compare code running on the same kind of machine, configured the same way. Or maybe just running on the same
    machine:-). But then, the generality of the results is
    questionable.

    - anton

    Back to the original question of the cost of misalignment.
    I modified the original code to force alignment in the inner loop:

    #include <stdint.h>
    #include <string.h>

    int foo_tst(const uint32_t* keylocks, int len, int li)
    {
      if (li <= 0 || len <= li)
        return 0;

      /* copy the first li locks into a 32-byte aligned buffer, zero-padded
         up to the next multiple of 32 elements */
      int lix = (li + 31) & -32;
      _Alignas(32) uint32_t tmp[lix];
      memcpy(tmp, keylocks, li*sizeof(*keylocks));
      if (lix > li)
        memset(&tmp[li], 0, (lix-li)*sizeof(*keylocks));

      int res = 0;
      for (int i = li; i < len; ++i) {
        uint32_t lock = keylocks[i];
        for (int k = 0; k < lix; ++k)
          res += (lock & tmp[k])==0;
      }
      /* each zero padding element matches every key, so subtract the
         (lix-li)*(len-li) spurious counts */
      return res - (lix-li)*(len-li);
    }

    Compiled with 'clang -O3 -march=haswell'
    On the same Haswell Xeon it runs at 2.841 usec/call, i.e. almost
    twice faster than original and only 1.3x slower than horizontally
    unrolled variants.

    So, at least on Haswell, unaligned AVX256 loads are slow.


    Above I came to a completely wrong conclusion.

    The speed-up I had seen has relatively little to do with alignment.
    In particular, this code is only 15% faster than the original on Haswell,
    22% faster on Skylake, and 1-2% faster on Zen3:

    int foo_tst(const uint32_t* keylocks, int len, int li)
    {
      if (li <= 0 || len <= li)
        return 0;

      _Alignas(32) uint32_t tmp[li];
      memcpy(tmp, keylocks, li*sizeof(*keylocks));

      int res = 0;
      for (int i = li; i < len; ++i) {
        uint32_t lock = keylocks[i];
        for (int k = 0; k < li; ++k)
          res += (lock & tmp[k])==0;
      }
      return res;
    }

    The real reason for the major speedup in the variant from the previous post is the elimination of the handling of the 26-item tail of the key[] array.
    clang generates word-by-word counting code for the tail, so handling
    the 26-word tail takes more time and more instructions than
    handling the 224-word body.
    That is also the reason for my confusion with the number of executed
    instructions in Anton's measurements.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Feb 18 15:07:39 2025
    On Tue, 18 Feb 2025 02:55:33 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It takes Round Nearest Odd to perform Kahan-Babuška Summation.


    Are you aware of any widespread hardware that supplies Round to Nearest
    with tie broken to Odd? Or of any widespread language that can request
    such rounding mode?
    Until both exist, implementing RNO on niche HW looks to me like a waste of
    both HW resources and of space in your datasheet.
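
    For readers without the reference at hand, a minimal C sketch of the
    Kahan-Babuška (Neumaier) compensated summation mentioned above; rounding-
    mode details aside, the idea is to carry the rounding error of each
    addition in a separate compensation term.

    #include <math.h>

    static double neumaier_sum(const double *a, int n)
    {
        double sum = 0.0, comp = 0.0;        /* running sum and compensation term */
        for (int i = 0; i < n; i++) {
            double t = sum + a[i];
            if (fabs(sum) >= fabs(a[i]))
                comp += (sum - t) + a[i];    /* low-order bits lost from a[i] */
            else
                comp += (a[i] - t) + sum;    /* low-order bits lost from sum  */
            sum = t;
        }
        return sum + comp;
    }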

    Instead, think of what you possibly forgot to do in order to help
    a software implementation of IEEE binary128. That would be orders of
    magnitude more useful in the real world. And don't take me wrong, "orders of magnitude more useful" is still a small niche on the absolute scale of usefulness.

    That is:: comply with IEEE 754-2019


    I'd say, comply with mandatory requirements of IEEE 754-2019.
    For optional requirements, be selective. Prefer those that can be
    accessed from widespread languages (including incoming editions of
    language standards) over the rest.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Michael S on Tue Feb 18 17:49:33 2025
    Michael S <already5chosen@yahoo.com> schrieb:

    I'd say, comply with mandatory requirements of IEEE 754-2019.
    For optional requirements, be selective. Prefer those that can be
    accessed from widespread languages (including incoming editions of
    language standards) over the rest.

    Fortran 2023 (now published) has quite a selection of IEEE
    intrinsics, and they have found their way into the My 66000 spec :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Feb 18 18:03:06 2025
    On Tue, 18 Feb 2025 13:07:39 +0000, Michael S wrote:

    On Tue, 18 Feb 2025 02:55:33 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It takes Round Nearest Odd to perform Kahan-Babuška Summation.


    Are you aware of any widespread hardware that supplies Round to Nearest
    with tie broken to Odd? Or of any widespread language that can request
    such rounding mode?

    No, No

    Until both, implementing RNO on niche HW looks to me as wastage of both
    HW resources and of space in your datasheet.

    The way I implement it, it is only an additional 10± gates.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to BGB on Tue Feb 18 19:25:31 2025
    BGB <cr88192@gmail.com> wrote:
    On 2/17/2025 11:07 PM, Robert Finch wrote:
    On 2025-02-17 8:00 p.m., BGB wrote:
    On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
    On Fri, 14 Feb 2025 21:14:11 +0000, BGB wrote:

    On 2/13/2025 1:09 PM, Marcus wrote:
    -------------

    The problem arises when the programmer *deliberately* does unaligned loads and stores in order to improve performance. Or rather, if the programmer knows that the hardware supports unaligned loads and
    stores,
    he/she can use that to write faster code in some special cases.


    Pretty much.


    This is partly why I am in favor of potentially adding explicit
    keywords
    for some of these cases, or to reiterate:
       __aligned:
         Inform compiler that a pointer is aligned.
         May use a faster version if appropriate.
          If a faster aligned-only variant exists of an instruction,
            on an otherwise unaligned-safe target.
       __unaligned: Inform compiler that an access is unaligned.
         May use a runtime call or similar if necessary,
           on an aligned-only target.
         May do nothing on an unaligned-safe target.
       None: Do whatever is the default.
         Presumably, assume aligned by default,
           unless target is known unaligned-safe.
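
    A hedged sketch of how the __aligned half of that proposal can already be
    expressed with an existing compiler extension: __builtin_assume_aligned
    (GCC/clang) lets the programmer promise alignment so the compiler can use
    aligned accesses and drop runtime alignment peeling, while the memcpy idiom
    discussed elsewhere in the thread covers the unaligned direction. The
    function name here is made up for illustration.

    #include <stddef.h>

    float sum_aligned(const float *p, size_t n)
    {
        /* promise: p is 64-byte aligned; the compiler may vectorize with
           aligned loads and skip the alignment checks */
        const float *q = __builtin_assume_aligned(p, 64);
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += q[i];
        return s;
    }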

    It would take LESS total man-power world-wide and over-time to
    simply make HW perform misaligned accesses.



    I think the usual issue is that on low-end hardware, it is seen as
    "better" to skip out on misaligned access in order to save some cost
    in the L1 cache.

    I always include support for unaligned accesses even with a ‘low-end’
    CPU. I think it is not that expensive and sure makes some things a lot
    easier when handled in hardware. For Q+ it just runs two bus cycles if
    the data spans a cache line and pastes results together as needed.


    I had gone aligned-only with some 32-bit cores in the past.

    The whole CPU core fit into fewer LUTs than I currently spend on just the L1
    D$...

    Granted, some of these used a very minimal L1 cache design:
    Only holds a single cache line.

    The smallest cores I had managed had used a simplified SH-based design:
    Fixed-length 16 bit instructions, with 16 registers;
    Only (Reg) and (Reg, R0) addressing;
    Aligned only;
    No shift or multiply;

    You mean no variable shift, or no large shifts, you have to support divide
    by 2, right?

    Where, say:
    SH-4 -> BJX1-32 (Added features)
    SH-4 -> B32V (Stripped down)
    BJX1-32 -> BJX1-64A (64-bit, Modal Encoding)
    B32V -> B64V (64-bit, Encoding Space Reorganizations)
    B64V ~> BJX1-64C (No longer Modal)

    Where, BJX1-64C was the end of this project (before I effectively did a soft-reboot).


    Then transition phase:
    B64V -> BtSR1 (Dropped to 32-bit, More Encoding Changes)
    Significant reorganization.
    Was trying to get optimize for code density closer to MSP430.
    BtSR1 -> BJX2 (Back to 64-bit, re-adding features from BJX1-64C)
    A few features added for BtSR1 were dropped again in BJX2.

    The original form of BJX2 was still a primarily 16-bit ISA encoding, but
    at this point pretty much mutated beyond recognition (and relatively few instructions were still in the same places that they were in SH-4).


    For example (original 16-bit space):
    0zzz:
    SH-4: Ld/St (Rm,R0); also 0R and 1R spaces, etc.
    BJX2: Ld/St Only (Rm) and (Rm,R0)
    1zzz:
    SH-4: Store (Rn, Disp4)
    BJX2: 2R ALU ops
    2zzz:
    SH-4: Store (@Rn, @-Rn), ALU ops
    BJX2: Branch Ops (Disp8), etc
    3zzz:
    SH-4: ALU ops
    BJX2: 0R and 1R ops
    4zzz:
    SH-4: 1R ops
    BJX2: Ld/St (SP, Disp4); MOV-CR, LEA
    5zzz:
    SH-4: Load (Rm, Disp4)
    BJX2: Load (Unsigned), ALU ops
    6zzz:
    SH-4: Load (@Rm+ and @Rm), ALU
    BJX2: FPU ops, CMP-Imm4
    7zzz:
    SH-4: ADD Imm8, Rn
    BJX2: (XGPR 32-bit Escape Block)
    8zzz:
    SH-4: Branch (Disp8)
    BJX2: Ld/St (Rm, Disp3)
    9zzz:
    SH-4: Load (PC-Rel)
    BJX2: (XGPR 32-bit Escape Block)
    Azzz:
    SH-4: BRA Disp12
    BJX2: MOV Imm12u, R0
    Bzzz:
    SH-4: BSR Disp12
    BJX2: MOV Imm12n, R0
    Czzz:
    SH-4: Some Imm8 ops
    BJX2: ADD Imm8, Rn
    Dzzz:
    SH-4: Load (PC-Rel)
    BJX2: MOV Imm8, Rn
    Ezzz:
    SH-4: MOV Imm8, Rn
    BJX2: (32-bit Escape, Predicated Ops)
    Fzzz:
    SH-4: FPU Ops
    BJX2: (32-bit Escape, Unconditional Ops)

    For the 16-bit ops, SH-4 had more addressing modes than BJX2:
    SH-4: @Reg, @Rm+, @-Rn, @(Reg,R0), @(Reg,Disp4) @(PC,Disp8)
    BJX2: (Rm), (Rm,R0), (Rm,Disp3), (SP,Disp4)

    Although it may seem like it, I didn't just completely start over on the layout, but rather it was sort of an "ant-hill reorganization".


    Say, for example:
    1zzz and 5zzz were merged into 8zzz, reducing Disp by 1 bit
    2zzz and 3zzz was partly folded into 0zzz and 1zzz
    8zzz's contents were moved to 2zzz
    4zzz and part of 0zzz were merged into 3zzz
    ...


    A few CR's are still in the same places and SR still has a similar
    layout I guess, ...



    Early on, there was the idea that the 32-bit ops were prefix-modified versions of the 16-bit ops, but this symmetry soon broke and the 16
    and 32-bit encoding spaces became independent of each other.

    Though, the 32-bit F0 space still has some amount of similarity to the
    16-bit space.


    Later on I did some testing and performance comparisons, and realized
    that using 32-bit encodings primarily (or exclusively) gave
    significantly better performance than relying primarily or exclusively
    on 16-bit ops. And at this point the ISA transitioned from a primarily
    16-bit ISA (with 32-bit extension ops) to a primarily 32-bit ISA with a 16-bit encoding space. This transition didn't directly affect encodings,
    but did affect how the ISA developed from then on (more so,
    there was no longer an idea that the 16-bit ISA would need to be able to exist standalone; but now the 32-bit ISA did need to be able to exist standalone).



    But, now newer forms of BJX2 (XG2 and XG3) have become almost
    unrecognizable from early BJX2 (as an ISA still primarily built around
    16-bit encodings).

    Except that XG2's instruction layout still carries vestiges of its
    origins as a prefix encoding. But, XG3 even makes this part disappear
    (by reorganizing the bits to more closely resemble RISC-V's layout).

    Well, and there is:
    ZnmX -> ZXnm

    But:
    F0nm_ZeoX


    I prefer my strategy instead:
       FADD/FSUB/FMUL:
         Hard-wired Round-Nearest / RNE.
         Does not modify FPU flags.
       FADDG/FSUBG/FMULG:
         Dynamic Rounding;
         May modify FPU flags.

    Can note that RISC-V burns 3 bits for FPU instructions always encoding
    a rounding mode (whereas in my ISA, encoding a rounding mode other
    than RNE or DYN requires a 64-bit encoding).


    Q+ encodes rounding mode the same way as RISC-V, as there are lots of bits
    available in the instruction. Burning bits on the rounding mode seems
    reasonable to me when bits are available.


    Initially:
    3 bits of entropy were eaten by the 16-bit space;
    2 more bits were eaten by predication and WEX.

    So, the initial ISA design for 32-bit ops had 5 fewer bits than in RISC-V land.

    XG2 reclaimed the 16-bit space, but used the bits to expand all the
    register fields to 6 bits.


    Not many bits left to justify burning on a rounding mode.
    And, my Imm/Disp fields were generally 3 bits less than RV.



    Modified the PRED modifier in Q+ to take a predicate bit from one of
    three registers used to supply bits. Previously an array of two-bit mask
    values encoded in the instruction indicated to 1) ignore the predicate
    bit 2) execute if predicate true or 3) execute if predicate false.
    Since there were three reg specs available in the PRED modifier, it
    seemed to make more sense to specify three regs instead of one. So now
    it works 1) as before, 2) execute if bit in Ra is set, 3) execute if bit
    in Rb is set, 4) execute if bit in Rc is set.
    The same register may be specified for Ra, Rb, and Rc. Since there is
    sign inversion available, the original operation may be mimicked by
    specifying Ra, ~Ra.


    In BJX2, every 32-bit instruction encodes predication in 2 bits.

    In XG3, the space that would have otherwise encoded WEX was instead left
    to RISC-V (to create a conglomerate ISA).

    But, there is also the possibility to use XG3 by itself without any
    RISC-V parts in the mix.






    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to All on Tue Feb 18 20:57:35 2025
    Den 2025-02-18 kl. 20:25, skrev Brett:
    BGB <cr88192@gmail.com> wrote:

    [snip]

    The smallest cores I had managed had used a simplified SH-based design:
    Fixed-length 16 bit instructions, with 16 registers;
    Only (Reg) and (Reg, R0) addressing;
    Aligned only;
    No shift or multiply;

    You mean no variable shift, or no large shifts, you have to support divide
    by 2, right?


    Yes, LSL 1 can be implemented by ADD, but LSR/ASR 1 needs a dedicated instruction, right?

    IIRC the SuperH has some power-of-two shift instructions, e.g:

    shlr Rn
    shlr2 Rn
    shlr4 Rn
    shlr8 Rn
    shlr16 Rn

    It takes up some encoding space and costs extra cycles/instructions to
    do a full shift (e.g. 7=4+2+1), but I guess you can make relatively
    cheap shift hardware that way? Maybe you can get away with even fewer instructions (e.g. only 1, 4, 16)?
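
    A compiler for such a core would then expand a variable shift into a
    short run of those fixed shifts, roughly like this (a sketch of mine;
    the loops stand in for repeated single instructions, and the exact set
    of fixed amounts is whatever the ISA provides):

        #include <stdint.h>

        uint32_t shr_var(uint32_t x, unsigned n)   /* x >> n, 0 <= n <= 31 */
        {
            while (n >= 16) { x >>= 16; n -= 16; }
            while (n >= 8)  { x >>= 8;  n -= 8;  }
            while (n >= 2)  { x >>= 2;  n -= 2;  }
            while (n >= 1)  { x >>= 1;  n -= 1;  }
            return x;
        }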

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Tue Feb 18 22:09:54 2025
    MitchAlsup1 wrote:
    On Tue, 18 Feb 2025 13:07:39 +0000, Michael S wrote:

    On Tue, 18 Feb 2025 02:55:33 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It takes Round Nearest Odd to perform Kahan-Babashuka Summation.


    Are you aware of any widespread hardware that supplies Round to Nearest
    with tie broken to Odd? Or of any widespread language that can request
    such rounding mode?

    No, No

    Until both, implementing RNO on niche HW looks to me as wastage of both
    HW resources and of space in your datasheet.

    The way I implement it, it is only an additional 10± gates.

    With discrete logic, it should be identical to RNE, except for flipping
    the ulp bit when deciding upon the rounding direction, right?

    With a full 4-bit lookup table you need a few more gates, but that is
    still the obvious way to implement rounding in SW. (It is only ceil()
    and floor() that requires the sign bit as input, the remaining rounding
    modes can make do with ulp+guard+sticky.)
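
    A minimal C sketch of that table-style decision (mine; RNE is IEEE
    nearest-even, RNO is the nearest-odd variant discussed above):

        /* decide whether to add one ULP, given the kept ULP (lsb),
           guard and sticky bits */
        static int round_up_rne(int ulp, int guard, int sticky)
        {
            return guard && (sticky || ulp);    /* ties: bump only if lsb is 1 */
        }

        static int round_up_rno(int ulp, int guard, int sticky)
        {
            return guard && (sticky || !ulp);   /* same table with the ULP bit
                                                   flipped: ties go to odd */
        }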

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Feb 18 22:33:41 2025
    On Tue, 18 Feb 2025 21:09:54 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Tue, 18 Feb 2025 13:07:39 +0000, Michael S wrote:

    On Tue, 18 Feb 2025 02:55:33 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It takes Round Nearest Odd to perform Kahan-Babashuka Summation.


    Are you aware of any widespread hardware that supplies Round to Nearest
    with tie broken to Odd? Or of any widespread language that can request
    such rounding mode?

    No, No

    Until both, implementing RNO on niche HW looks to me as wastage of both
    HW resources and of space in your datasheet.

    The way I implement it, it is only an additional 10± gates.

    With discrete logic, it should be identical to RNE, except for flipping
    the ulp bit when deciding upon the rounding direction, right?

    Yes,

    With a full 4-bit lookup table you need a few more gates, but that is
    still the obvious way to implement rounding in SW. (It is only ceil()
    and floor() that requires the sign bit as input, the remaining rounding
    modes can make do with ulp+guard+sticky.

    sign+ULP+Guard+sticky is all you ever need for any rounding mode
    IEEE or beyond.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Feb 18 23:31:49 2025
    On Tue, 18 Feb 2025 22:34:48 +0000, BGB wrote:


    Say, one could imagine an abstract model where Binary64 FADD works sort
    of like:
    sgnA=valA>>63;
    sgnB=valB>>63;
    expA=(valA>>52)&2047;
    expB=(valB>>52)&2047;
    fraA=(valA&((1ULL<<52)-1));
    fraB=(valB&((1ULL<<52)-1));
    if(expA!=0)fraA|=1ULL<<52;
    if(expB!=0)fraB|=1ULL<<52;
    fraA=fraA<<9; //9 sub ULP bits
    fraB=fraB<<9;
    shrA=(expB>=expA)?(expB-expA):0;  //shift the smaller-exponent operand right
    shrB=(expA>=expB)?(expA-expB):0;
    sgn2A=sgnA; exp2A=expA; fra2A=fraA>>shrA;
    sgn2B=sgnB; exp2B=expB; fra2B=fraB>>shrB;
    //logical clock-edge here.
    fr1C_A=fra2A+fra2B;
    fr1C_B=fra2A-fra2B;
    fr1C_C=fra2B-fra2A;
    if(sgn2A^sgn2B)
    {
      if(fr1C_C>>63)
        { sgn1C=sgn2A; fra1C=fr1C_B; }
      else
        { sgn1C=sgn2B; fra1C=fr1C_C; }
    }
    else
      { sgn1C=sgn2A; fra1C=fr1C_A; }
    exp1C=(exp2B>=exp2A)?exp2B:exp2A;  //common exponent after alignment
    //logical clock-edge here (1C values latched as sgn2C/exp2C/fra2C).
    if(fra2C>>62)
      { exp3C=exp2C+1; fra3C=fra2C>>1; }               //carry-out: renormalize
    else
      { shl=clz64(fra2C)-2; exp3C=exp2C-shl; fra3C=fra2C<<shl; }
    //logical clock-edge here.
    if((exp3C>=2047) || (exp3C<=0))
      { sgnC=sgn2C; expC=(exp3C<=0)?0:2047; fraC=0; }  //overflow/underflow
    else
    {
      sgnC=sgn2C; expC=exp3C;
      fraC=(fra3C>>9)&((1ULL<<52)-1);                  //drop the hidden bit
      //if rounding is done, it goes here.
    }
    valC=(sgnC<<63)|(expC<<52)|fraC;
    //final clock edge.
    //result is now ready.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Tue Feb 25 08:11:01 2025
    MitchAlsup1 wrote:
    On Mon, 24 Feb 2025 19:52:49 +0000, Robert Finch wrote:

    CAMs can easily be implemented in FPGAs although they may have
    multi-cycle latency.

    A CAM is a vector of XOR gate inputs that feed an AND gate.

    A 5-bit CAM with valid bit is 3-gates in CMOS and 2-gates of delay.
    It is only when there are lots of bits being CAMed that the latency
    increases markedly -- OR when there are lots of entries being CAMed,
    but this is a FAN-IN buffering problem, not a gate delay or gate logic problem.

    Yes but since FPGA's don't have CAM's people have figured out how
    to build pseudo-CAM's from multiple SRAM lookup tables.

    One has only to research CAM implementation in
    FPGAs. Register files with multiple ports are easily implemented with
    replication.

    Read ports can be added by replication, write ports cannot.

    Pseudo-write ports can be built from duplicating register files in
    multiple banks and either:

    - Live Value Table tracks which port last wrote each bank's register
    - XOR stores the values from each port xor'd together
    and regenerates the original values on read

    See the papers I reference in my reply to Robert Finch today.

    None of these are anything like the real things.
    But if you can't afford an ASIC then these are the only options.
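
    As a rough C model (mine, not the papers' RTL) of the XOR write-port
    trick: each write port owns a bank, a write stores the new value XOR'd
    with what the other banks hold at that address, and a read XORs all
    banks back together. The hardware cost shows up as the extra read
    ports each bank needs so a writer can see the other banks' contents.

        #include <stdint.h>

        #define NREGS  32
        #define NPORTS 2                     /* write ports = banks */

        static uint32_t bank[NPORTS][NREGS];

        uint32_t rf_read(unsigned r)
        {
            uint32_t v = 0;
            for (int b = 0; b < NPORTS; b++)
                v ^= bank[b][r];             /* XOR of all banks = live value */
            return v;
        }

        void rf_write(int port, unsigned r, uint32_t v)
        {
            uint32_t others = 0;
            for (int b = 0; b < NPORTS; b++)
                if (b != port)
                    others ^= bank[b][r];    /* what the other banks hold */
            bank[port][r] = v ^ others;      /* so the XOR over all banks is v */
        }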

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Tue Feb 25 07:51:12 2025
    Robert Finch wrote:
    Respecting that I do not know that much about the work environment of FPGA developers:
    I have thought of FPGAs as more of a prototyping tool, or to be used in one-off designs, proof-of-concept type things. In those cases one
    probably does not care too much about manual operations; as was said, one would be more interested in the productivity of developers that comes from reliable tools and being able to deal with things at a high level.

    The vendors have a number of pre-made components that can be plugged
    into a design, making it possible to sketch out a design very quickly,
    with a couple of caveats, one being that one might be stuck with a particular vendor.

    CAMs can easily be implemented in FPGAs although they may have
    multi-cycle latency. One has only to research CAM implementation in
    FPGAs.

    Papers on pseudo-BCAM and pseudo-TCAM designs built from multiple SRAMs
    on FPGA's:

    Deep and Narrow Binary Content-Addressable Memories using
    FPGA-based BRAMs, 2014 https://www.ece.mcmaster.ca/faculty/ameer/publications/Abdelhadi-Conference-2014Dec-FPT2014-DeepandNarrowBCAMs-full.pdf

    Modular SRAM-Based Binary Content-Addressable Memories, 2015 https://www.ece.mcmaster.ca/~ameer/publications/Abdelhadi-Conference-2015May-FCCM2015-ModularBCAMs.pdf

    Modular Block-RAM-Based Longest-Prefix Match Ternary Content-Addressable Memories, 2018 https://www.ece.mcmaster.ca/~ameer/publications/Abdelhadi-Conference-2018Aug-FPL2018-LPM-TCAMs.pdf

    Register files with multiple ports are easily implemented with
    replication.

    Two ways for getting multiple write ports from FPGA single write
    port register files - Live Value Table (LVT), and XOR approaches:

    Efficient Multi-Ported Memories for FPGAs, 2010 http://www.fpgacpu.ca/publications/FPGA2010-LaForest-Paper.pdf

    Multi-Ported Memories for FPGAs via XOR, 2012 https://fpgacpu.ca/multiport/FPGA2012-LaForest-XOR-Paper.pdf

    It may be nice to see a CAM component in a vendor library.
    Register files sometimes have bypassing requirements that might make it challenging to develop a generic component.

    All of these pseudo designs have speed and duplication cost drawbacks.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to BGB on Tue Feb 25 09:32:15 2025
    BGB wrote:
    On 2/21/2025 1:51 PM, EricP wrote:

    and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:

    Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
    http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf


    Errm, skim, this doesn't really look like something you can pull off in normal Verilog.

    Generally, one doesn't have control over how the components hook together;
    one can only influence what happens based on how one writes the Verilog.

    You can just write:
    reg[63:0] tValA;
    reg[63:0] tValB;
    reg[63:0] tValC;
    tValC=tValA+tValB;


    But, then it spits out something with a chain of 16 CARRY4's, so there
    is a fairly high latency on the high order bits of the result.

    Possibly this is how they do it, the equivalent of inline assembler:

    Optimizing Xilinx designs through primitive instantiation, 2010 https://scholar.archive.org/work/25gxuk4if5fntmhzx3gciltk2a/access/wayback/http://www.da.isy.liu.se/pubs/ehliar/ehliar-FPGA2010ORG.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Michael S on Tue Feb 25 09:20:45 2025
    Michael S wrote:
    On Mon, 24 Feb 2025 11:52:38 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Sun, 23 Feb 2025 11:13:53 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    It looks to me that Vivado intends that after you get your basic
    design working, this module optimization is *exactly* what one is
    supposed to do.

    In this case the prototype design establishes that you need
    multiple 64-bit adders and the generic ones synthesis spits out
    are slow. So you isolate that module off, use Verilog to drive the
    basic LE selections, then iterate doing relative LE placement
    specifiers, route the module, and when you get the fastest 64-bit
    adder you can then lock down the netlist and save the module
    design.

    Now you have a plug-in 64-bit adder module that runs at (I don't
    know the speed difference between Virtex and your Spartan-7 so
    wild guess) oh, say, 4 ns, to use multiple places... fetch,
    decode, alu, agu.

    Then plug that into your ALU, add in SUB, AND, OR, XOR, functions,
    isolate that module, optimize placement, route, lock down netlist,
    and now you have a 5 ns plug-in ALU module.

    Doing this you build up your own IP library of optimized hardware
    modules.

    As more and more modules are optimized the system synthesis gets
    faster because much of the fine grain work and routing is already
    done.


    It sounds like your 1st hand FPGA design experience is VERY
    outdated.
    Never have, likely never will.
    Nothing against them - looks easier than wire-wrapping TTL and 4000
    CMOS. Though people do seem to spend an awful lot of time working
    around certain deficiencies like the lack of >1 write ports on
    register files, and the lack of CAM's. One would think market forces
    would induce at least one supplier to add these and take the fpga
    market by storm.


    Your view is probably skewed by talking to soft core hobbyists.
    Please realize that most professionals do not care about
    high-performance soft cores. Soft cores are for control plane functions
    rather than for data plane. Important features are ease of use,
    reliability (esp. of software tools), and small size. Performance is
    rated low. Performance per clock is rated even lower. So, professionals
    do not develop soft cores by themselves. And OTS cores that they use
    are not superscalar. Quite often not even fully pipelined.
    It means, no, small SRAM banks with two independent write ports are not
    a feature that FPGA pros would be excited about.

    The papers I'm reading are from both professionals and academic
    mostly published in IEEE or ACM. Rarely hobbyist.

    For pure soft cores, yes there seems limited commercial use for them.
    There is use as microcontrollers for other FPGA logic, and if your
    FPGA doesn't have an ARM or RV built in then you need to build one.
    And if the microcontroller speed was a consideration you should
    have selected an fpga with one built in.

    Multiple write ports on register files is useful in custom accelerators
    with parallel pipelines. And I see lots of reference to custom accelerators built on FPGA's.

    Lots of people seem to be building things like IP packet matchers which
    require Binary BCAM's or Ternary TCAM's (CAM's with don't care bits).

    I'm just trying to figure out what these other folks are doing to get
    bleeding edge performance from essentially the same tools and similar
    chips.

    I assume you are referring to the gui IDE interface for things like
    floor planning where you click on a LE cells and set some attributes.
    I also think I saw reference to locking down parts of the net list.
    But there are a lot of documents to go through.


    No, I mean floorplanning, as well as most other manual physical-level optimizations, are not used at all in 99 percent of FPGA designs that
    were started after year 2005.

    Is that because the auto place and route got good enough that it is unnecessary? Or maybe the fpga resources grew enough that autoroute
    didn't have to struggle to find optimal positions and paths
    (being an optimal packing problem and a traveling salesman problem).

    Also BGB mentioned in another thread a while back that he was getting
    what sounded like random variation of critical paths from run to run.
    That suggests to me the automatic tools may not be properly recognizing
    the different modules and produce some non-optimal positions or paths.
    So giving it a hint that "this stuff goes together" might help.

    Anyway, it should be testable. Inspect the auto placement's module wiring
    and if there are any obviously crazy decisions then try the placement tool
    and see if the speed improves or the critical path variation goes away.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Tue Feb 25 18:07:10 2025
    On Tue, 25 Feb 2025 14:20:45 +0000, EricP wrote:

    Michael S wrote:
    --------------------

    No, I mean floorplanning, as well as most other manual physical-level
    optimizations, are not used at all in 99 percent of FPGA designs that
    were started after year 2005.

    Is that because the auto place and route got good enough that it is unnecessary? Or maybe the fpga resources grew enough that autoroute
    didn't have to struggle to find optimal positions and paths
    (being an optimal packing problem and a traveling salesman problem).

    Athlon (1998) used hand place auto-route. So, auto-route has been
    good enough since 2000 at latest.

    Also BGB mentioned in another thread a while back that he was getting
    what sounded like random variation of critical paths from run to run.
    That suggests to me the automatic tools may not be properly recognizing
    the different modules and produce some non-optimal positions or paths.
    So giving it a hint that "this stuff goes together" might help.

    Consider the optimizer/place/route thingamabob; and a signal that
    crosses from one module to another. The optimizer changes from
    a 2-LUT delay to a 1 LUT delay, but now the fan-out of that LUT
    doubles, so instead of speeding up, the signal path slows down.

    Anyway, it should be testable. Inspect the auto placement's module wiring
    and if there are any obviously crazy decisions then try the placement tool
    and see if the speed improves or the critical path variation goes away.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Tue Feb 25 15:24:34 2025
    MitchAlsup1 wrote:
    On Tue, 25 Feb 2025 14:20:45 +0000, EricP wrote:

    Michael S wrote:
    --------------------

    No, I mean floorplanning, as well as most other manual physical-level
    optimizations, are not used at all in 99 percent of FPGA designs that
    were started after year 2005.

    Is that because the auto place and route got good enough that it is
    unnecessary? Or maybe the fpga resources grew enough that autoroute
    didn't have to struggle to find optimal positions and paths
    (being an optimal packing problem and a traveling salesman problem).

    Athlon (1998) used hand place auto-route. So, auto-route has been
    good enough since 2000 at latest.

    Also BGB mentioned in another thread a while back that he was getting
    what sounded like random variation of critical paths from run to run.
    That suggests to me the automatic tools may not be properly recognizing
    the different modules and produce some non-optimal positions or paths.
    So giving it a hint that "this stuff goes together" might help.

    Consider the optimizer/place/route thingamabob; and a signal that
    crosses from one module to another. The optimizer changes from
    a 2-LUT delay to a 1 LUT delay, but now the fan-out of that LUT
    doubles, so instead of speeding up, the signal path slows down.

    Or if it is not doing a full traveling salesman it might be sensitive
    to the order items are on a list. E.g. it needs to connect outputs
    O1 and O2 to logic L1 and L2 (L1 and L2 are identical).

    If it pops O1 first it connects both straight through

         O1  O2
         v   v
         L1  L2

    but it it pops O2 first it connects to L1 and O1 is blocked so
    it has to take a trip all the way around

         ---------
         |       |
         O1  O2  |
            /    |
         L1  L2<--

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Wed Feb 19 17:35:41 2025
    MitchAlsup1 wrote:
    On Tue, 18 Feb 2025 21:09:54 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Tue, 18 Feb 2025 13:07:39 +0000, Michael S wrote:

    On Tue, 18 Feb 2025 02:55:33 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It takes Round Nearest Odd to perform Kahan-Babashuka Summation.


    Are you aware of any widespread hardware that supplies Round to Nearest
    with tie broken to Odd? Or of any widespread language that can request
    such rounding mode?

    No, No

    Until both, implementing RNO on niche HW looks to me as wastage of both
    HW resources and of space in your datasheet.

    The way I implement it, it is only an additional 10± gates.

    With discrete logic, it should be identical to RNE, except for flipping
    the ulp bit when deciding upon the rounding direction, right?

    Yes,

    With a full 4-bit lookup table you need a few more gates, but that is
    still the obvious way to implement rounding in SW. (It is only ceil()
    and floor() that requires the sign bit as input, the remaining rounding
    modes can make do with ulp+guard+sticky.

    sign+ULP+Guard+sticky is all you ever need for any rounding mode
    IEEE or beyond.

    That's what I believed all through the 2019 standards process and up to
    a month or two ago:

    In reality, the "NearestOrEven" rounding rule has an exception if/when
    you need to round the largest possible fp number, with guard=1 and
    sticky=0:

    I.e. exactly halfway to the next possible value (which would be Inf)

    In just this particular case, the OrEven part is skipped in favor of not rounding up, so leaving a maximum/odd mantissa.

    In the same case but sticky=1 we do round up to Inf.

    This unfortunately means that the rounding circuit needs to be combined
    with an exp+mant==0b111...111 input. :-(
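
    In C terms, the exception as described reads roughly like this (my
    sketch, for binary64, with the sign bit dropped from abs_bits;
    0x7FEFFFFFFFFFFFFF is the largest finite encoding):

        #include <stdint.h>

        int rne_round_up(uint64_t abs_bits, int ulp, int guard, int sticky)
        {
            int up = guard && (sticky || ulp);   /* ordinary nearest-even */
            /* exact halfway to Inf: keep the maximum/odd mantissa instead */
            if (up && !sticky && abs_bits == 0x7FEFFFFFFFFFFFFFull)
                up = 0;
            return up;
        }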

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Wed Feb 19 17:31:08 2025
    On Wed, 19 Feb 2025 16:35:41 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Tue, 18 Feb 2025 21:09:54 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Tue, 18 Feb 2025 13:07:39 +0000, Michael S wrote:

    On Tue, 18 Feb 2025 02:55:33 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It takes Round Nearest Odd to perform Kahan-Babashuka Summation.


    Are you aware of any widespread hardware that supplies Round to Nearest
    with tie broken to Odd? Or of any widespread language that can request
    such rounding mode?

    No, No

    Until both, implementing RNO on niche HW looks to me as wastage of both
    HW resources and of space in your datasheet.

    The way I implement it, it is only an additional 10± gates.

    With discrete logic, it should be identical to RNE, except for flipping
    the ulp bit when deciding upon the rounding direction, right?

    Yes,

    With a full 4-bit lookup table you need a few more gates, but that is
    still the obvious way to implement rounding in SW. (It is only ceil()
    and floor() that requires the sign bit as input, the remaining rounding
    modes can make do with ulp+guard+sticky.

    sign+ULP+Guard+sticky is all you ever need for any rounding mode
    IEEE or beyond.

    That's what I believed all through the 2019 standards process and up to
    a month or two ago:

    In reality, the "NearestOrEven" rounding rule has an exception if/when
    you need to round the largest possible fp number, with guard=1 and
    sticky=0:

    I.e. exactly halfway to the next possible value (which would be Inf)

    In just this particular case, the OrEven part is skipped in favor of not rounding up, so leaving a maximum/odd mantissa.

    In the same case but sticky=1 we do round up to Inf.

    This unfortunately means that the rounding circuit needs to be combined
    with an exp+mant==0b111...111 input. :-(

    You should rename that mode as "Round but stay finite"

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Thu Feb 20 03:02:56 2025
    On Wed, 19 Feb 2025 22:42:04 +0000, BGB wrote:

    On 2/19/2025 11:31 AM, MitchAlsup1 wrote:
    On Wed, 19 Feb 2025 16:35:41 +0000, Terje Mathisen wrote:

    ------------------
    sign+ULP+Guard+sticky is all you ever need for any rounding mode
    IEEE or beyond.

    That's what I believed all through the 2019 standards process and up to
    a month or two ago:

    In reality, the "NearestOrEven" rounding rule has an exception if/when
    you need to round the largest possible fp number, with guard=1 and
    sticky=0:

    I.e. exactly halfway to the next possible value (which would be Inf)

    In just this particular case, the OrEven part is skipped in favor of not
    rounding up, so leaving a maximum/odd mantissa.

    In the same case but sticky=1 we do round up to Inf.

    This unfortunately means that the rounding circuit needs to be combined
    with an exp+mant==0b111...111 input. :-(

    You should rename that mode as "Round but stay finite"


    So, does it overflow?...

    Based on how IEEE 754 worked throughout its history:

    If the calculation overflows without the need for rounding,
    then yes, it overflows. What is different is that rounding
    all by itself does not overflow.
    ----------------

    Admittedly part of why I have such mixed feelings on full
    compare-and-branch:
    Pro: It can offer a performance advantage (in terms of per-clock);
    Con: Branch is now beholden to the latency of a Subtract.
    Con: it can't compare to a constant
    Con: it can't compare floating point

    ----------------

    Where, detecting all zeroes is at least cheaper than a subtract. But, detecting all zeroes still isn't free (for 64b, ~ 10 LUTs and 3 LUTs
    delay).

    1 gate:  4 inputs, inverted
    2 gates: 16 inputs, true
    3 gates: 64 inputs, inverted

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Fri Feb 7 11:06:43 2025
    Michael S wrote:
    On Thu, 6 Feb 2025 21:36:38 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    BTW, when I timed 1000 calls to that 5-6 us program, to get around
    the 100 ns timer resolution, each iteration ran in 5.23 us.

    That measurement could be good enough on desktop. Or not.
    It is certainly not good enough on a laptop and even less so on a server.
    On a laptop I wouldn't be satisfied before I lock my program to a
    particular core, then do something like 21 measurements with 100K calls
    in each measurement (~10 sec total) and report median of 21.

    Each measurement did 1000 calls, then I ran 100 such measurements. The
    5.23 us value was the lowest seen among the 100, with average a bit more:


    Slowest: 9205200 ns
    Fastest: 5247500 ns
    Average: 5672529 ns/iter
    Part1 = 3338

    My own (old, but somewhat kept up to date) cputype program reported that
    it is a "13th Gen Intel(R) Core(TM) i7-1365U" according to CPUID.

    Is that sufficient to judge the performance?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to Anton Ertl on Thu Feb 13 20:09:42 2025
    On 2025-02-03, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 2/2/2025 10:45 AM, EricP wrote:
    Digging deeper with performance counters reveals executing each unaligned
    load instruction results in ~505 executed instructions. P550 almost
    certainly doesn’t have hardware support for unaligned accesses.
    Rather, it’s likely raising a fault and letting an operating system
    handler emulate it in software."


    An emulation fault, or something similarly nasty...


    At that point, even turning any potentially unaligned load or store into
    a runtime call is likely to be a lot cheaper.

    There are lots of potentially unaligned loads and stores. There are
    very few actually unaligned loads and stores: On Linux-Alpha every
    unaligned access is logged by default, and the number of
    unaligned-access entries in the logs of our machines was relatively
    small (on average a few per day). So trapping actual unaligned
    accesses was faster than replacing potential unaligned accesses with
    code sequences that synthesize the unaligned access from aligned
    accesses.

    If you compile regular C/C++ code that does not intentionally do any
    nasty stuff, you will typically have zero unaligned loads or stores.

    My machine still does not support unaligned accesses in hardware (it's
    on the todo list), and it can run an awful lot of software without
    problems.

    The problem arises when the programmer *deliberately* does unaligned
    loads and stores in order to improve performance. Or rather, if the
    programmer knows that the hardware supports unaligned loads and stores,
    he/she can use that to write faster code in some special cases.
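
    A typical example of that deliberate use (an illustration of mine, not
    Marcus's code) is word-at-a-time processing of a byte buffer, where the
    8-byte loads are unaligned whenever the buffer isn't 8-byte aligned; on
    hardware with cheap unaligned accesses the memcpy below compiles to a
    single load, while on aligned-only hardware the compiler has to emit
    something slower:

        #include <stdint.h>
        #include <string.h>
        #include <stddef.h>

        uint64_t hash64(const void *buf, size_t len)
        {
            const unsigned char *p = buf;
            uint64_t h = 0x9E3779B97F4A7C15ull;      /* arbitrary seed */
            while (len >= 8) {
                uint64_t w;
                memcpy(&w, p, 8);                    /* possibly unaligned load */
                h = (h ^ w) * 0x100000001B3ull;
                p += 8; len -= 8;
            }
            while (len--)                            /* byte tail */
                h = (h ^ *p++) * 0x100000001B3ull;
            return h;
        }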


    Of course, if the cost of unaligned accesses is that high, you will
    avoid them in cases like block copies where cheap unaligned accesses
    would otherwise be beneficial.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Marcus on Thu Feb 13 19:21:36 2025
    On 2025-02-13, Marcus <m.delete@this.bitsnbites.eu> wrote:

    The problem arises when the programmer *deliberately* does unaligned
    loads and stores in order to improve performance. Or rather, if the programmer knows that the hardware supports unaligned loads and stores, he/she can use that to write faster code in some special cases.

    Or the compiler or standard library author can do the same.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Fri Feb 7 22:28:55 2025
    Thanks Michael, your code looks similar to what I wrote when I tried to
    use intrinsics.

    I also tried to extend both the locks and keys from 250 to 256, so that
    there would be zero tail overhead, but that was always slightly slower.

    My 4-wide is pretty much the same as your core code below.

    Terje

    Michael S wrote:
    On Fri, 7 Feb 2025 15:23:51 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Fri, 7 Feb 2025 11:06:43 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 6 Feb 2025 21:36:38 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    BTW, when I timed 1000 calls to that 5-6 us program, to get
    around the 100 ns timer resolution, each iteration ran in 5.23
    us.

    That measurement could be good enough on desktop. Or not.
    It is certainly not good enough on a laptop and even less so on a server.
    On a laptop I wouldn't be satisfied before I lock my program to a
    particular core, then do something like 21 measurements with 100K
    calls in each measurement (~10 sec total) and report median of
    21.

    Each measurement did 1000 calls, then I ran 100 such measurements.
    The 5.23 us value was the lowest seen among the 100, with average a
    bit more:


    Slowest: 9205200 ns
    Fastest: 5247500 ns
    Average: 5672529 ns/iter
    Part1 = 3338

    My own (old, but somewhat kept up to date) cputype program reported
    that it is a "13th Gen Intel(R) Core(TM) i7-1365U" according to
    CPUID.

    Is that sufficient to judge the performance?

    Terje


    Not really.
    i7-1365U is a complicated beast. 2 "big" cores, 8 "medium" cores.
    Frequency varies A LOT, 1.8 to 5.2 GHz on "big", 1.3 to 3.9 GHz on
    "medium".

    OK. It seems like the big cores are similar to what I've had
    previously, i.e. each core supports hyperthreading, while the medium
    ones don't. This results in 12 HW threads.

    As I said above, on such a CPU I wouldn't believe the numbers before the
    total duration of the test is 10 seconds and the test run is locked to a
    particular core. As to 5 msec per measurement, that's enough, but
    why not do longer measurements if you have to run for 10 sec
    anyway?

    The Advent of Code task required exactly 250 keys and 250 locks to be
    tested, this of course fits easily in a corner of $L1 (2000 bytes).

    The input file to be parsed was 43*500 = 21500 bytes long, so this
    should also fit in $L1 when I run repeated tests.

    Under Windows I can set thread affinity to lock a process to a given
    core, but how do I know which are "Big" and "Medium"?

    Trial and error?
    I think big cores/threads tend to have the lower numbers, but I am not
    sure it is universal.
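
    One way to answer that on Windows (a sketch of mine, not tested here):
    pin the current thread to each logical CPU in turn and ask CPUID which
    kind of core it landed on. On hybrid Intel parts, leaf 0x1A reports the
    core type in EAX[31:24] (0x40 for the "big" P-cores, 0x20 for the
    E-cores); strictly one should first check the hybrid flag in CPUID
    leaf 7 before trusting that field.

        #include <windows.h>
        #include <intrin.h>
        #include <stdio.h>

        int main(void)
        {
            SYSTEM_INFO si;
            GetSystemInfo(&si);
            for (DWORD cpu = 0; cpu < si.dwNumberOfProcessors; cpu++) {
                if (!SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << cpu))
                    continue;
                Sleep(1);                  /* let the scheduler migrate us */
                int r[4];
                __cpuidex(r, 0x1A, 0);     /* hybrid information leaf */
                int type = (r[0] >> 24) & 0xFF;
                printf("logical CPU %2lu: core type 0x%02X (%s)\n",
                       (unsigned long)cpu, type,
                       type == 0x40 ? "P-core" :
                       type == 0x20 ? "E-core" : "unknown");
            }
            return 0;
        }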



    Terje


    In the meantime,
    I did a few measurements on a Xeon E3 1271 v3. That is a rather old uArch - Haswell, the first core that supports AVX2. During the tests it was
    running at 4.0 GHz.

    1. Original code (rewritten in plain C) compiled with clang -O3 -march=ivybridge (no AVX2)
    2. Original code (rewritten in plain C) compiled with clang -O3 -march=haswell (AVX2)
    3. Manually vectorized AVX2 code compiled with clang -O3 -march=skylake (AVX2)

    Results were as follows (usec/call):
    1 - 5.66
    2 - 5.56
    3 - 2.18

    So, my measurements, similarly to your measurements, demonstrate that
    clang autovectorized code looks good, but does not perform too well.


    Here is my manual code. Handling of the tail is too clever. I did not
    have time to simplify. Otherwise, for 250x250 it should perform about
    the same as simpler code.

    #include <stdint.h>
    #include <immintrin.h>

    int foo_tst(const uint32_t* keylocks, int len, int li)
    {
      if (li >= len || li <= 0)
        return 0;
      const uint32_t* keyx = &keylocks[li];
      unsigned ni = len - li;
      __m256i res0 = _mm256_setzero_si256();
      __m256i res1 = _mm256_setzero_si256();
      __m256i res2 = _mm256_setzero_si256();
      __m256i res3 = _mm256_setzero_si256();
      const uint32_t* keyx_last = &keyx[ni & -32];
      for (; keyx != keyx_last; keyx += 32) {
        __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyx[0*8]);
        __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyx[1*8]);
        __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyx[2*8]);
        __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyx[3*8]);
        // for (int k = 0; k < li; ++k) {
        // for (int k = 0, nk = li; nk > 0; ++k, --nk) {
        for (const uint32_t* keyy = keylocks; keyy != &keylocks[li]; ++keyy) {
          // __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)&keylocks[k]));
          __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyy));
          res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
          res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
          res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
          res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
      }
      int res = 0;
      if (ni % 32) {
        uint32_t tmp[32];
        const uint32_t* keyy_last = &keylocks[li & -32];
        if (li % 32) {
          for (int k = 0; k < li % 32; ++k)
            tmp[k] = keyy_last[k];
          for (int k = li % 32; k < 32; ++k)
            tmp[k] = (uint32_t)-1;
        }
        const uint32_t* keyx_last = &keyx[ni % 32];
        int nz = 0;
        for (; keyx != keyx_last; keyx += 1) {
          if (*keyx) {
            __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyx));
            for (const uint32_t* keyy = keylocks; keyy != keyy_last; keyy += 32) {
              __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyy[0*8]);
              __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyy[1*8]);
              __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyy[2*8]);
              __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyy[3*8]);
              res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
              res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
              res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
              res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
            }
            if (li % 32) {
              __m256i lock0 = _mm256_loadu_si256((const __m256i*)&tmp[0*8]);
              __m256i lock1 = _mm256_loadu_si256((const __m256i*)&tmp[1*8]);
              __m256i lock2 = _mm256_loadu_si256((const __m256i*)&tmp[2*8]);
              __m256i lock3 = _mm256_loadu_si256((const __m256i*)&tmp[3*8]);
              res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
              res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
              res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
              res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
            }
          } else {
            nz += 1;
          }
        }
        res = nz * li;
      }
      // fold accumulators
      res0 = _mm256_add_epi32(res0, res2);
      res1 = _mm256_add_epi32(res1, res3);
      res0 = _mm256_add_epi32(res0, res1);
      res0 = _mm256_hadd_epi32(res0, res0);
      res0 = _mm256_hadd_epi32(res0, res0);

      res += _mm256_extract_epi32(res0, 0);
      res += _mm256_extract_epi32(res0, 4);
      return res;
    }









    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Fri Feb 7 22:27:03 2025
    On Fri, 7 Feb 2025 15:04:23 +0000, Michael S wrote:

    Here is my manual code. Handling of the tail is too clever. I did not
    have time to simplify. Otherwise, for 250x250 it should perform about
    the same as simpler code.

    #include <stdint.h>
    #include <immintrin.h>

    int foo_tst(const uint32_t* keylocks, int len, int li)
    {
      if (li >= len || li <= 0)
        return 0;
      const uint32_t* keyx = &keylocks[li];
      unsigned ni = len - li;
      __m256i res0 = _mm256_setzero_si256();
      __m256i res1 = _mm256_setzero_si256();
      __m256i res2 = _mm256_setzero_si256();
      __m256i res3 = _mm256_setzero_si256();
      const uint32_t* keyx_last = &keyx[ni & -32];
      for (; keyx != keyx_last; keyx += 32) {
        __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyx[0*8]);
        __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyx[1*8]);
        __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyx[2*8]);
        __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyx[3*8]);
        // for (int k = 0; k < li; ++k) {
        // for (int k = 0, nk = li; nk > 0; ++k, --nk) {
        for (const uint32_t* keyy = keylocks; keyy != &keylocks[li]; ++keyy) {
          // __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)&keylocks[k]));
          __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyy));
          res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
          res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
          res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
          res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
      }
      int res = 0;
      if (ni % 32) {
        uint32_t tmp[32];
        const uint32_t* keyy_last = &keylocks[li & -32];
        if (li % 32) {
          for (int k = 0; k < li % 32; ++k)
            tmp[k] = keyy_last[k];
          for (int k = li % 32; k < 32; ++k)
            tmp[k] = (uint32_t)-1;
        }
        const uint32_t* keyx_last = &keyx[ni % 32];
        int nz = 0;
        for (; keyx != keyx_last; keyx += 1) {
          if (*keyx) {
            __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyx));
            for (const uint32_t* keyy = keylocks; keyy != keyy_last; keyy += 32) {
              __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyy[0*8]);
              __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyy[1*8]);
              __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyy[2*8]);
              __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyy[3*8]);
              res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
              res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
              res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
              res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
            }
            if (li % 32) {
              __m256i lock0 = _mm256_loadu_si256((const __m256i*)&tmp[0*8]);
              __m256i lock1 = _mm256_loadu_si256((const __m256i*)&tmp[1*8]);
              __m256i lock2 = _mm256_loadu_si256((const __m256i*)&tmp[2*8]);
              __m256i lock3 = _mm256_loadu_si256((const __m256i*)&tmp[3*8]);
              res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
              res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
              res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
              res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
            }
          } else {
            nz += 1;
          }
        }
        res = nz * li;
      }
      // fold accumulators
      res0 = _mm256_add_epi32(res0, res2);
      res1 = _mm256_add_epi32(res1, res3);
      res0 = _mm256_add_epi32(res0, res1);
      res0 = _mm256_hadd_epi32(res0, res0);
      res0 = _mm256_hadd_epi32(res0, res0);

      res += _mm256_extract_epi32(res0, 0);
      res += _mm256_extract_epi32(res0, 4);
      return res;
    }


    Simple question:: how would you port this code to a machine
    with a different SIMD instruction set ??
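
    For reference, the portable scalar kernel that all of the intrinsics
    above hand-vectorize is roughly the following (reconstructed from the
    code, not Michael's original plain-C version); porting to another SIMD
    ISA means redoing the broadcast/AND/compare/accumulate pattern by hand,
    or handing this loop to that target's autovectorizer:

        int foo_scalar(const uint32_t* keylocks, int len, int li)
        {
            int res = 0;
            for (int j = li; j < len; j++)           /* keys  */
                for (int i = 0; i < li; i++)         /* locks */
                    res += (keylocks[i] & keylocks[j]) == 0;
            return res;
        }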

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Feb 14 21:52:39 2025
    On Fri, 14 Feb 2025 21:14:11 +0000, BGB wrote:

    On 2/13/2025 1:09 PM, Marcus wrote:
    -------------

    The problem arises when the programmer *deliberately* does unaligned
    loads and stores in order to improve performance. Or rather, if the
    programmer knows that the hardware supports unaligned loads and stores,
    he/she can use that to write faster code in some special cases.


    Pretty much.


    This is partly why I am in favor of potentially adding explicit keywords
    for some of these cases, or to reiterate:
    __aligned:
    Inform compiler that a pointer is aligned.
    May use a faster version if appropriate.
    If a faster aligned-only variant exists of an instruction.
    On an otherwise unaligned-safe target.
    __unaligned: Inform compiler that an access is unaligned.
    May use a runtime call or similar if necessary,
    on an aligned-only target.
    May do nothing on an unaligned-safe target.
    None: Do whatever is the default.
    Presumably, assume aligned by default,
    unless target is known unaligned-safe.

    It would take LESS total man-power world-wide and over-time to
    simply make HW perform misaligned accesses.
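
    For what it's worth, part of the "__aligned" half of the proposal
    already exists as a compiler hint today; a minimal sketch using
    GCC/Clang's __builtin_assume_aligned (the function and names around it
    are just for illustration; MSVC similarly has an __unaligned pointer
    qualifier on some targets for the other half):

        #include <stdint.h>

        int64_t sum16(const int64_t *p)
        {
            /* promise the compiler that p is 16-byte aligned, so it may use
               aligned vector loads where that matters */
            const int64_t *q = __builtin_assume_aligned(p, 16);
            int64_t s = 0;
            for (int i = 0; i < 16; i++)
                s += q[i];
            return s;
        }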

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to BGB on Fri Feb 21 14:51:34 2025
    BGB wrote:

    Can note that the latency of carry-select adders is a little weird:
    16/32/64: Latency goes up steadily;
    But, still less than linear;
    128-bit: Only slightly more latency than 64-bit.

    The best I could find in past testing was seemingly 16-bit chunks for
    normal adding. Where, 16-bits seemed to be around the break-even between
    the chained CARRY4's and the Carry-Select (CS being slower below 16 bits).

    But, for a 64-bit adder, still basically need to give it a clock-cycle
    to do its thing. Though, not like 32 is particularly fast either; hence
    part of the whole 2 cycle latency on ALU ops thing. Mostly has to do
    with ADD/SUB (and CMP, which is based on SUB).


    Admittedly part of why I have such mixed feelings on full
    compare-and-branch:
    Pro: It can offer a performance advantage (in terms of per-clock);
    Con: Branch is now beholden to the latency of a Subtract.

    IIRC your cpu clock speed is about 75 MHz (13.3 ns)
    and you are saying it takes 2 clocks for a 64-bit ADD.

    I don't remember what Xilinx chip you are using but this paper describes
    how to do a 64-bit ADD at between 350 Mhz (2.8 ns) to 400 MHz (2.5 ns)
    on a Virtex-5:

    A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010 https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/project_1_resources/Adders_MELECON_2010.pdf

    and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:

    Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016 http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
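
    For illustration, the carry-select structure BGB describes can be
    modeled in C roughly like this (a toy sketch of mine, not his Verilog):
    each 16-bit chunk computes two candidate sums, for carry-in 0 and 1,
    and the real carry only has to pick between them.

        #include <stdint.h>

        uint64_t csel_add64(uint64_t a, uint64_t b)
        {
            uint64_t sum = 0;
            unsigned carry = 0;
            for (int i = 0; i < 64; i += 16) {
                uint32_t ai = (a >> i) & 0xFFFF, bi = (b >> i) & 0xFFFF;
                uint32_t s0 = ai + bi;          /* candidate: carry-in = 0 */
                uint32_t s1 = ai + bi + 1;      /* candidate: carry-in = 1 */
                uint32_t s  = carry ? s1 : s0;  /* select with incoming carry */
                sum |= (uint64_t)(s & 0xFFFF) << i;
                carry = s >> 16;                /* carry-out of this chunk */
            }
            return sum;
        }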

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)