• Re: Arm ldaxr / stxr loop question

    From Scott Lurndal@21:1/5 to EricP on Sat Nov 9 14:23:47 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:

    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?

    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Sat Nov 9 23:18:14 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/8/2024 6:19 AM, Scott Lurndal wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:


    A real world example from the linux kernel:

    static __always_inline s64
    __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
    {
    s64 result;
    unsigned long tmp;

    asm volatile("// atomic64_dec_if_positive\n"
    " prfm pstl1strm, %2\n"
    "1: ldxr %0, %2\n"
    " subs %0, %0, #1\n"
    " b.lt 2f\n"
    " stlxr %w1, %0, %2\n"
    " cbnz %w1, 1b\n"
    " dmb ish\n"

    "dmb ish" is interesting to me for some reason...

    Data Memory Barrier - Inner Shareable coherency domain
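
    For readers without an AArch64 toolchain, the same decrement-if-positive
    logic can be sketched in portable C11. This is not the kernel's code: the
    name dec_if_positive is hypothetical, and a compare-exchange loop stands in
    for the ldxr/stlxr pair. How a compiler lowers the loop (exclusive
    load/store on ARMv8.0, a CAS instruction with the ARMv8.1 atomics) is an
    assumption about the toolchain.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical portable analogue of the quoted LL/SC loop.
 * Returns the old value minus one. The store is skipped when that
 * result would be negative, mirroring the b.lt early exit.
 * Assumes the counter never holds INT64_MIN (old - 1 would overflow).
 */
static int64_t dec_if_positive(_Atomic int64_t *v)
{
    assert(v != NULL);

    int64_t old = atomic_load_explicit(v, memory_order_relaxed);
    int64_t result;

    do {
        result = old - 1;
        if (result < 0)
            return result;  /* nothing is stored on this path */
        /* On failure the call refreshes 'old' with the current value. */
    } while (!atomic_compare_exchange_weak_explicit(
                 v, &old, result,
                 memory_order_seq_cst,    /* ordering when the store lands */
                 memory_order_relaxed));  /* ordering when we must retry */

    return result;
}
```

    Starting from a counter of 2, three calls return 1, 0 and -1, and the
    stored value stays at 0 after the last call.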

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sun Nov 10 01:26:22 2024
    On Sat, 9 Nov 2024 23:18:14 +0000, Scott Lurndal wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/8/2024 6:19 AM, Scott Lurndal wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:


    A real world example from the linux kernel:

    static __always_inline s64
    __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
    {
    s64 result;
    unsigned long tmp;

    asm volatile("// atomic64_dec_if_positive\n"
    " prfm pstl1strm, %2\n"
    "1: ldxr %0, %2\n"
    " subs %0, %0, #1\n"
    " b.lt 2f\n"
    " stlxr %w1, %0, %2\n"
    " cbnz %w1, 1b\n"
    " dmb ish\n"

    "dmb ish" is interesting to me for some reason...

    Data Memory Barrier - Inner Shareable coherency domain

    It reads better without explanation ...

  • From Lawrence D'Oliveiro@21:1/5 to All on Sun Nov 10 01:37:26 2024
    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).

    Can anybody find any other example of any IBM engineer ever having a sense
    of humour? Ever?

    Anybody?

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun Nov 10 02:44:39 2024
    On Sun, 10 Nov 2024 1:37:26 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).

    Can anybody find any other example of any IBM engineer ever having a
    sense of humour?

    We got past our censors (management) a control register
    in Mc 88100 called FPECR -- Floating Point Exception
    Control Register. We were rather happy about it, too.

    Ed Rupp wrote the 68020/30 µCode assembler. Due to the
    way we implemented µROM, we could interchange rows and
    columns to optimize various stuff. We (the engineers)
    got together one night and rearranged the rows and
    columns such that if you looked at µROM from a good
    distance back, you would see "Moto Man Lives" in bits
    across the ROM. ...
    Actually got in trouble for that one ...

    Ever?

    Anybody?

  • From EricP@21:1/5 to Scott Lurndal on Sun Nov 10 16:00:23 2024
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

  • From MitchAlsup1@21:1/5 to EricP on Sun Nov 10 23:08:21 2024
    On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction?

    The advantage is consuming OpCode space at breathtaking speed.
    Oh wait...

    Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    Because the memory model was not built with the notion of memory order,
    and because not all ATOMIC events start or end with a recognizable
    instruction. Having ATOMICs announce their beginning and ending eliminates
    the need for fencing, even if you keep a <relatively> relaxed memory
    order model.

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    Blame Leslie Lamport for those requirements.

  • From Michael S@21:1/5 to EricP on Mon Nov 11 12:41:08 2024
    On Sun, 10 Nov 2024 16:00:23 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. It seems like optimistic LL/SC instead
    of pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.



    The correct question is not "Why have them?", but "Why not?".
    In an ISA with fixed 32-bit instructions and 32 GPRs, opcode space
    for 2-reg operations without an immediate is extremely cheap.

  • From Scott Lurndal@21:1/5 to EricP on Mon Nov 11 13:57:55 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    "Limited ordering regions allow large systems to perform
    special Load-Acquire and Store-Release instructions that
    provide order between the memory accesses to a region of
    the PA map as observed by a limited set of observers."

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Nov 11 13:59:22 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction?

    The advantage is consuming OpCode space at breathtaking speed.
    Oh wait...

    Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    Because the memory model was not built with the notion of memory order,
    and because not all ATOMIC events start or end with a recognizable
    instruction. Having ATOMICs announce their beginning and ending eliminates
    the need for fencing, even if you keep a <relatively> relaxed memory
    order model.

    There are fully atomic instructions, the load/store exclusives are
    generally there for backward compatibility with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.

  • From jseigh@21:1/5 to Scott Lurndal on Mon Nov 11 09:56:44 2024
    On 11/11/24 08:59, Scott Lurndal wrote:


    There are fully atomic instructions, the load/store exclusives are
    generally there for backward compatibility with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    They added the atomics for scalability allegedly. ARM never
    stated what the actual issue was. I suspect they couldn't
    guarantee a memory lock size small enough to eliminate
    destructive interference. Like cache line size instead
    of word size.

    Joe Seigh

  • From Michael S@21:1/5 to Scott Lurndal on Mon Nov 11 16:28:48 2024
    On Mon, 11 Nov 2024 13:59:22 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:

    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. It seems like optimistic LL/SC
    instead of pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.

    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction?

    The advantage is consuming OpCode space at breathtaking speed.
    Oh wait...

    Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    Because the memory model was not built with the notion of memory
    order, and because not all ATOMIC events start or end with a
    recognizable instruction. Having ATOMICs announce their beginning
    and ending eliminates the need for fencing, even if you keep a
    <relatively> relaxed memory order model.

    There are fully atomic instructions, the load/store exclusives are
    generally there for backward compatibility with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    Also for compatibility with Cortex-A53, which is still a significant
    part of the installed base.

  • From EricP@21:1/5 to Scott Lurndal on Mon Nov 11 11:30:46 2024
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).
    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.
    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    "Limited ordering regions allow large systems to perform
    special Load-Acquire and Store-Release instructions that
    provide order between the memory accesses to a region of
    the PA map as observed by a limited set of observers."

    Ok, so that explains LoadLOAcquire, StoreLORelease as they are
    functionally different: it needs to associate the fence with specific
    load and store addresses so it can determine a physical LORegion,
    if any, and thereby limit the scope of the fence actions to that LOR.

    But that doesn't explain Load-Acquire, Load-AcquirePC, and Store-Release.
    Why attach a specific kind of fence action to the general LD or ST?
    They do the same thing in the atomic instructions, eg:

    LDADDB, LDADDAB, LDADDALB, LDADDLB
    Atomic add on byte in memory atomically loads an 8-bit byte from memory,
    adds the value held in a register to it, and stores the result back to
    memory. The value initially loaded from memory is returned in the
    destination register.
    - If the destination register is not WZR, LDADDAB and LDADDALB load from
    memory with acquire semantics.
    - LDADDLB and LDADDALB store to memory with release semantics.
    - LDADDB has neither acquire nor release semantics.
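
    In C11 terms, the four byte-sized variants line up with the memory_order
    argument of atomic_fetch_add_explicit. A minimal sketch: the wrapper names
    are hypothetical, and the instruction named in each comment is what a
    compiler targeting the ARMv8.1 atomics would typically be expected to
    emit, which is an assumption rather than a guarantee.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* The mapping only makes sense when byte atomics are lock-free. */
static_assert(ATOMIC_CHAR_LOCK_FREE == 2, "byte atomics must be lock-free");

/* Each wrapper returns the value held before the add, like the
 * destination register of LDADD. */

static uint8_t add_relaxed(_Atomic uint8_t *p, uint8_t n)  /* ~ LDADDB   */
{
    return atomic_fetch_add_explicit(p, n, memory_order_relaxed);
}

static uint8_t add_acquire(_Atomic uint8_t *p, uint8_t n)  /* ~ LDADDAB  */
{
    return atomic_fetch_add_explicit(p, n, memory_order_acquire);
}

static uint8_t add_release(_Atomic uint8_t *p, uint8_t n)  /* ~ LDADDLB  */
{
    return atomic_fetch_add_explicit(p, n, memory_order_release);
}

static uint8_t add_acq_rel(_Atomic uint8_t *p, uint8_t n)  /* ~ LDADDALB */
{
    return atomic_fetch_add_explicit(p, n, memory_order_acq_rel);
}
```

    The ordering is a property of each individual read-modify-write here, which
    is the same design choice the instruction encodings make: the caller picks
    the weakest ordering that is still correct at each call site instead of
    paying for a full fence around every operation.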

    And this goes on and on for all the other atomic ops, SWP, CAS, CLR, EOR,
    SET, SMIN, SMAX, UMIN, UMAX, and data sizes, half, word, dblword, pair.

    What happens if, like Apple, you want a Processor Consistency model too?
    Instead of just adding one new fence instruction, do they have to add
    all the atomic instructions (ops * sizes) in again?

  • From Scott Lurndal@21:1/5 to EricP on Mon Nov 11 17:11:10 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?
    LL/SC vs cmpxchg8b?
    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).
    Aarch64 also has CASP, a 128-bit atomic compare and swap
    instruction.
    Thanks, I missed that.

    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.


    "Limited ordering regions allow large systems to perform
    special Load-Acquire and Store-Release instructions that
    provide order between the memory accesses to a region of
    the PA map as observed by a limited set of observers."

    Ok, so that explains LoadLOAcquire, StoreLORelease as they are
    functionally different: it needs to associate the fence with specific
    load and store addresses so it can determine a physical LORegion,
    if any, and thereby limit the scope of the fence actions to that LOR.

    But that doesn't explain Load-Acquire, Load-AcquirePC, and Store-Release.
    Why attach a specific kind of fence action to the general LD or ST?
    They do the same thing in the atomic instructions, eg:

    Note that the atomics were added in V8.1, and were optional at that
    time.

    From the ARMv8 ARM:

    Arm provides a set of instructions with Acquire semantics for
    loads, and Release semantics for stores. These instructions
    support the Release Consistency sequentially consistent (RCsc) model.
    In addition, FEAT_LRCPC provides Load-AcquirePC instructions. The
    combination of Load-AcquirePC and Store-Release can be used to
    support the weaker Release Consistency processor consistent (RCpc) model.
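
    The Store-Release / Load-Acquire pairing described in that passage is the
    classic publish/consume pattern. A minimal C11 sketch follows; struct
    mailbox, publish and try_consume are hypothetical names. The expectation
    that a release store becomes STLR, and that an acquire load becomes LDAR
    (or LDAPR when the compiler targets FEAT_LRCPC), is an assumption about
    typical AArch64 code generation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mailbox {
    int payload;        /* plain data, published through 'ready' */
    atomic_bool ready;  /* flag that carries the ordering */
};

/* Producer side: write the data, then set the flag with release
 * semantics so the payload write cannot be observed after the flag. */
static void publish(struct mailbox *m, int value)
{
    assert(m != NULL);
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Consumer side: read the flag with acquire semantics so the payload
 * read cannot be hoisted above it. Returns false if nothing is ready. */
static bool try_consume(struct mailbox *m, int *out)
{
    assert(m != NULL && out != NULL);
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;
    *out = m->payload;
    return true;
}
```

    Acquire is enough for this pattern. A memory_order_seq_cst load would
    additionally be ordered after earlier seq_cst stores by the same thread,
    which is the extra strength that separates the RCsc flavour from RCpc.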

  • From Scott Lurndal@21:1/5 to jseigh on Mon Nov 11 17:17:56 2024
    jseigh <jseigh_es00@xemaps.com> writes:
    On 11/11/24 08:59, Scott Lurndal wrote:


    There are fully atomic instructions, the load/store exclusives are
    generally there for backward compatibility with armv7; the full set
    of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
    ARMv8.1.


    They added the atomics for scalability allegedly. ARM never
    stated what the actual issue was. I suspect they couldn't
    guarantee a memory lock size small enough to eliminate
    destructive interference. Like cache line size instead
    of word size.

    Speculation is seldom accurate. I would suggest that it
    is more likely that there were requests from ARM customers
    who were looking to build larger SMP systems and it had been
    clear for decades that LL/SC could not scale to larger
    processor counts.

  • From Malcolm Beattie@21:1/5 to Lawrence D'Oliveiro on Mon Nov 11 18:17:54 2024
    On 2024-11-10, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).

    Can anybody find any other example of any IBM engineer ever having a sense
    of humour? Ever?

    One of the resource types in JES2, the batch subsystem for z/OS, is
    BERT ("Block Extension Reuse Table") and needs some sizing/tuning by
    the sysprog. Not too noticeable as humorous but for low-level use
    from Assembler some of the macros which manipulate them allow you to
    (1) copy one into memory, i.e. "Deliver Or Get" a BERT
    (2) define a hook to get control when a BERT is released, i.e.
    "Do It Later" for a BERT release.
    (3) generate a control block for a related data area, i.e. a
    "Collector Attribute Table" for BERTs.

    These macros are
    (1) $DOGBERT
    (2) $DILBERT
    (3) $CATBERT

    --Malcolm

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Nov 26 01:28:11 2024
    On Fri, 22 Nov 2024 15:45:20 +0000, Scott Lurndal wrote:

    aph@littlepinkcloud.invalid writes:
    Kent Dickey <kegs@provalid.com> wrote:
    ------------------------
    So it seems. I think everything in DDI0487J was meant to be there in
    DDI0487K, but it looks like it's all been macro-expanded and some
    things fell off the page, because reasons.

    Between DDI0487G and DDI0487H, they completely rewrote the ARM
    using a requirements based description rather than the straightforward
    prose in prior editions.

    It takes the average (diligent) engineer 30-35 years to learn how to
    specify things where a well meaning designer cannot misunderstand what
    was said and written.

    Good documents are very hard indeed.

  • From aph@littlepinkcloud.invalid@21:1/5 to EricP on Tue Nov 12 12:14:47 2024
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Any idea what is the advantage for them having all these various
    LDxxx and STxxx instructions that only seem to combine a LD or ST
    with a fence instruction? Why have
    LDAPR Load-Acquire RCpc Register
    LDAR Load-Acquire Register
    LDLAR LoadLOAcquire Register

    plus all the variations for byte, half, word, and pair,
    instead of just the standard LDx and a general data fence instruction?

    All this, and much more can be discovered by reading the AMBA
    specifications. However, the main point is that the content of the
    target address does not have to be transferred to the local cache:
    these are remote atomic operations. Quite nice for things like
    fire-and-forget counters, for example.
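
    A fire-and-forget counter of that kind can be sketched in C11 as a relaxed
    fetch-add whose result is discarded. The names count_event and read_count
    are hypothetical. Whether the compiler emits a store-form atomic such as
    STADD, and whether the interconnect then performs it remotely, depends on
    the toolchain and the hardware; both are assumptions here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Bump a statistics counter. The old value is deliberately ignored,
 * so nothing has to be returned to the core that issued the add. */
static void count_event(atomic_ulong *counter)
{
    assert(counter != NULL);
    (void)atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}

/* Read the counter. Relaxed is enough for a statistic that is not
 * used to synchronize any other data. */
static unsigned long read_count(atomic_ulong *counter)
{
    assert(counter != NULL);
    return atomic_load_explicit(counter, memory_order_relaxed);
}
```

    Relaxed ordering fits because the counter orders nothing else. If the
    count were used to publish other data, acquire or release ordering would
    be needed on the relevant side.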

    The execution time of each is the same, and the main cost is the fence
    synchronizing the Load Store Queue with the cache, flushing the cache
    comms queue and waiting for all outstanding cache ops to finish.

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Andrew.

  • From Thomas Koenig@21:1/5 to Malcolm Beattie on Tue Nov 12 13:55:25 2024
    Malcolm Beattie <mbeattie@clueful.co.uk> schrieb:
    On 2024-11-10, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:

    It reads better without explanation ...

    Reminds me of the “EIEIO” instruction from IBM POWER (or was it only
    PowerPC).

    Can anybody find any other example of any IBM engineer ever having a sense
    of humour? Ever?

    One of the resource types in JES2, the batch subsystem for z/OS, is
    BERT ("Block Extension Reuse Table") and needs some sizing/tuning by
    the sysprog. Not too noticeable as humorous but for low-level use
    from Assembler some of the macros which manipulate them allow you to
    (1) copy one into memory, i.e. "Deliver Or Get" a BERT
    (2) define a hook to get control when a BERT is released, i.e.
    "Do It Later" for a BERT release.
    (3) generate a control block for a related data area, i.e. a
    "Collector Attribute Table" for BERTs.

    These macros are
    (1) $DOGBERT
    (2) $DILBERT
    (3) $CATBERT

    Do you know if these macros existed before 1993, when Dilbert was
    first released?

  • From aph@littlepinkcloud.invalid@21:1/5 to Chris M. Thomasson on Tue Nov 12 23:02:13 2024
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithm needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.
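
    Peterson's algorithm, mentioned in the quoted text, is a compact way to
    see why the store-then-load ordering matters. Below is a minimal C11
    sketch for two threads that uses sequentially consistent atomics
    throughout. The names want, turn, peterson_lock and peterson_unlock are
    hypothetical. The expectation that these seq_cst stores and loads become
    STLR and LDAR on AArch64 is an assumption about the compiler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool want[2];  /* want[i]: thread i wants the lock */
static atomic_int turn;      /* which thread must wait on a tie */

/* 'self' is 0 or 1 and identifies the calling thread. */
static void peterson_lock(int self)
{
    assert(self == 0 || self == 1);
    int other = 1 - self;

    /* Announce intent, then give the other thread priority. */
    atomic_store_explicit(&want[self], true, memory_order_seq_cst);
    atomic_store_explicit(&turn, other, memory_order_seq_cst);

    /* These loads must not be satisfied before the stores above are
     * visible to the other thread. That is the StoreLoad requirement,
     * supplied here by seq_cst on both sides. */
    while (atomic_load_explicit(&want[other], memory_order_seq_cst) &&
           atomic_load_explicit(&turn, memory_order_seq_cst) == other) {
        /* spin until the other thread leaves or yields the turn */
    }
}

static void peterson_unlock(int self)
{
    assert(self == 0 || self == 1);
    atomic_store_explicit(&want[self], false, memory_order_seq_cst);
}
```

    If the stores in peterson_lock were weakened to release and the loads to
    acquire, each thread could read the other's want flag before its own store
    became visible, and both could enter the critical section.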

    Andrew.

  • From jseigh@21:1/5 to aph@littlepinkcloud.invalid on Tue Nov 12 18:55:42 2024
    On 11/12/24 18:02, aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithm needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Andrew.

    Does ARM use acquire and release differently than everyone else?
    I'm not sure where StoreLoad fits in with those.

    Joe Seigh

  • From MitchAlsup1@21:1/5 to aph@littlepinkcloud.invalid on Wed Nov 13 00:29:42 2024
    On Tue, 12 Nov 2024 23:02:13 +0000, aph@littlepinkcloud.invalid wrote:

    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithm needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    1990-1992: I was working on Mc88120. It had a conditional cache--a
    place to store store-data until the store instruction became consistent.
    After becoming consistent, the store data would migrate to L1 or on
    to DRAM, ... This structure could be probed for memory order rather
    similar to what ARM is doing.

    Andrew.

  • From Terje Mathisen@21:1/5 to aph@littlepinkcloud.invalid on Wed Nov 13 07:37:46 2024
    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:

    One other thing to be aware of is that the StoreLoad barrier needed
    for sequential consistency is logically part of an LDAR, not part of a
    STLR. This is an optimization, because the purpose of a StoreLoad in
    that situation is to prevent you from seeing your own stores to a
    location before everyone else sees them.

    Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
    store followed by a load to another location to hold. LoadStore is
    not strong enough. The SMR algorithm needs that. Iirc, Peterson's
    algorithm needs it as well.

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Isn't this just reusing the normal forwarding network?

    If not found, you do as usual and start a regular load operation, but
    now you also know that you can skip the flushing of the same?

    PS. I do agree that it is a good idea (even patent-worthy?), but not
    brilliant since it is so very obvious in hindsight.

    To me brilliant is something that still isn't obvious after learning
    about it.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From aph@littlepinkcloud.invalid@21:1/5 to Terje Mathisen on Wed Nov 13 18:13:04 2024
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:


    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Isn't this just reusing the normal forwarding network?

    If not found, you do as usual and start a regular load operation, but
    now you also know that you can skip the flushing of the same?

    Yes. As long as the data in the store buffer doesn't overlap with what
    you're about to write, you can skip the flushing.

    PS. I do agree that it is a good idea (even patent-worthy?), but not brilliant since it is so very obvious in hindsight.

    LOL! :-)


    To me brilliant is something that still isn't obvious after learning
    about it.

    You have very high standards.

    Andrew.

  • From aph@littlepinkcloud.invalid@21:1/5 to jseigh on Wed Nov 13 18:07:17 2024
    jseigh <jseigh_es00@xemaps.com> wrote:
    On 11/12/24 18:02, aph@littlepinkcloud.invalid wrote:

    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Does ARM use acquire and release differently than everyone else?
    I'm not sure where StoreLoad fits in with those.

    Yes. LDAR and STLR, used together, are sequentially consistent. This
    is a stronger guarantee than acquire and release.
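
    The store-then-load case discussed above is the classic "store
    buffering" litmus test. A minimal sketch in portable C11, assuming the
    hypothetical names x, y, sb_thread0 and sb_thread1 (none of them come
    from the thread): with memory_order_seq_cst on every access, which
    current compilers typically lower to STLR and LDAR on AArch64, the
    outcome where both functions return 0 is forbidden when they run on
    two threads. With plain release/acquire it would be allowed.

```c
#include <stdatomic.h>

/* Two shared flags, both zero-initialised (hypothetical names). */
static atomic_int x, y;

/* Thread 0: store to x, then load from a different location, y. */
int sb_thread0(void)
{
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    return atomic_load_explicit(&y, memory_order_seq_cst);
}

/* Thread 1: the mirror image, store to y and then load x. */
int sb_thread1(void)
{
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    return atomic_load_explicit(&x, memory_order_seq_cst);
}
```

    Run one after the other on a single thread, the first call sees 0 and
    the second sees 1; the interesting property is that no concurrent
    interleaving can make both see 0.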

    Andrew.

  • From Kent Dickey@21:1/5 to Scott Lurndal on Thu Nov 14 06:24:32 2024
    In article <YfxXO.384093$EEm7.56154@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    Do read B2.3 Definition of the Arm memory model. It's only 32 pages,
    and very clearly defines the memory model.

    Your definition of "clearly" differs from mine.

    Look at Pick dependencies on page B2-239 and B2-240:
    (I'm replacing complicating details with "blah blah" or "A, B, C", to
    highlight the issue I want to point out)

    ---
    Pick Basic dependency:
    There is A, B, C, or a Pick dependency between E1 and E2
    Pick Data dependency:
    There is a Pick Basic dependency from E1 to E2 and blah blah.
    Pick Address dependency:
    There is a Pick Data dependency from E1 to E3 and E2 is blah blah
    Pick Control dependency:
    This is a Pick Basic dependency from E1 to E3 and E2 is blah blah
    Pick Dependency:
    There is a Pick Basic, Pick Address, Pick Data, or Pick Control
    dependency from E1 to E2
    ---

    This is completely circular, and never defines what "pick" is.

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect
    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect
    ---
  • From aph@littlepinkcloud.invalid@21:1/5 to Kent Dickey on Thu Nov 14 09:23:23 2024
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect
    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect

    I don't understand this. However, here are the actual words:

    Pick Basic dependency

    A Pick Basic dependency from a read Register effect or read Memory
    effect R1 to a Register effect or Memory effect E2 exists if one
    of the following applies:

    • There is a Dependency through registers and memory from R1 to E2.
    • There is an Intrinsic Control dependency from R1 to E2.
    • There is a Pick Basic dependency from R1 to an Effect E3 and
    there is a Pick Basic dependency from E3 to E2.

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Andrew.

  • From Terje Mathisen@21:1/5 to aph@littlepinkcloud.invalid on Thu Nov 14 10:41:14 2024
    aph@littlepinkcloud.invalid wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    aph@littlepinkcloud.invalid wrote:
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:


    That's right, but my point about LDAR on AArch64 is that you can get
    sequential consistency without needing a StoreLoad. LDAR can peek
    inside the store buffer and, much of the time, determine that it isn't
    necessary to do a flush. I don't know if Arm were the first to do
    this, but I don't recall seeing it before. It is a brilliant idea.

    Isn't this just reusing the normal forwarding network?

    If not found, you do as usual and start a regular load operation, but
    now you also know that you can skip the flushing of the same?

    Yes. As long as the data in the store buffer doesn't overlap with what
    you're about to write, you can skip the flushing.

    PS. I do agree that it is a good idea (even patent-worthy?), but not
    brilliant since it is so very obvious in hindsight.

    LOL! :-)


    To me brilliant is something that still isn't obvious after learning
    about it.

    You have very high standards.

    That is one of the reasons I never started a PhD track, I could never
    find an area of study that I thought would be sufficiently ground-breaking.

    The other reason is/was that my friend Andy "Crazy" Glew did try the PhD
    route for several years and hit the same stumbling block vs his
    advisors, and I know that Andy is an idea machine well beyond myself.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Kent Dickey@21:1/5 to aph@littlepinkcloud.invalid on Thu Nov 21 05:46:56 2024
    In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>,
    <aph@littlepinkcloud.invalid> wrote:
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect
    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect

    I don't understand this. However, here are the actual words:

    Pick Basic dependency

    A Pick Basic dependency from a read Register effect or read Memory
    effect R1 to a Register effect or Memory effect E2 exists if one
    of the following applies:

    • There is a Dependency through registers and memory from R1 to E2.
    • There is an Intrinsic Control dependency from R1 to E2.
    • There is a Pick Basic dependency from R1 to an Effect E3 and
    there is a Pick Basic dependency from E3 to E2.

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Andrew.

    Where did you get that from? I cannot find it in the current Arm document DDI0487K_a_a-profile-architecture_reference_manual.pdf. Get it from https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    My text for Pick Basic dependency is a quote (where I label the lines
    1a,1b, etc., where it's just bullets in the Arm document) from page B2-239, middle of the page.

    That sort of "summary" was exactly what I was asking for, but I don't see it, so can you please name the page?

    I'm pretty sure there are confusing typos all through this section
    (E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
    was a doozy.

    It's likely the wording was better in an earlier document, I've noticed
    this section getting more opaque over time.

    Kent

  • From MitchAlsup1@21:1/5 to jseigh on Thu Oct 31 19:12:43 2024
    On Mon, 28 Oct 2024 19:13:03 +0000, jseigh wrote:

    So if I were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .L0 -- loop if currently locked
    stxr -- store 1
    cbnz .L0 -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    Loads from the locked critical region could move forward of
    the stxr but there's a control dependency from cbnz branch
    instruction so they would be speculative loads until the
    loop exited.

    You'd still potentially have loads before the store of
    the lockword but in this case that's not a problem
    since it's known the lockword was 0 and no stores
    from prior locked code could occur.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LL/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    Joe Seigh

    My guess is that so few of us understand ARM fence
    mechanics that we cannot address the asked question.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Oct 31 20:35:58 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 28 Oct 2024 19:13:03 +0000, jseigh wrote:

    So if I were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .L0 -- loop if currently locked
    stxr -- store 1
    cbnz .L0 -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    Loads from the locked critical region could move forward of
    the stxr but there's a control dependency from cbnz branch
    instruction so they would be speculative loads until the
    loop exited.

    You'd still potentially have loads before the store of
    the lockword but in this case that's not a problem
    since it's known the lockword was 0 and no stores
    from prior locked code could occur.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LL/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    Joe Seigh

    My guess is that so few of us understand ARM fence
    mechanics that we cannot address the asked question.

    Load-Acquire Exclusive Register derives an address from a base
    register value, loads a 32-bit word or 64-bit doubleword from memory,
    and writes it to a register. The memory access is atomic. The PE marks
    the physical address being accessed as an exclusive access. This exclusive
    access mark is checked by Store Exclusive instructions. See Synchronization
    and semaphores. The instruction also has memory ordering semantics as
    described in Load-Acquire, Load-AcquirePC, and Store-Release. For
    information about memory accesses, see Load/store addressing modes.


    Arm provides a set of instructions with Acquire semantics for loads,
    and Release semantics for stores. These instructions support the
    Release Consistency sequentially consistent (RCsc) model. In addition,
    FEAT_LRCPC provides Load-AcquirePC instructions. The combination of
    Load-AcquirePC and Store-Release can be used to support the weaker Release
    Consistency processor consistent (RCpc) model.

    The full definitions of the Load-Acquire and Load-AcquirePC instructions
    are covered formally in the Definition of the Arm memory model.

    https://developer.arm.com/documentation/102105/latest/

  • From Lawrence D'Oliveiro@21:1/5 to jseigh on Fri Nov 8 03:17:51 2024
    On Mon, 28 Oct 2024 15:13:03 -0400, jseigh wrote:

    So if I were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .L0 -- loop if currently locked
    stxr -- store 1
    cbnz .L0 -- retry if stxr failed

    The closest I could find to this was on page 8367
    of DDI0487G_a_armv8_arm.pdf from infocenter.arm.com:

    Loop
    LDAXR W5, [X1] ; read lock with acquire
    CBNZ W5, Loop ; check if 0
    STXR W5, W0, [X1] ; attempt to store new value
    CBNZ W5, Loop ; test if store succeeded and retry if not

  • From aph@littlepinkcloud.invalid@21:1/5 to Chris M. Thomasson on Thu Nov 14 23:20:02 2024
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 11/14/2024 1:23 AM, aph@littlepinkcloud.invalid wrote:

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Data dependencies, as in stronger than a DEC Alpha, which does not
    honor data-dependent loads?

    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.
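
    For readers who have not met the Alpha problem: the pattern at stake
    is publishing a pointer and then dereferencing it. A minimal sketch in
    C11, assuming the hypothetical names message, slot, published, publish
    and consume_payload. C11 intended memory_order_consume for this
    dependency-ordered case, but compilers generally treat consume as
    acquire, so the sketch uses acquire directly.

```c
#include <stdatomic.h>
#include <stddef.h>

struct message { int payload; };

static struct message slot;                     /* data being published */
static _Atomic(struct message *) published;     /* NULL until publish() runs */

/* Writer: fill in the data, then publish the pointer with release. */
void publish(int value)
{
    slot.payload = value;
    atomic_store_explicit(&published, &slot, memory_order_release);
}

/* Reader: load the pointer, then dereference it. The dereference is
 * address-dependent on the load. Returns -1 if nothing is published yet. */
int consume_payload(void)
{
    struct message *m =
        atomic_load_explicit(&published, memory_order_acquire);
    return m ? m->payload : -1;
}
```

    On Alpha the reader needed an explicit barrier between the pointer
    load and the dereference; on architectures that honor address
    dependencies, the dependent load is ordered after the pointer load
    without one.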

    Andrew.

  • From aph@littlepinkcloud.invalid@21:1/5 to Kent Dickey on Thu Nov 21 17:41:33 2024
    Kent Dickey <kegs@provalid.com> wrote:
    In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>, <aph@littlepinkcloud.invalid> wrote:
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    ---
    Pick Basic Dependency:
    There is a Pick Basic dependency from an effect E1 to an effect
    E2 if one of the following applies:
    1) One of the following applies:
    a) E1 is an Explicit Memory Read effect

    You're right, they do seem to have forgotten to define Explicit Memory
    Read effect. I'm sure they meant to.

    b) E1 is a Register Read effect
    2) One of the following applies:
    a) There is a Pick dependency through registers and memory
    from E1 to E2
    b) E1 and E2 are the same effect

    I don't understand this. However, here are the actual words:

    Pick Basic dependency

    A Pick Basic dependency from a read Register effect or read Memory
    effect R1 to a Register effect or Memory effect E2 exists if one
    of the following applies:

    . There is a Dependency through registers and memory from R1 to E2.
    . There is an Intrinsic Control dependency from R1 to E2.
    . There is a Pick Basic dependency from R1 to an Effect E3 and
    there is a Pick Basic dependency from E3 to E2.

    Seems reasonable enough in context, no? It's either a data dependency,
    a control dependency, or any transitive combination of them.

    Where did you get that from? I cannot find it in the current Arm document DDI0487K_a_a-profile-architecture_reference_manual.pdf. Get it from https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    Err, the previous version of the same document. :-)

    My text for Pick Basic dependency is a quote (where I label the lines
    1a,1b, etc., where it's just bullets in the Arm document) from page B2-239, middle of the page.

    That sort of "summary" was exactly what I was asking for, but I don't see it, so can you please name the page?

    B2-174 in DDI0487J

    I'm pretty sure there are confusing typos all through this section
    (E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
    was a doozy.

    It's likely the wording was better in an earlier document, I've noticed
    this section getting more opaque over time.

    So it seems. I think everything in DDI0487J was meant to be there in
    DDI0487K, but it looks like it's all been macro-expanded and some
    things fell off the page, because reasons. I believe the author of the
    earlier, easier-to-read version of the Memory Model left Arm for
    another company. If it's any consolation, the version of the MM before
    he rewrote it was absolutely incomprehensible.

    Andrew.

  • From aph@littlepinkcloud.invalid@21:1/5 to jseigh on Fri Nov 1 16:17:49 2024
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if I were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .L0 -- loop if currently locked
    stxr -- store 1
    cbnz .L0 -- retry if stxr failed

    The "lock" operation has memory order acquire semantics and
    we see that in part in the ldaxr but the store isn't part
    of that. We could append an additional acquire memory barrier
    but would that be necessary.

    After the store exclusive, you mean? No, it would not be necessary.

    This should be analogous to rmw atomics like CAS but
    I've no idea what the internal hardware implementations
    are. Though on platforms without CAS the C11 atomics
    are implemented with LD/SC logic.

    Is this sort of what's going on or is the explicit
    acquire memory barrier still needed?

    All of the implementations of things like POSIX mutexes I've seen on
    AArch64 use acquire alone.
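
    In portable terms, "acquire alone" looks roughly like the following
    C11 sketch. The names demo_spinlock, spin_trylock, spin_lock and
    spin_unlock are hypothetical, and this is not taken from any
    particular mutex implementation. The compare-exchange that takes the
    lock is acquire on success and relaxed on failure, the unlock is a
    release store, and there is no separate fence after the locking store.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int word; } demo_spinlock;   /* 0 = free, 1 = held */

/* One attempt: acquire ordering only if the lock was actually taken. */
bool spin_trylock(demo_spinlock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->word, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

void spin_lock(demo_spinlock *l)
{
    for (;;) {
        int expected = 0;
        /* The weak form may fail spuriously, much like a failed stxr. */
        if (atomic_compare_exchange_weak_explicit(
                &l->word, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;
        /* Spin on plain loads until the lock looks free, then retry. */
        while (atomic_load_explicit(&l->word, memory_order_relaxed) != 0)
            ;
    }
}

/* Release store: everything in the critical section happens before it. */
void spin_unlock(demo_spinlock *l)
{
    atomic_store_explicit(&l->word, 0, memory_order_release);
}
```

    Without the atomic instructions mentioned elsewhere in this thread, a
    compiler would typically turn the compare-exchange into an ldaxr/stxr
    loop like the one quoted above.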

    Andrew.

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Fri Nov 8 14:19:16 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Mon, 28 Oct 2024 15:13:03 -0400, jseigh wrote:

    So if I were to implement a spinlock using the above instructions
    something along the lines of

    .L0
    ldaxr -- load lockword exclusive w/ acquire membar
    cmp -- compare to zero
    bne .L0 -- loop if currently locked
    stxr -- store 1
    cbnz .L0 -- retry if stxr failed

    The closest I could find to this was on page 8367
    of DDI0487G_a_armv8_arm.pdf from infocenter.arm.com:

    DDI0487K_a is the most recent.


    Loop
    LDAXR W5, [X1] ; read lock with acquire
    CBNZ W5, Loop ; check if 0
    STXR W5, W0, [X1] ; attempt to store new value
    CBNZ W5, Loop ; test if store succeeded and retry if not

    A real world example from the linux kernel:

    static __always_inline s64
    __ll_sc_atomic64_dec_if_positive(atomic64_t *v)
    {
    s64 result;
    unsigned long tmp;

    asm volatile("// atomic64_dec_if_positive\n"
    " prfm pstl1strm, %2\n"
    "1: ldxr %0, %2\n"
    " subs %0, %0, #1\n"
    " b.lt 2f\n"
    " stlxr %w1, %0, %2\n"
    " cbnz %w1, 1b\n"
    " dmb ish\n"
    "2:"
    : "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
    :
    : "cc", "memory");

    return result;
    }
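
    For comparison, the same operation can be sketched in portable C11
    with a compare-exchange retry loop. This is an illustration, not the
    kernel's code: the name dec_if_positive is hypothetical, and
    memory_order_seq_cst on success only approximates the full ordering
    the kernel gets from stlxr followed by dmb ish.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Decrement *v unless that would make it negative. Returns the
 * decremented value either way; a negative return means *v was left
 * untouched. Assumes *v > INT64_MIN so that old - 1 cannot overflow. */
int64_t dec_if_positive(_Atomic int64_t *v)
{
    int64_t old = atomic_load_explicit(v, memory_order_relaxed);
    for (;;) {
        int64_t result = old - 1;
        if (result < 0)
            return result;          /* like "b.lt 2f": skip the store */
        if (atomic_compare_exchange_weak_explicit(
                v, &old, result,
                memory_order_seq_cst, memory_order_relaxed))
            return result;
        /* On failure, old now holds the current value; go round again. */
    }
}
```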

  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Fri Nov 8 22:45:51 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if I were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guessing that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM? See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en

  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Fri Nov 8 23:36:24 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:

    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?

    The two concepts are orthogonal in my experience.

    ARM saw the deficiencies of LL/SC very early in the
    V8 architectural definition, and added a set of
    atomic instructions for scalability to large processor
    counts - one advantage is that the atomic operations
    can be delegated to a cache level or memory, thus potentially
    a very minor power savings in cases where contention is
    common (although such LL/SC try loops often include the ARM
    equivalent of the x86 PAUSE or MWAIT instructions to
    allow power savings during the spin).
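
    The difference shows up in source code as a single read-modify-write
    call versus a retry loop. A minimal C11 sketch, with the hypothetical
    names add_single_rmw and add_retry_loop: a compiler targeting the
    atomic instructions can emit one instruction (LDADD, for example) for
    the first function, while the second retries until its
    compare-exchange succeeds, which is the shape an LL/SC sequence has.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One atomic read-modify-write; returns the value before the add. */
uint64_t add_single_rmw(_Atomic uint64_t *p, uint64_t n)
{
    return atomic_fetch_add_explicit(p, n, memory_order_relaxed);
}

/* Same result via an optimistic retry loop; returns the old value. */
uint64_t add_retry_loop(_Atomic uint64_t *p, uint64_t n)
{
    uint64_t old = atomic_load_explicit(p, memory_order_relaxed);
    /* On failure, old is refreshed with the current value. */
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + n,
               memory_order_relaxed, memory_order_relaxed))
        ;
    return old;
}
```

    Both functions produce the same result; under heavy contention the
    loop can retry repeatedly, whereas the single operation cannot fail.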

    Do read B2.3 Definition of the Arm memory model. It's only 32 pages,
    and very clearly defines the memory model.

  • From jseigh@21:1/5 to Chris M. Thomasson on Fri Nov 8 19:34:55 2024
    On 11/8/24 17:56, Chris M. Thomasson wrote:
    On 11/8/2024 2:45 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
    On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
    jseigh <jseigh_es00@xemaps.com> wrote:
    So if I were to implement a spinlock using the above instructions
    something along the lines of


    Fwiw, I am basically asking if the "store" stxr has implied acquire
    semantics wrt the "load" ldaxr? I am guessing that it does... This would
    imply that the acquire membar (#LoadStore | #LoadLoad) would be
    respected by the store at stxr wrt its "attached?" load wrt ldaxr?

    Is this basically right? Or, what am I missing here? Thanks.

    The membar logic wrt acquire needs to occur _after_ the atomic logic
    that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
    needs to occur _before_ the atomic logic that unlocks said spinlock.

    Am I missing anything wrt ARM? ;^o

    Did you read the extensive description of memory semantics
    in the ARMv8 ARM? See page 275 in DDI0487K_a.

    https://developer.arm.com/documentation/ddi0487/ka/?lang=en

    I did not! So I am flying mostly blind here. I don't really have any
    experience with how ARM handles these types of things. Just guessing
    that the store would honor the acquire of the load? Or, does the store
    need a membar and the load does not need acquire at all? I know that the
    membar should be after the final store that actually locks the spinlock
    wrt Joe's example.

    I just need to RTFM!!!!

    Sorry about that Scott. ;^o

    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?


    In this case the stxr doesn't need a memory barrier.
    Loads can move forward of it but not forward of the ldaxr,
    because it has acquire semantics. For a lock that's OK,
    since the stxr would fail if any other thread acquired
    the lock, and the conditional branch would make the loads
    speculative if the stxr failed, I believe.

    Joe Seigh

  • From EricP@21:1/5 to Chris M. Thomasson on Fri Nov 8 21:00:53 2024
    Chris M. Thomasson wrote:
    On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:

    Perhaps sometime tonight. It seems like optimistic LL/SC instead of
    pessimistic CAS RMW type of logic?

    LL/SC vs cmpxchg8b?

    Arm A64 has LDXP Load Exclusive Pair of registers and
    STXP Store Exclusive Pair of registers looks like it can be
    equivalent to cmpxchg16b (aka double-wide compare and swap).

  • From Scott Lurndal@21:1/5 to aph@littlepinkcloud.invalid on Fri Nov 22 15:45:20 2024
    aph@littlepinkcloud.invalid writes:
    Kent Dickey <kegs@provalid.com> wrote:
    In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>,
    <aph@littlepinkcloud.invalid> wrote:
    Kent Dickey <kegs@provalid.com> wrote:

    Even better, let's look at the actual words for Pick Basic Dependency:

    That sort of "summary" was exactly what I was asking for, but I don't see it,
    so can you please name the page?

    B2-174 in DDI0487J

    I'm pretty sure there are confusing typos all through this section
    (E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
    was a doozy.

    It's likely the wording was better in an earlier document, I've noticed
    this section getting more opaque over time.

    So it seems. I think everything in DDI0487J was meant to be there in
    DDI0487K, but it looks like it's all been macro-expanded and some
    things fell off the page, because reasons.

    Between DDI0487G and DDI0487H, they completely rewrote the ARM
    using a requirements based description rather than the straightforward
    prose in prior editions.

    They've been wordsmithing it in every subsequent version.

    I consider the prose version to be more readable, myself.
