• Re: ARM CAS vs LL/SC

    From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue May 26 12:44:20 2026
    From Newsgroup: comp.arch

    On 5/24/2026 2:24 PM, Paul Clayton wrote:
    On 5/21/26 4:17 PM, Scott Lurndal wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    ARM architectures allow (but don't encourage) a reservation
    granule that covers the entire address space (e.g. see the
    ARMv7 ARM).

    Any larger granule assures correctness but hinders performance.
    A global lock works but does not allow much parallelism.

    With a large granule, we need to worry about a single load, say via
    false sharing or something... Well, can that cause the SC to fail?

    FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
    using a hashed lock where address of a target word is used to index into
    an array. Something akin to:

    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
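
    A minimal sketch of the hashed-lock idea described above, assuming a
    fixed table of std::mutex objects indexed by the address of the target
    word. The names lock_table, lock_for and emulated_cas, the table size
    and the shift amount are all hypothetical and are not taken from the
    linked post:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// ASSUMPTION: 64 locks; any power of two works with the mask below.
constexpr std::size_t kLockCount = 64;

// One process-wide table of locks, created on first use.
inline std::array<std::mutex, kLockCount>& lock_table() {
    static std::array<std::mutex, kLockCount> table;
    return table;
}

// Use the address of the target word to index into the lock array.
inline std::mutex& lock_for(const void* addr) {
    auto bits = reinterpret_cast<std::uintptr_t>(addr);
    // Drop the low bits (always zero for aligned words) before masking.
    return lock_table()[(bits >> 3) & (kLockCount - 1)];
}

// Emulated compare-and-swap. It is only atomic with respect to other
// accesses that also take lock_for(target) around the word.
inline bool emulated_cas(std::uint32_t* target,
                         std::uint32_t& expected,
                         std::uint32_t desired) {
    std::lock_guard<std::mutex> guard(lock_for(target));
    if (*target == expected) {
        *target = desired;
        return true;
    }
    expected = *target;  // report the value actually seen
    return false;
}
```

    Two unrelated words can hash to the same lock, which costs some
    parallelism but not correctness.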

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 26 20:58:52 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 5/24/2026 2:24 PM, Paul Clayton wrote:
    On 5/21/26 4:17 PM, Scott Lurndal wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    ARM architectures allow (but don't encourage) a reservation
    granule that covers the entire address space (e.g. see the
    ARMv7 ARM).

    Any larger granule assures correctness but hinders performance.
    A global lock works but does not allow much parallelism.

    With a large granule, we need to worry about a single load, say via
    false sharing or something... Well, can that cause the SC to fail?

    Does this "LL/SC and other core instructions synchronization means" not
    fall from "desirable" when one has a complete set of to-memory() atomic
    actions {add, sub, and, or, xor, xchg, cmp, cas} which avoid all the
    quadratic and cubic interconnect traffic in the system which are the
    real point of slow synchronization ??!!?? while being guaranteed to
    work without an interference and can be done for both cacheable and
    unCacheable memory accesses ??!!??

    FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
    using a hashed lock where address of a target word is used to index into
    an array. Something akin to:

    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue May 26 14:00:36 2026
    From Newsgroup: comp.arch

    On 5/26/2026 1:58 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 5/24/2026 2:24 PM, Paul Clayton wrote:
    On 5/21/26 4:17 PM, Scott Lurndal wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    ARM architectures allow (but don't encourage) a reservation
    granule that covers the entire address space (e.g. see the
    ARMv7 ARM).

    Any larger granule assures correctness but hinders performance.
    A global lock works but does not allow much parallelism.

    With a large granule, we need to worry about a single load, say via
    false sharing or something... Well, can that cause the SC to fail?

    Does this "LL/SC and other core instructions synchronization means" not
    fall from "desirable" when one has a complete set of to-memory() atomic
    actions {add, sub, and, or, xor, xchg, cmp, cas} which avoid all the
    quadratic and cubic interconnect traffic in the system which are the
    real point of slow synchronization ??!!?? while being guaranteed to
    work without an interference and can be done for both cacheable and
    unCacheable memory accesses ??!!??

    Take a look at some S/HTM... A single load can cause a retry, and lead to
    live lock?




    FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
    using a hashed lock where address of a target word is used to index into
    an array. Something akin to:

    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ


  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 27 14:25:19 2026
    From Newsgroup: comp.arch

    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/21/26 4:17 PM, Scott Lurndal wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    ARM architectures allow (but don't encourage) a reservation
    granule that covers the entire address space (e.g. see the
    ARMv7 ARM).

    Any larger granule assures correctness but hinders performance.
    A global lock works but does not allow much parallelism.

    The less specifically the size is defined, the less performance-
    portable software becomes. One can address this with something
    like RISC-V profiles, in which sizes can be more specific and
    software that cares will specify a target profile rather than an
    Architecture (version).

    Since granule size can influence what code is most efficient,
    even recompiling is not an excellent option. So for a class of
    applications, having a single target seems to make sense.

    Being able to test software on a development machine can also be
    useful, so desired performance compatibility might be broader
    than an application type.

    I feel there is relatively little to prevent LL/SC semantics
    from being extended to support multiple cache blocks (or, for
    small LL/SC code bodies, single words for conflicts with other
    atomic operations - normal loads and stores might still use
    cache block granularity to limit complexity and/or network
    overhead).

    It would be limiting to tie LL/SC to cache lines.

    It is not tying the operation to cache lines but to cache
    line granules in terms of external interference monitoring
    (and, in the case of a modest extension beyond traditional
    LL/SC, the scope of the read/write set).

    Atomics are independent of the cache, and can be used with
    both cacheable and non-cacheable memory as well as
    CXL and PCI Express devices.

    I am not certain that LL/SC (or an extended form of such)
    could not be used with "I/O" addresses. This merely requires
    the equivalent of one cache line "cache" (or the largest
    guaranteed size of a transaction) and some form of
    monitoring ("coherence") of such memory addresses.

    In the case of a simple operation, as has been stated before,
    the LL/SC sequence can be converted to the equivalent of an
    atomic instruction.

    If true in the general case (and I'm not sure I see how it
    can be), why bother to add the hardware to do so when
    atomics are generally superior, scalable, simpler to implement and
    higher performance?


    For other operations, I am not certain what semantics make
    sense. If a read at one address changes the behavior of another
    access, does "atomic" behavior mean that the later in program
    order access happens before the I/O agent changes the access
    behavior or does it mean that the atomic action blocks "ordinary
    software agents" but lets side effects caused by the action to
    occur in program order?

    Atomics ensure that the access is atomic with respect to
    all other accessors - ensuring that the other accessors
    will not see inconsistent data.

    Atomics can be used as a basis (e.g. atomic test&set) to
    guard a critical section, but they're also useful for
    adjusting shared counters et alia.
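
    Both uses can be sketched in a few lines, assuming C++11 atomics. The
    names SpinLock, Stats and record are hypothetical, and the spin loop is
    deliberately naive (no backoff):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Critical-section guard built on atomic test&set.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set returns the previous value; spin while it was set.
        while (flag_.test_and_set(std::memory_order_acquire)) {
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};

struct Stats {
    SpinLock lock;
    std::uint64_t guarded_total = 0;      // only touched under lock
    std::atomic<std::uint64_t> hits{0};   // adjusted directly, no lock
};

inline void record(Stats& s, std::uint64_t amount) {
    // Shared counter: one atomic read-modify-write, nothing to guard.
    s.hits.fetch_add(1, std::memory_order_relaxed);

    // Critical section guarded by the test&set lock.
    s.lock.lock();
    s.guarded_total += amount;
    s.lock.unlock();
}
```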

    My perception is that PCI-E atomics are not meant for
    non-idempotent storage. (I do not know how ARM atomic
    instructions handle such cases.)

    See above.
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed May 27 14:08:17 2026
    From Newsgroup: comp.arch

    On 5/20/2026 4:47 PM, Paul Clayton wrote:
    On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
    CAS failures, I have tested this in the past, will hit the bus lock
    and still make forward progress... Sigh... A horrible LL/SC thing can
    live lock!

    LL/SC live lock is implementation dependent. One could
    Architecturally guarantee forward progress for the kind of cases
    where CAS would be an alternative.

    In my opinion, this is not so much a CAS vs. LL/SC issue as a quality of implementation issue.

    Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
    guarantees. Using LL/SC to emulate them is a different story.


    A guarantee of forward progress is not very useful if the progress is
    glacially (or cosmologically) slow. ("We guarantee that the operation
    will complete before the heat death of the universe"☺)

    A _guarantee_ of forward progress is ALWAYS important? Sorry for
    shouting. Shit. Knowing the size of the reservation granule is hyper
    important to help the software pad and align to remove any false sharing
    on said granule. No? But...

    Here's where the deeper problem can rear its ugly head... Vendors often
    don't document it? Or they document it inconsistently across revisions? So
    even if you do everything right in principle, you're tuning against a
    number you had to dig out of a forum post or reverse engineer yourself.
    Scary! ;^o
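
    The pad-and-align step can be sketched as below, assuming a 64-byte
    granule. That number is exactly the kind of value that has to be
    confirmed per target; kAssumedGranule and PaddedCounter are
    hypothetical names:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUMPTION: 64 bytes. Replace with the documented (or measured)
// reservation granule of the actual target.
constexpr std::size_t kAssumedGranule = 64;

// alignas pins the start of the object to a granule boundary and rounds
// sizeof up to a multiple of it, so nothing else shares the granule.
struct alignas(kAssumedGranule) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kAssumedGranule,
              "counter should occupy exactly one assumed granule");
static_assert(alignof(PaddedCounter) == kAssumedGranule,
              "counter should start on a granule boundary");
```

    If the real granule turns out to be larger than the assumed one,
    neighbouring counters share a granule again and the false sharing
    comes back without any compile-time warning.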


    Of course, the temptation toward "good enough" (not so bad that one will
    lose too many customers) is a problem. I would expect documented
    guarantees of sufficient generality to have the cognitive load for
    software developers be acceptable. That such guarantees seem to be very
    rare is sad.

    How many SC failures on a fetch-and-add are acceptable before you
    conclude something's fundamentally broken? For me the answer is: very few.




  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed May 27 14:14:11 2026
    From Newsgroup: comp.arch

    On 5/27/2026 2:08 PM, Chris M. Thomasson wrote:
    [...]
    How many SC failures on a fetch-and-add are acceptable before you
    conclude something's fundamentally broken? For me the answer is: very few.

    A LOCK XADD can be used for wait free algos, a LOCK XADD emulated with
    LL/SC cannot... ?
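
    The contrast can be sketched with C++ atomics. On x86, compilers
    typically turn fetch_add into a single LOCK XADD, while the retry loop
    below stands in for an LL/SC sequence: some thread always succeeds, but
    a given thread has no bound on how many times it loses. The names
    add_direct and add_with_retry are hypothetical:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// One read-modify-write with no retry loop in the source.
inline std::uint32_t add_direct(std::atomic<std::uint32_t>& a,
                                std::uint32_t n) {
    return a.fetch_add(n, std::memory_order_relaxed);
}

// Fetch-and-add built from a retry loop. 'failures' counts how often the
// conditional store lost; compare_exchange_weak may also fail spuriously.
inline std::uint32_t add_with_retry(std::atomic<std::uint32_t>& a,
                                    std::uint32_t n,
                                    std::uint64_t& failures) {
    std::uint32_t seen = a.load(std::memory_order_relaxed);
    // On failure, 'seen' is refreshed with the current value.
    while (!a.compare_exchange_weak(seen, seen + n,
                                    std::memory_order_relaxed)) {
        ++failures;
    }
    return seen;  // value before the add, like fetch_add
}
```

    Counting failures this way is one means of putting a number on "very
    few" for a given machine and contention level.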


  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed May 27 14:24:36 2026
    From Newsgroup: comp.arch

    On 5/27/2026 2:14 PM, Chris M. Thomasson wrote:
    On 5/27/2026 2:08 PM, Chris M. Thomasson wrote:
    [...]
    How many SC failures on a fetch-and-add are acceptable before you
    conclude something's fundamentally broken? For me the answer is: very
    few.

    A LOCK XADD can be used for wait free algos, a LOCK XADD emulated with
    LL/SC cannot... ?



    For x86, it's "easier" for sure... pad _and_ align on an L2 cache line,
    and you should be ideal... So do NOT straddle a cache line and execute a
    damn LOCK RMW on it. Bus lock for sure.
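
    A small debugging check for the straddle case, assuming a 64-byte line.
    kAssumedLine and straddles_line are hypothetical names; the check only
    does address arithmetic and knows nothing about the real hardware:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUMPTION: 64-byte cache lines.
constexpr std::size_t kAssumedLine = 64;

// True if the 'size' bytes starting at p touch more than one line.
inline bool straddles_line(const void* p, std::size_t size,
                           std::size_t line = kAssumedLine) {
    assert(size > 0 && line > 0);
    auto first = reinterpret_cast<std::uintptr_t>(p);
    auto last = first + size - 1;
    return (first / line) != (last / line);
}
```

    A naturally aligned 4- or 8-byte word never straddles a line; packed
    structs and hand-computed offsets into byte buffers are the usual way
    to end up there.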
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 28 01:27:36 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 5/20/2026 4:47 PM, Paul Clayton wrote:
    On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
    CAS failures, I have tested this in the past, will hit the bus lock
    and still make forward progress... Sigh... A horrible LL/SC thing can
    live lock!

    LL/SC live lock is implementation dependent. One could
    Architecturally guarantee forward progress for the kind of cases
    where CAS would be an alternative.

    In my opinion, this is not so much a CAS vs. LL/SC issue as a quality of implementation issue.

    Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
    guarantees. Using LL/SC to emulate them is a different story.


    A guarantee of forward progress is not very useful if the progress is
    glacially (or cosmologically) slow. ("We guarantee that the operation
    will complete before the heat death of the universe"☺)

    A _guarantee_ of forward progress is ALWAYS important? Sorry for
    shouting. Shit. Knowing the size of the reservation granule is hyper important to help the software pad and align to remove any false sharing
    on said granule. No? But...

    Here's where the deeper problem can rear its ugly head... Vendors often
    don't document it? Or they document it inconsistently across revisions? So
    even if you do everything right in principle, you're tuning against a
    number you had to dig out of a forum post or reverse engineer yourself.
    Scary! ;^o


    Of course, the temptation toward "good enough" (not so bad that one will
    lose too many customers) is a problem. I would expect documented
    guarantees of sufficient generality to have the cognitive load for
    software developers be acceptable. That such guarantees seem to be very
    rare is sad.

    How many SC failures on a fetch-and-add are acceptable before you
    conclude something's fundamentally broken? For me the answer is: very few.

    Following an "SC failure", My 66000 provides a readable control register
    called 'WHY' which contains a number. Negative numbers represent kinds
    of failures {resource limit exceeded, time out, ...} while positive
    values indicate how far back in line your request is (measured by a
    resource which has unique system-wide visibility to ATOMIC-order).

    Thus, SW can use WHY to reach deeper into the Queue of pending work and
    select a unit that nobody else is going to go after on the next iteration.



  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun May 31 21:32:14 2026
    From Newsgroup: comp.arch

    On 5/27/26 5:08 PM, Chris M. Thomasson wrote:
    On 5/20/2026 4:47 PM, Paul Clayton wrote:
    On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
    CAS failures, I have tested this in the past, will hit the
    bus lock and still make forward progress... Sigh... A
    horrible LL/SC thing can live lock!

    LL/SC live lock is implementation dependent. One could
    Architecturally guarantee forward progress for the kind of cases
    where CAS would be an alternative.

    In my opinion, this is not so much a CAS vs. LL/SC issue as a
    quality of implementation issue.

    Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
    guarantees. Using LL/SC to emulate them is a different story.

    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations. IBM's constrained
    transactions guaranteed success of a transaction if it met
    certain criteria. A single-instruction LL/SC body could be
    Architecturally guaranteed to perform not only successfully but
    with some performance characteristics.

    A guarantee of forward progress is not very useful if the
    progress is glacially (or cosmologically) slow. ("We guarantee
    that the operation will complete before the heat death of the
    universe"☺)

    A _guarantee_ of forward progress is ALWAYS important? Sorry for
    shouting. Shit. Knowing the size of the reservation granule is
    hyper important to help the software pad and align to remove any
    false sharing on said granule. No? But...

    I disagree. A guarantee that has a time scale beyond human
    civilization much less the lifetime of the hardware seems to
    have extremely little use. It may be reasonable to assume
    reasonable timescales for such guarantees, but a simple
    guarantee of eventual completion (if the system is kept
    operating) might be given if the profit motive seems sufficient.

    (I am not certain if even x86 XLOCK operations are absolutely
    guaranteed to complete in the presence of context switches. A
    hardware thread might always be interrupted while it is
    performing the operation and if the hardware does not delay
    interrupt handling until after the operation completes, then the
    operation may never complete. This may be so extraordinarily
    improbable that an undetected error in ECC-protected memory
    might be more likely, in which case it is not really important.)

    I think one really wants the time scale explicitly declared as
    well as information about the range of latency and causes. Even
    5ms latency can seem like forever.

    Here's where the deeper problem can rear its ugly head... Vendors
    often don't document it? Or they document it inconsistently
    across revisions? So even if you do everything right in
    principle, you're tuning against a number you had to dig out of
    a forum post or reverse engineer yourself. Scary! ;^o

    Ugh!

    Architecting a lot of such factors might help with documentation
    as Architecture is more stable than microarchitecture, but I do
    not think typical companies have the incentives for excellence
    in documentation. If the only consequence of mistakes in
    Architectural documentation is a few software developers
    grumbling, keeping even such stable documentation consistent and
    correct (and abiding by the old/existing Architectural contract)
    seems unlikely to seem important. In fact, if the inability to
    optimize forces people to buy more (or more expensive) hardware,
    poor documentation can mean higher profits.

    Of course, the temptation toward "good enough" (not so bad
    that one will lose too many customers) is a problem. I would
    expect documented guarantees of sufficient generality to have
    the cognitive load for software developers be acceptable. That
    such guarantees seem to be very rare is sad.

    How many SC failures on a fetch-and-add are acceptable before
    you conclude something's fundamentally broken? For me the answer
    is: very few.

    Again, I think this is more a matter of "quality of
    implementation" (and Architectural guarantees about such) than
    of the interface at an instruction level.
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun May 31 23:26:39 2026
    From Newsgroup: comp.arch

    On 5/27/26 10:25 AM, Scott Lurndal wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    [snip]
    In the case of a simple operation, as has been stated before,
    the LL/SC sequence can be converted to the equivalent of an
    atomic instruction.

    If true in the general case (and I'm not sure I see how it
    can be), why bother to add the hardware to do so when
    atomics are generally superior, scalable, simpler to implement and
    higher performance?

    A more generic interface has some advantages.

    I already mentioned that old software that was developed when
    there was not an atomic ["expensive" operation] instruction
    could benefit from idiom recognition on new hardware. (An
    alternative to that would be patching or recompiling the
    software. While I prefer a more abstract software distribution
    format for its ability to avoid having to move things to
    Architecture and even potentially perform microarchitectural
    optimizations at non-instruction granularity, such seems
    unlikely to be common any time soon.)

    Even with atomic instructions, the Architecture generally does
    not provide guarantees about scalability. I doubt any
    implementation would stop-the-world to perform an atomic
    operation (because the performance penalty would be quite
    noticeable), but I can easily imagine an implementation
    waiting until the atomic operation is not speculative before
    starting it.

    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized. (System calls
    have similar excessive, in my opinion, latency. Some of this may
    be from cruft, but I received the impression that optimization
    effort is a significant cause for the higher latency.)

    I do not like the code bloat and decode complexity of using
    LL/SC for simple atomic operations. Unfortunately, even an
    LL-and-SC-after-next-compute instruction (which would allow
    arbitrary single compute instruction atomics and might be
    extended by function call instructions to microcode) would have
    the bloat of redundant register name encoding. Even a diversity
    of addressing modes may be excessive for atomic operations, if
    simple register-indirect with no offset is sufficiently common.

    With destructive operations (like x86), it would be possible to
    avoid the register name overhead by having the LL instruction
    not include a register name, taking it from the following
    compute instruction. For an LL instruction lacking a register
    name, if "microcode" calls were to be supported such call
    instructions would need to specify a register name (or use a
    defined, possibly function-specific ABI). An opcode-only LL
    might reasonably have space for hint/directive metadata, which
    might be useful.

    My objection to specific atomic instructions is mainly that
    they are specific. If an operation later becomes a reasonable
    target for such an instruction, a new instruction must be
    allocated to provide that operation. That new instruction would
    only be available to new software.

    For other operations, I am not certain what semantics make
    sense. If a read at one address changes the behavior of another
    access, does "atomic" behavior mean that the later in program
    order access happens before the I/O agent changes the access
    behavior or does it mean that the atomic action blocks "ordinary
    software agents" but lets side effects caused by the action to
    occur in program order?

    Atomics ensure that the access is atomic with respect to
    all other accessors - ensuring that the other accessors
    will not see inconsistent data.

    I think I communicated poorly. I was thinking about what the
    appropriate behavior of an atomic add operation (however
    encoded) should be when targeting an address with side effects.
    The simple choice is "don't do that" (undefined behavior). The
    slightly more complex choice is fault on bad behavior.

    Yet one might argue that targeting such an address for an atomic
    operation could be useful in some particular context. Supporting
    such means making a choice of how the side effect is handled.

    (I am inclined to just having such fault, but that needs to be
    defined as it means that acquiring a lock, performing a read,
    operating on the read value, writing the result, and releasing
    the lock is not functionally equivalent to an atomic operation.)

    Is the read side effect ignored? For side effects limited to the
    accessed address, this would seem to be the same as the side
    effect happening "between" the read and the write. For side
    effects with external effects, those would also be suppressed,
    making such different than having the side effect occur
    "between" the read and the write.

    Is the side effect done "between" the read and the write of the
    "atomic" operation? This would presumably overwrite the address-
    local side effect while producing other side effects, which
    might seem very strange as the side effect would use the old
    value for any value-dependent side effects.

    Is the side effect performed after the atomic operation? This
    could also be confusing.

    Even if the side effect does not change the value at the
    address, the value before or after the atomic operation might be
    used to determine what the side effect is.

    Removing side effects places atomics in a special category,
    which may be reasonable but is not a choice 100% obvious to
    everyone. Consistently and sensibly ordering side effects with
    atomic seems challenging.

    Such side effects are like atomic operations, which leads to a
    conflict. If the non-side effect operation is truly atomic, one
    might break the definition of the side effect.

    I would guess that each device would choose its supported
    behavior, but that would seem to add unnecessary complexity.
    Just faulting on such use seems sensible, but then one needs
    to distinguish between addresses that fault and addresses that
    allow atomic operations.

    I just looked it up, Power (version 2.06B) as an example
    restricts Load Reserved to coherent memory: "The storage
    location specified by the Load And Reserve and Store Conditional
    instructions must be in storage that is Memory Coherence
    Required if the location may be modified by another processor or
    mechanism. If the specified location is in storage that is Write
    Through Required or Caching Inhibited, the system data storage
    error handler or the system alignment error handler is invoked
    for the Server environment and may be invoked for the Embedded
    environment." I therefore suspect that even if such was
    extended to support PCI-E atomics, addresses with side effects
    would fault.

    Atomics can be used as a basis (e.g. atomic test&set) to
    guard a critical section, but they're also useful for
    adjusting shared counters et alia.

    (There seem to be a lot of alia/other uses. Atomic OR seems like
    a useful means of supporting multiple "named" read locks; if
    implemented aggressively, atomic OR could even be used for
    bit-sized locks in combination with atomic AND.)
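
    The bit-sized lock idea can be sketched with fetch_or and fetch_and on
    one shared word. This is a try-lock only, and the names try_lock_bit
    and unlock_bit are hypothetical:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomic OR sets the bit and returns the old word; the caller owns the
// lock only if the bit was clear beforehand.
inline bool try_lock_bit(std::atomic<std::uint32_t>& word, unsigned bit) {
    assert(bit < 32);
    const std::uint32_t mask = std::uint32_t{1} << bit;
    return (word.fetch_or(mask, std::memory_order_acquire) & mask) == 0;
}

// Atomic AND with the inverted mask clears only this lock's bit.
inline void unlock_bit(std::atomic<std::uint32_t>& word, unsigned bit) {
    assert(bit < 32);
    const std::uint32_t mask = std::uint32_t{1} << bit;
    word.fetch_and(~mask, std::memory_order_release);
}
```

    All 32 locks live in one word, so every holder contends on the same
    cache line; how well that works depends on how aggressively the
    atomic OR is implemented.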

    My perception is that PCI-E atomics are not meant for
    non-idempotent storage. (I do not know how ARM atomic
    instructions handle such cases.)

    See above.

    The "above" statement was not clear to me. An I/O device's
    read side effect does not play nicely with the concept of
    atomic. One could define the atomic not to actually "read"
    the device register (no side effect), but I think one
    cannot just say the operation is atomic.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jun 2 01:27:53 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 5/27/26 10:25 AM, Scott Lurndal wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    [snip]
    In the case of a simple operation, as has been stated before,
    the LL/SC sequence can be converted to the equivalent of an
    atomic instruction.

    If true in the general case (and I'm not sure I see how it
    can be), why bother to add the hardware to do so when
    atomics are generally superior, scalable, simpler to implement and
    higher performance?

    A more generic interface has some advantages.

    I already mentioned that old software that was developed when
    there was not an atomic ["expensive" operation] instruction
    could benefit from idiom recognition on new hardware. (An
    alternative to that would be patching or recompiling the
    software. While I prefer a more abstract software distribution
    format for its ability to avoid having to move things to
    Architecture and even potentially perform microarchitectural
    optimizations at non-instruction granularity, such seems
    unlikely to be common any time soon.)

    Even with atomic instructions, the Architecture generally does
    not provide guarantees about scalability. I doubt any
    implementation would stop-the-world to perform an atomic
    operation (because the performance penalty would be quite
    noticeable), but I can easily imagine an implementation
    waiting until the atomic operation is not speculative before
    starting it.

    Understand that LOCK XADD [...] to MMI/O does exactly this !

    But note: XADD [...] never causes more than necessary bus traffic
    and as an atomic event, never fails, never needs retry, ...
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jun 2 01:38:51 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 5/27/26 5:08 PM, Chris M. Thomasson wrote:
    On 5/20/2026 4:47 PM, Paul Clayton wrote:
    On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
    CAS failures, I have tested this in the past, will hit the
    bus lock and still make forward progress... Sigh... A
    horrible LL/SC thing can live lock!

    LL/SC live lock is implementation dependent. One could
    Architecturally guarantee forward progress for the kind of cases
    where CAS would be an alternative.

    In my opinion, this is not so much a CAS vs. LL/SC issue as a
    quality of implementation issue.

    Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
    guarantees. Using LL/SC to emulate them is a different story.

    Academic LL/SC: I can agree with this statement. But neither ASF nor
    ESM has problems making stronger guarantees--and I did this over
    {7 ASF, 8 ESM} cache lines not 1 single memory location. These also
    impose limitations on instruction order and SW has to understand
    several non-von Neumann properties of the ATOMIC event.

    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    That the standard academic stuff cannot does not mean it absolutely
    cannot be done.

    IBM's constrained
    transactions guaranteed success of a transaction if it met
    certain criteria. A single-instruction LL/SC body could be
    Architecturally guaranteed to perform not only successfully but
    with some performance characteristics.

    A guarantee of forward progress is not very useful if the
    progress is glacially (or cosmologically) slow. ("We guarantee
    that the operation will complete before the heat death of the
    universe"☺)

    A _guarantee_ of forward progress is ALWAYS important? Sorry for
    shouting. Shit. Knowing the size of the reservation granule is
    hyper important to help the software pad and align to remove any
    false sharing on said granule. No? But...

    I disagree. A guarantee that has a time scale beyond human
    civilization much less the lifetime of the hardware seems to
    have extremely little use. It may be reasonable to assume
    reasonable timescales for such guarantees, but a simple
    guarantee of eventual completion (if the system is kept
    operating) might be given if the profit motive seems sufficient.

    (I am not certain if even x86 XLOCK operations are absolutely
    guaranteed to complete in the presence of context switches. A
    hardware thread might always be interrupted while it is
    performing the operation and if the hardware does not delay
    interrupt handling until after the operation completes, then the
    operation may never complete. This may be so extraordinarily
    improbable that an undetected error in ECC-protected memory
    might be more likely, in which case it is not really important.)

    I think one really wants the time scale explicitly declared as
    well as information about the range of latency and causes. Even
    5ms latency can seem like forever.

    Here's where the deeper problem can rear its ugly head... Vendors
    often don't document it? Or they document it inconsistently
    across revisions? So even if you do everything right in
    principle, you're tuning against a number you had to dig out of
    a forum post or reverse engineer yourself. Scary! ;^o

    Ugh!

    Architecting a lot of such factors might help with documentation
    as Architecture is more stable than microarchitecture, but I do
    not think typical companies have the incentives for excellence
    in documentation. If the only consequence of mistakes in
    Architectural documentation is a few software developers
    grumbling, keeping even such stable documentation consistent and
    correct (and abiding by the old/existing Architectural contract)
    seems unlikely to seem important. In fact, if the inability to
    optimize forces people to buy more (or more expensive) hardware,
    poor documentation can mean higher profits.

    It took me more than 35 years to learn how to write µArchitecture
    documents such that a malevolent engineer could not misunderstand
    what was written and specified. Try it, it is not easy. It is not
    something that can be taught, but it is something that diligence
    and perseverance can deliver.

    Of course, the temptation toward "good enough" (not so bad
    that one will lose too many customers) is a problem. I would
    expect documented guarantees of sufficient generality to have the
    cognitive load for software developers be acceptable. That
    such guarantees seem to be very rare is sad.

    How many SC failures on a fetch-and-add are acceptable before
    you conclude something's fundamentally broken? For me the answer
    is: very few.

    How many SC failures are acceptable if there are 1024 cores all
    going after the same lock ??
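
    To put a number on "how many SC failures", one can instrument the
    retry loop. The sketch below builds fetch-and-add from
    compare_exchange_weak, which the C++ standard allows to fail
    spuriously (on LL/SC machines that is typically a lost reservation),
    and counts the failed attempts. FetchAddResult and fetch_add_counting
    are hypothetical names, not anything from the thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct FetchAddResult {
    std::uint64_t previous;      // value observed just before our add
    unsigned failed_attempts;    // number of failed CAS (or SC) attempts
};

// Fetch-and-add written as an optimistic retry loop so that failures
// become visible to software.
inline FetchAddResult fetch_add_counting(std::atomic<std::uint64_t>& target,
                                         std::uint64_t delta) {
    unsigned failures = 0;
    std::uint64_t expected = target.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads `expected` with the
    // current value, so each retry works from fresh data.
    while (!target.compare_exchange_weak(expected, expected + delta,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        ++failures;
    }
    return {expected, failures};
}
```

    With no contention the failure count should stay at or near zero;
    how it grows with the number of cores hammering one location is the
    quality-of-implementation question being argued here.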

    Again, I think this is more about "quality of
    implementation" (and Architectural guarantees about such) than
    about the interface at an instruction level.
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Tue Jun 2 14:42:12 2026
    From Newsgroup: comp.arch

    On 6/1/26 9:27 PM, MitchAlsup wrote:
    [snip]
    But note: XADD [...] never causes more than necessary bus traffic

    I am skeptical that this is Architecturally guaranteed. It may
    fall out of any even semi-sane implementation, in which case
    programmers might be willing to take it as guaranteed. Yet I
    suspect "sanity" may not be reliable with changing tradeoffs
    (including whether protecting a company's reputation has value).

    and as an atomic event, never fails, never needs retry, ...

    I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
    etc.) could provide such guarantees, even extending to multiple
    contiguous instructions operating on data within an aligned
    64-byte region.

    Interestingly, it seems that IBM's z17 is the last
    implementation to support constrained transactions. I do wonder
    why this feature has been removed from the Architecture.

    Constrained transactions had these restrictions (from https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
    | - The transaction executes no more than 32 instructions.
    | - All instructions within the transaction must be within 256
    | contiguous bytes of storage.
    | - The only branches you may use are relative branches that
    | branch forward (so there can be no loops).
    | - All SS and SSE-format instructions may not be used.
    | - Additional general instructions may not be used.
    | - The transaction's storage operands may not access more than
    | four octowords.
    | - The transaction may not access storage operands in any 4K-byte
    | blocks that contain the 256 bytes of storage beginning
    | with the TBEGINC instruction.
    | - Operand references must be within a single doubleword,
    | except for some of the "multiple" instructions for which the
    | limitation is a single octoword.

    I think I read that the first implementation made an optimistic
    attempt and later (I do not remember if multiple optimistic
    attempts were made) a hardware lock was used. Perhaps four
    addresses cause too much of a slowdown when there is conflict???

    I believe that guaranteeing completion would be substantially
    easier with only one aligned 64-byte region. (As I think I
    wrote before, adding a single "word" exportable atomic operation
    in a different "cache block" _might_ be practical to implement
    though I did not have an idea for how software would express such.
    I may be wrong that appending such an exportable operation would
    not make ensuring completion significantly more difficult.)

    I think such guaranteed atomic sequences would require a
    distinct instruction not only to allow what IBM did (making such
    an illegal/faulting instruction) but also to fault when the
    instruction is misused since no fallback path is provided.

    There also seem to be other operations that would not (I think)
    be exceptionally difficult to guarantee. E.g., swapping cache
    blocks might not be much more difficult to guarantee than quick
    operations within a single cache block, though I do not know
    how useful such an unconditional swap would be. Atomic cache
    block copy would seem to be easier (it is similar to a block
    zeroing instruction except that the value is taken from a block
    that is not writeable by other agents being in exclusive or
    shared state). Guaranteeing atomicity for a copy into a cache
    block (where two contiguous cache blocks might be in the read
    set and the write is only to part of a cache block) seems a
    little more complicated.

    With conventional cache coherence, partial writes seem likely to
    be complex. If masked cache block updates were possible as an
    exportable atomic operation, it might be practical to lock (NAK-
    guard) a limited read set and push the update to the owner. I do
    not know if such an update independent of previous values in the
    written cache block would be useful.

    I am certainly not comfortable thinking about the visibility/
    ordering constraints, so my guesses may well be very wrong about what is
    practical to guarantee as atomic.

    Even if an operation can practically be guaranteed, it may not
    be worthwhile to provide an interface that allows requesting
    such a guaranteed atomic operation.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jun 2 19:36:06 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 6/1/26 9:27 PM, MitchAlsup wrote:
    [snip]
    But note: XADD [...] never causes more than necessary bus traffic

    I am skeptical that this is Architecturally guaranteed. It may
    fall out of any even semi-sane implementation, in which case
    programmers might be willing to take it as guaranteed. Yet I
    suspect "sanity" may not be reliable with changing tradeoffs
    (including whether protecting a company's reputation has value).

    The core is going to package this instruction up and ship it
    across the interconnect as a fire-and-forget transaction.

    The interconnect is going to route the package towards either a
    cache having write permission or a control register.

    The cache or control register will perform the packaged calculation
    and optionally send back the previous value.

    The core receives the optional previous value and the memory-atomic
    is complete: 2 interconnect messages, both smaller than a cache line,
    no cache lines are moved, and the calculation cannot fail. The only
    failure mode is if the interconnect message fails ECC check in either
    direction.
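
    Seen from software, the "optionally send back the previous value"
    part corresponds to whether the result of a fetch-and-add is used.
    A minimal sketch follows (bump and take_ticket are made-up names).
    How a compiler lowers these calls, and what traffic they cause on the
    interconnect, depends on the target; the C++ standard promises
    atomicity only and says nothing about messages or cache-line movement.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Result discarded: the caller never needs the previous value, so an
// implementation is free to treat this as a one-way update.
inline void bump(std::atomic<std::uint64_t>& counter) {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// Result used: the previous value has to travel back to the caller.
inline std::uint64_t take_ticket(std::atomic<std::uint64_t>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed);
}
```

    Neither call has a failure path or a retry loop at the source level,
    which is the property being claimed for XADD-style operations.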

    and as an atomic event, never fails, never needs retry, ...

    I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
    etc.) could provide such guarantees,

    If so, you will be surprised when you implement one.

    even extending to multiple
    contiguous instructions operating on data within an aligned
    64-byte region.

    Where it becomes cubically harder.

    Interestingly, it seems that IBM's z17 is the last
    implementation to support constrained transactions. I do wonder
    why this feature has been removed from the Architecture.

    SW TM wants the TM model to support an unbounded number of memory
    elements in the single transaction. HW does not do unbounded.

    Constrained transactions had these restrictions (from https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
    | - The transaction executes no more than 32 instructions.
    I used a timer--to the same ends.
    | - All instructions within the transaction must be within 256
    | contiguous bytes of storage.
    I allow calls to subroutines in the event.
    | - The only branches you may use are relative branches that
    | branch forward (so there can be no loops).
    Loops are OK as long as the timer does not go off.
    | - All SS and SSE-format instructions may not be used.
    Agreed.
    | - Additional general instructions may not be used.
    I see no reason to limit general calculations and memory access.
    | - The transaction's storage operands may not access more than
    | four octowords.
    8 cache lines participate; an unbounded number of cache lines
    can be accessed as long as the number of participants is no larger
    than 8.
    | - The transaction may not access storage operands in any 4K-byte
    | blocks that contain the 256 bytes of storage beginning
    | with the TBEGINC instruction.
    Interesting.
    | - Operand references must be within a single doubleword,
    | except for some of the "multiple" instructions for which the
    | limitation is a single octoword.
    Any normal memory references to the participating lines.

    I think I read that the first implementation made an optimistic
    attempt and later (I do not remember if multiple optimistic
    attempts were made) a hardware lock was used. Perhaps four
    addresses cause too much of a slowdown when there is conflict???

    I believe that guaranteeing completion would be substantially
    easier with only one aligned 64-byte region. (As I think I
    wrote before, adding a single "word" exportable atomic operation
    in a different "cache block" _might_ be practical to implement
    though I did not have an idea for how software would express such.
    I may be wrong that appending such an exportable operation would
    not make ensuring completion significantly more difficult.)

    If you take the necessary 6 months to slug through all issues
    you can find solutions for the disjoint participants to be at
    least as large as the outstanding Miss Buffer size (or MB-1).

    I think such guaranteed atomic sequences would require a
    distinct instruction not only to allow what IBM did (making such
    an illegal/faulting instruction) but also to fault when the
    instruction is misused since no fallback path is provided.

    If you do it right, your architecture sets up failure paths,
    so that if failure happens, IP reverts to the failure point
    without executing a branch instruction. I have an instruction
    that samples 'interference' and changes the failure point as
    a necessary addition. Any interrupt or exception transfers
    control to failure point before performing exception control
    transfer.

    There also seem to be other operations that would not (I think)
    be exceptionally difficult to guarantee. E.g., swapping cache
    blocks might not be much more difficult to guarantee than quick
    operations within a single cache block, though I do not know
    how useful such an unconditional swap would be. Atomic cache
    block copy would seem to be easier (it is similar to a block
    zeroing instruction except that the value is taken from a block
    that is not writeable by other agents being in exclusive or
    shared state). Guaranteeing atomicity for a copy into a cache
    block (where two contiguous cache blocks might be in the read
    set and the write is only to part of a cache block) seems a
    little more complicated.

    The thing that makes this so difficult is that most µArchitectures
    cannot guarantee that 2 cache lines are ever simultaneously present
    in the cache. ASF and ESM have means to do this which greatly
    strengthens the guarantee of forward progress.

    My 66000 includes priority in memory transactions, and this enables
    the cache with write permission to determine to allow the request
    or to fail the request (request is at equal or lower priority) thus
    allowing the higher priority ATOMIC event to make forward progress
    at the expense of the lower priority event.

    At certain times the core may be in a position where it can finish
    an event if the cache lines can be guaranteed. During this period,
    a core can NaK a request so that the event is guaranteed to finish.

    With conventional cache coherence, partial writes seem likely to
    be complex. If masked cache block updates were possible as an
    exportable atomic operation, it might be practical to lock (NAK-
    guard) a limited read set and push the update to the owner. I do
    not know if such an update independent of previous values in the
    written cache block would be useful.

    It is much worse than that in practice. The interconnect protocol and
    the cache coherence model HAVE to HAVE ATOMIC event forward progress
    fully integrated. MESI and MOESI are insufficient here; most directory
    coherence protocols are also insufficient.

    I am certainly not comfortable thinking about the visibility/
    ordering constraints, so my guesses may well be very wrong about what is
    practical to guarantee as atomic.

    See Lamport...

    Even if an operation can practically be guaranteed, it may not
    be worthwhile to provide an interface that allows requesting
    such a guaranteed atomic operation.

    ...
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Jun 2 13:52:39 2026
    From Newsgroup: comp.arch

    On 6/1/2026 6:38 PM, MitchAlsup wrote:

    Paul Clayton <paaronclayton@gmail.com> posted:

    On 5/27/26 5:08 PM, Chris M. Thomasson wrote:
    On 5/20/2026 4:47 PM, Paul Clayton wrote:
    On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
    CAS failures, I have tested this in the past, will hit the
    bus lock and still make forward progress... Sigh... A
    horrible LL/SC thing can live lock!

    LL/SC live lock is implementation dependent. One could
    Architecturally guarantee forward progress for the kind of cases
    where CAS would be an alternative.

    In my opinion, this is not so much a CAS vs. LL/SC issue as a
    quality of implementation issue.

    Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
    guarantees. Using LL/SC to emulate them is a different story.

    Academic LL/SC: I can agree with this statement. But neither ASF nor
    ESM has problems making stronger guarantees--and I did this over
    {7 ASF, 8 ESM} cache lines, not 1 single memory location. These also
    impose limitations on instruction order, and SW has to understand
    several non-von-Neumann properties of the ATOMIC event.

    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    That standard academic stuff cannot, does not mean it absolutely
    cannot be done.

    IBM's constrained
    transactions guaranteed success of a transaction if it met
    certain criteria. A single-instruction LL/SC body could be
    Architecturally guaranteed to perform not only successfully but
    with some performance characteristics.

    A guarantee of forward progress is not very useful if the
    progress is glacially (or cosmologically) slow. ("We guarantee
    that the operation will complete before the heat death of the
    universe.")

    A _guarantee_ of forward progress is ALWAYS important? Sorry for
    shouting. Shit. Knowing the size of the reservation granule is
    hyper important to help the software pad and align to remove any
    false sharing on said granule. No? But...

    I disagree. A guarantee that has a time scale beyond human
    civilization much less the lifetime of the hardware seems to
    have extremely little use. It may be reasonable to assume
    reasonable timescales for such guarantees, but a simple
    guarantee of eventual completion (if the system is kept
    operating) might be given if the profit motive seems sufficient.

    (I am not certain if even x86 XLOCK operations are absolutely
    guaranteed to complete in the presence of context switches. A
    hardware thread might always be interrupted while it is
    performing the operation and if the hardware does not delay
    interrupt handling until after the operation completes, then the
    operation may never complete. This may be so extraordinarily
    improbable that an undetected error in ECC-protected memory
    might be more likely, in which case it is not really important.)

    I think one really wants the time scale explicitly declared as
    well as information about the range of latency and causes. Even
    5ms latency can seem like forever.

    Here's where the deeper problem can rear its ugly head... Vendors
    often don't document it? Or they document it inconsistently
    across revisions? So even if you do everything right in
    principle, you're tuning against a number you had to dig out of
    a forum post or reverse engineer yourself. Scary! ;^o

    Ugh!

    Architecting a lot of such factors might help with documentation
    as Architecture is more stable than microarchitecture, but I do
    not think typical companies have the incentives for excellence
    in documentation. If the only consequence of mistakes in
    Architectural documentation is a few software developers
    grumbling, keeping even such stable documentation consistent and
    correct (and abiding by the old/existing Architectural contract)
    seems unlikely to seem important. In fact, if the inability to
    optimize forces people to buy more (or more expensive) hardware,
    poor documentation can mean higher profits.

    It took me more than 35 years to learn how to write µArchitecture
    documents such that a malevolent engineer could not misunderstand
    what was written and specified. Try it, it is not easy. It is not
    something that can be taught, but it is something that diligence
    and perseverance can deliver.

    Of course, the temptation toward "good enough" (not so bad
    that one will lose too many customers) is a problem. I would
    expect documented guarantees of sufficient generality to have the
    cognitive load for software developers be acceptable. That
    such guarantees seem to be very rare is sad.

    How many SC failures on a fetch-and-add are acceptable before
    you conclude something's fundamentally broken? For me the answer
    is: very few.

    How many SC failures are acceptable if there are 1024 cores all
    going after the same lock ??

    Again, I think this is more about "quality of
    implementation" (and Architectural guarantees about such) than
    about the interface at an instruction level.

    Simple... Do NOT allow 1024 cores to hammer a single location!

  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Jun 2 14:20:44 2026
    From Newsgroup: comp.arch

    On 6/2/2026 2:15 PM, Chris M. Thomasson wrote:
    On 6/2/2026 12:36 PM, MitchAlsup wrote:

    Paul Clayton <paaronclayton@gmail.com> posted:

    On 6/1/26 9:27 PM, MitchAlsup wrote:
    [snip]
    But note: XADD [...] never causes more than necessary bus traffic

    I am skeptical that this is Architecturally guaranteed. It may
    fall out of any even semi-sane implementation, in which case
    programmers might be willing to take it as guaranteed. Yet I
    suspect "sanity" may not be reliable with changing tradeoffs
    (including whether protecting a company's reputation has value).

    The core is going to package this instruction up and ship it
    across the interconnect as a fire-and-forget transaction.

    The interconnect is going to route the package towards either a
    cache having write permission or a control register.

    The cache or control register will perform the packaged calculation
    and optionally send back the previous value.

    The core receives the optional previous value and the memory-atomic
    is complete:: 2 interconnect messages, both smaller than a cache line,
    not cache lines are moved, and the calculation cannot fail. The only
    failure mode is if the interconnect message fails ECC check in either
    directions.
    and as an atomic event, never fails, never needs retry, ...

    I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
    etc.) could provide such guarantees,

    If so, you will be surprised when you implement one.

    even extending to multiple
    contiguous instructions operating on data within an aligned
    64-byte region.

    Where it becomes cubically harder.
    Interestingly, it seems that IBM's z17 is the last
    implementation to support constrained transactions. I do wonder
    why this feature has been removed from the Architecture.

    SW TM wants the TM model to support an unbounded number of memory
    elements in the single transaction. HW does not do unbounded.

    Constrained transactions had these restrictions (from
    https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
    | - The transaction executes no more than 32 instructions.
    I used a timer--to the same ends.
    | - All instructions within the transaction must be within 256
    |   contiguous bytes of storage.
    I allow calls to subroutines in the event.
    | - The only branches you may use are relative branches that
    |   branch forward (so there can be no loops).
    Loops are OK as long as the timer does not go off.
    | - All SS and SSE-format instructions may not be used.
    Agreed.
    | - Additional general instructions may not be used.
    I see no reason to limit general calculations and memory access.
    | - The transaction's storage operands may not access more than
    |   four octowords.
    8 cache lines participate; an unbounded number of cache lines
    can be accessed as long as the number of participants is no larger
    than 8.
    | - The transaction may not access storage operands in any 4K-byte
    |   blocks that contain the 256 bytes of storage beginning
    |   with the TBEGINC instruction.
    Interesting.
    | - Operand references must be within a single doubleword,
    |   except for some of the "multiple" instructions for which the
    |   limitation is a single octoword.
    Any normal memory references to the participating lines.

    I think I read that the first implementation made an optimistic
    attempt and later (I do not remember if multiple optimistic
    attempts were made) a hardware lock was used. Perhaps four
    addresses cause too much of a slowdown when there is conflict???

    I believe that guaranteeing completion would be substantially
    easier with only one aligned 64-byte region. (As I think I
    wrote before, adding a single "word" exportable atomic operation
    in a different "cache block" _might_ be practical to implement
    though I did not have an idea for how software would express such.
    I may be wrong that appending such an exportable operation would
    not make ensuring completion significantly more difficult.)

    If you take the necessary 6 months to slug through all issues
    you can find solutions for the disjoint participants to be at
    least as large as the outstanding Miss Buffer size (or MB-1).
    I think such guaranteed atomic sequences would require a
    distinct instruction not only to allow what IBM did (making such
    an illegal/faulting instruction) but also to fault when the
    instruction is misused since no fallback path is provided.

    If you do it right, your architecture sets up failure paths,
    so that if failure happens, IP reverts to the failure point
    without executing a branch instruction. I have an instruction
    that samples 'interference' and changes the failure point as
    a necessary addition. Any interrupt or exception transfers
    control to failure point before performing exception control
    transfer.
    There also seem to be other operations that would not (I think)
    be exceptionally difficult to guarantee. E.g., swapping cache
    blocks might not be much more difficult to guarantee than quick
    operations within a single cache block, though I do not know
    how useful such an unconditional swap would be. Atomic cache
    block copy would seem to be easier (it is similar to a block
    zeroing instruction except that the value is taken from a block
    that is not writeable by other agents being in exclusive or
    shared state). Guaranteeing atomicity for a copy into a cache
    block (where two contiguous cache blocks might be in the read
    set and the write is only to part of a cache block) seems a
    little more complicated.

    The thing that makes this so difficult is that most µArchitectures
    cannot guarantee that 2 cache lines are ever simultaneously present
    in the cache. ASF and ESM have means to do this which greatly
    strengthens the guarantee of forward progress.

    My 66000 includes priority in memory transactions, and this enables
    the cache with write permission to determine to allow the request
    or to fail the request (request is at equal or lower priority) thus
    allowing the higher priority ATOMIC event to make forward progress
    at the expense of the lower priority event.

    At certain times the core may be in a position where it can finish
    an event if the cache lines can be guaranteed. During this period,
    a core can NaK a request so that the event is guaranteed to finish.
    With conventional cache coherence, partial writes seem likely to
    be complex. If masked cache block updates were possible as an
    exportable atomic operation, it might be practical to lock (NAK-
    guard) a limited read set and push the update to the owner. I do
    not know if such an update independent of previous values in the
    written cache block would be useful.

    It is much worse than that in practice. The interconnect protocol and
    the cache coherence model HAVE to HAVE ATOMIC event forward progress
    fully integrated. MESI and MOESI are insufficient here; most directory
    coherence protocols are also insufficient.
    I am certainly not comfortable thinking about the visibility/
    ordering constraints, so my guesses may well be very wrong about what is
    practical to guarantee as atomic.

    See Lamport...
    Even if an operation can practically be guaranteed, it may not
    be worthwhile to provide an interface that allows requesting
    such a guaranteed atomic operation.

    ...

    Well, we can do something... we know that lock cmpxchg8b on a 32 bit
    system can handle an operand that straddles two adjacent cache lines.
    So, we can try to hold more than that, but! it's not ideal. For
    instance my multex can do it and emulate it. Read all:
    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
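
    For readers who do not want to chase the link, here is a minimal
    sketch of the general hashed-lock idea: the address of each target
    word indexes into a fixed array of mutexes, and a multi-word
    operation takes the locks it needs in index order so two callers
    cannot deadlock. This is an illustration written under assumptions,
    not the multex code from the linked post; the table size, the hash
    and the names (kLockCount, lock_index, dcas) are all hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <utility>

namespace sketch {

inline constexpr std::size_t kLockCount = 64;        // hypothetical table size
inline std::array<std::mutex, kLockCount> g_locks;   // the hashed lock table

// Map an address to a lock slot. The low 3 bits are dropped so that
// neighboring 8-byte words tend to land in different slots.
inline std::size_t lock_index(const void* p) {
    auto bits = reinterpret_cast<std::uintptr_t>(p);
    return (bits >> 3) % kLockCount;
}

// Emulated double-word compare-and-swap over two unrelated addresses.
// It is atomic only with respect to other accesses that go through the
// same lock table; plain loads and stores bypass it.
inline bool dcas(std::uint64_t* a, std::uint64_t* b,
                 std::uint64_t expect_a, std::uint64_t expect_b,
                 std::uint64_t new_a, std::uint64_t new_b) {
    std::size_t i = lock_index(a);
    std::size_t j = lock_index(b);
    if (i > j) std::swap(i, j);              // fixed order avoids deadlock
    std::unique_lock<std::mutex> first(g_locks[i]);
    std::unique_lock<std::mutex> second;     // stays empty on a hash collision
    if (j != i) second = std::unique_lock<std::mutex>(g_locks[j]);
    if (*a != expect_a || *b != expect_b) return false;
    *a = new_a;
    *b = new_b;
    return true;
}

}  // namespace sketch
```

    The same ordering trick extends to more than two words by sorting
    and de-duplicating the lock indices first. It always makes progress,
    but it is a slow path compared with a native LOCK RMW.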


    I think that is why AMD allowed for LOCK RMW along with LL/SC?!
  • From Andy Valencia@vandys@vsta.org to comp.arch on Tue Jun 2 17:11:11 2026
    From Newsgroup: comp.arch

    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    Now, there was no thought of hundreds (or thousands) of CPU's. But
    some of the pessimistic assumptions you might make of LL/SC (at least
    as available in MIPS CPU's of that era) might need to be
    revisited. Our best analysis said it would scale to very large
    (for that time) database workloads.

    Finances and other management things cancelled the program. Sequent
    eventually went with their NUMA, ultimately being acquired by IBM. We
    never found out how that system would've done in the real world.

    I seem to remember its code name was "Model R" (RISC).

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    No AI was used in the composition of this message
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Jun 3 18:19:28 2026
    From Newsgroup: comp.arch

    Paul Clayton <paaronclayton@gmail.com> writes:
    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized.

    Let's see:

    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
    1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
    (fetch-and-add) costs the following numbers of cycles (including
    overhead):

      !@   +!@
     7.5   7.3  not atomic
    14.2  13.2  atomic

    On a Xeon E-2388G (Rocket Lake):

      !@   +!@
     8.5   7.1  not atomic
    25.8  26.6  atomic

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Jun 3 12:57:42 2026
    From Newsgroup: comp.arch

    On 6/3/2026 11:19 AM, Anton Ertl wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized.

    Let's see:

    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
    1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
    (fetch-and-add) costs the following numbers of cycles (including
    overhead):

    !@ +!@
    7.5 7.3 not atomic
    14.2 13.2 atomic

    On a Xeon E-2388G (Rocket Lake):

    !@ +!@
    8.5 7.1 not atomic
    25.8 26.6 atomic

    Hammering a single location is going to be bad for LL/SC or LOCK RMW,
    regardless of the ins and outs of LL/SC vs LOCK RMW. It's up to the
    programmer to make sure that is amortized, distributed in clever ways.
    For instance, why use a single atomic counter, vs say using a per-thread
    counter and summing them when we need to observe the actual count?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Jun 3 20:53:49 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/3/2026 11:19 AM, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
    1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
    (fetch-and-add) costs the following numbers of cycles (including
    overhead):

    !@ +!@
    7.5 7.3 not atomic
    14.2 13.2 atomic

    On a Xeon E-2388G (Rocket Lake):

    !@ +!@
    8.5 7.1 not atomic
    25.8 26.6 atomic

    Hammering a single location is going to be bad for LL/SC or LOCK RMW,
    regardless of the ins and outs of LL/SC vs LOCK RMW.

    It's two locations in these benchmarks: X and Y.

    It's up to the
    programmer to make sure that is amortized, distributed in clever ways.
    For instance, why use a single atomic counter, vs say using a per thread
    counter and summing them when we need to observe the actual count?

    These benchmarks use per-thread storage: They are single-threaded.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Jun 3 15:15:53 2026
    From Newsgroup: comp.arch

    On 6/3/2026 1:53 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/3/2026 11:19 AM, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
    1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
    (fetch-and-add) costs the following numbers of cycles (including
    overhead):

    !@ +!@
    7.5 7.3 not atomic
    14.2 13.2 atomic

    On a Xeon E-2388G (Rocket Lake):

    !@ +!@
    8.5 7.1 not atomic
    25.8 26.6 atomic

    Hammering a single location is going to be bad for LL/SC or LOCK RMW,
    regardless of the ins and outs of LL/SC vs LOCK RMW.

    It's two locations in these benchmarks: X and Y.

    It's up to the
    programmer to make sure that is amortized, distributed in clever ways.
    For instance, why use a single atomic counter, vs say using a per thread
    counter and summing them when we need to observe the actual count?

    These benchmarks use per-thread storage: They are single-threaded.

    Humm... I missed that. Anyway, you need to test them multi threaded...
    Say our counters are per thread so an increment adds to its per-thread
    counter instead of using a LOCK RMW. Then when the counter needs to be
    sampled we can start summing up the per thread counts...
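
    A minimal C sketch of the per-thread counter idea described above; it
    is not code from this thread. It assumes C11 atomics, a 64-byte cache
    line and a fixed thread count. The names counter_slot, counter_inc,
    counter_sum, MAX_THREADS and CACHE_LINE are hypothetical. Each thread
    writes only its own slot, so the increment is a plain relaxed load and
    store rather than a LOCK RMW; a reader sums the slots.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE  64  /* assumed cache line size in bytes */
#define MAX_THREADS 8   /* hypothetical fixed number of threads */

/* One slot per thread; alignas gives each slot its own cache line. */
struct counter_slot {
    alignas(CACHE_LINE) atomic_ulong count;
};

static struct counter_slot g_slots[MAX_THREADS];

/* Called only by the thread that owns slot `tid`.
   Relaxed load + store: no atomic read-modify-write is needed
   because no other thread ever writes this slot. */
static void counter_inc(size_t tid)
{
    assert(tid < MAX_THREADS);
    unsigned long v =
        atomic_load_explicit(&g_slots[tid].count, memory_order_relaxed);
    atomic_store_explicit(&g_slots[tid].count, v + 1, memory_order_relaxed);
}

/* Sample the counter by summing all slots. While writers are running
   the result is a snapshot that may lag some increments. */
static unsigned long counter_sum(void)
{
    unsigned long total = 0;
    for (size_t i = 0; i < MAX_THREADS; ++i)
        total += atomic_load_explicit(&g_slots[i].count,
                                      memory_order_relaxed);
    return total;
}
```

    The trade-off in this sketch is that counter_sum() touches every slot
    and is not a linearizable read, which is usually acceptable for
    statistics-style counters.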

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Jun 4 14:21:16 2026
    From Newsgroup: comp.arch

    Andy Valencia <vandys@vsta.org> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
    MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
    SPP. After evaluation, we chose Pentium Pro to build the system
    (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were concerned
    at the time about the scalability of LL/SC. SPARC never made it out
    of the first evaluation round.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Jun 4 10:23:36 2026
    From Newsgroup: comp.arch

    On 2026-Jun-03 14:19, Anton Ertl wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized.

    Let's see:

    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
    1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@ (fetch-and-add) costs the following numbers of cycles (including
    overhead):

    !@ +!@
    7.5 7.3 not atomic
    14.2 13.2 atomic

    On a Xeon E-2388G (Rocket Lake):

    !@ +!@
    8.5 7.1 not atomic
    25.8 26.6 atomic

    - anton

    On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
    whether it is specified or not. In your example both are atomic.
    CMPXCHG does not do this - to be atomic it must have a LOCK prefix.



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Jun 4 10:25:06 2026
    From Newsgroup: comp.arch

    On 2026-Jun-03 16:53, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/3/2026 11:19 AM, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
    1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
    (fetch-and-add) costs the following numbers of cycles (including
    overhead):

    !@ +!@
    7.5 7.3 not atomic
    14.2 13.2 atomic

    On a Xeon E-2388G (Rocket Lake):

    !@ +!@
    8.5 7.1 not atomic
    25.8 26.6 atomic

    Hammering a single location is going to be bad for LL/SC or LOCK RMW,
    regardless of the ins and outs of LL/SC vs LOCK RMW.

    It's two locations in these benchmarks: X and Y.

    It's up to the
    programmer to make sure that is amortized, distributed in clever ways.
    For instance, why use a single atomic counter, vs say using a per thread
    counter and summing them when we need to observe the actual count?

    These benchmarks use per-thread storage: They are single-threaded.

    - anton

    They might be allocated in the same cache line.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Jun 4 21:04:28 2026
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    On 2026-Jun-03 14:19, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
    ...
    On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
    whether it is specified or not. In your example both are atomic.

    The code for "x !@" is:

    mov 0x8(%rbx),%r15
    mov %r13,%rax
    mov (%r15),%r13
    mov %rax,(%r15)

    while the code for "x atomic!@" is:

    mov %r13,(%r10)
    sub $0x8,%r10
    mov 0x8(%rbx),%r13
    mov 0x8(%r10),%rax
    add $0x8,%r10
    xchg %rax,0x0(%r13)
    mov %rax,%r13

    As you can see, there is no XCHG in the !@ code.
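
    For readers who do not read Forth, a rough C analogue of the two
    variants, as a sketch only. exchange_plain and exchange_atomic are
    hypothetical names, and __atomic_exchange_n is the GCC/Clang builtin.
    The plain version is a separate load and store, like the MOV sequence
    above; the builtin is the kind of operation for which x86-64 compilers
    typically emit XCHG with a memory operand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Non-atomic exchange: one load followed by one store.
   Another thread could modify *addr between the two. */
static int64_t exchange_plain(int64_t *addr, int64_t newval)
{
    assert(addr != NULL);
    int64_t old = *addr;
    *addr = newval;
    return old;
}

/* Atomic exchange via the GCC/Clang builtin. The load and the store
   happen as one indivisible read-modify-write. */
static int64_t exchange_atomic(int64_t *addr, int64_t newval)
{
    assert(addr != NULL);
    return __atomic_exchange_n(addr, newval, __ATOMIC_SEQ_CST);
}
```

    In a single thread both functions return the same values; the
    difference is only in atomicity and ordering, which is what the
    benchmark is pricing.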

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Jun 4 18:28:43 2026
    From Newsgroup: comp.arch

    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    Andy Valencia <vandys@vsta.org> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
    SPP. After evaluation, we chose Pentium Pro to build the system
    (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were concerned
    at the time about the scalability of LL/SC. SPARC never made it out
    of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was pretty
    fast for certain worksets and algorithms. RMO mode.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Jun 4 18:33:41 2026
    From Newsgroup: comp.arch

    On 6/4/2026 2:04 PM, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    On 2026-Jun-03 14:19, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
    ...
    On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
    whether it is specified or not. In your example both are atomic.

    The code for "x !@" is:

    mov 0x8(%rbx),%r15
    mov %r13,%rax
    mov (%r15),%r13
    mov %rax,(%r15)

    while the code for "x atomic!@" is:

    mov %r13,(%r10)
    sub $0x8,%r10
    mov 0x8(%rbx),%r13
    mov 0x8(%r10),%rax
    add $0x8,%r10
    xchg %rax,0x0(%r13)
    mov %rax,%r13

    As you can see, there is no XCHG in the !@ code.

    How is your data organized? Show me the structure?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Jun 4 21:20:20 2026
    From Newsgroup: comp.arch

    On 6/4/2026 2:04 PM, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    On 2026-Jun-03 14:19, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
    ...
    On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
    whether it is specified or not. In your example both are atomic.

    The code for "x !@" is:

    mov 0x8(%rbx),%r15
    mov %r13,%rax
    mov (%r15),%r13
    mov %rax,(%r15)

    while the code for "x atomic!@" is:

    mov %r13,(%r10)
    sub $0x8,%r10
    mov 0x8(%rbx),%r13
    mov 0x8(%r10),%rax
    add $0x8,%r10
    xchg %rax,0x0(%r13)
    mov %rax,%r13

    As you can see, there is no XCHG in the !@ code.

    XCHG does have the implied LOCK as EricP mentioned.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Jun 4 22:56:47 2026
    From Newsgroup: comp.arch

    On 6/4/2026 6:33 PM, Chris M. Thomasson wrote:
    On 6/4/2026 2:04 PM, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    On 2026-Jun-03 14:19, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
    ...
    On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
    whether it is specified or not. In your example both are atomic.

    The code for "x !@" is:

    mov 0x8(%rbx),%r15
    mov %r13,%rax
    mov (%r15),%r13
    mov %rax,(%r15)

    while the code for "x atomic!@" is:

    mov %r13,(%r10)
    sub $0x8,%r10
    mov 0x8(%rbx),%r13
    mov 0x8(%r10),%rax
    add $0x8,%r10
    xchg %rax,0x0(%r13)
    mov %rax,%r13

    As you can see, there is no XCHG in the !@ code.

    How is your data organized? Show me the structure?

    // padded to a l2 cache line
    struct A
    {
    unsigned word m_data;
    char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
    unsigned word m_data;
    char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...
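
    A compilable C11 sketch of "pad _and_ align", under stated assumptions:
    the cache line is taken to be 64 bytes, and unsigned long stands in for
    the "unsigned word" above. The names padded_word, CACHE_LINE, g_a, g_b
    and same_cache_line are hypothetical. Padding alone only fixes the size;
    alignas is what guarantees the start address is on a line boundary.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Aligned to a line boundary and padded out to a full line. */
struct padded_word {
    alignas(CACHE_LINE) unsigned long m_data;
    char padding[CACHE_LINE - sizeof(unsigned long)];
};

/* Compile-time checks: exactly one line in size, line-aligned. */
static_assert(sizeof(struct padded_word) == CACHE_LINE,
              "padded_word must fill exactly one cache line");
static_assert(alignof(struct padded_word) == CACHE_LINE,
              "padded_word must be cache-line aligned");

static struct padded_word g_a, g_b;

/* Returns 1 if both addresses fall into the same cache line. */
static int same_cache_line(const void *p, const void *q)
{
    return ((uintptr_t)p / CACHE_LINE) == ((uintptr_t)q / CACHE_LINE);
}
```

    With both properties in place, g_a.m_data and g_b.m_data can never
    share a line, whatever the linker does with the two objects.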
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 5 09:04:51 2026
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    I don't recall the details of the MIPS evaluation, but we were concerned
    at the time about the scalability of LL/SC.

    I remember listening to a presentation by a student of a colleague
    about implementing garbage collection for IIRC big SGI machines. In
    addition to LL/SC, they had atomic stuff such as fetch-and-add
    implemented in the memory subsystem, not in the processor, and that
    apparently was needed for contended cases to avoid the round-trip time
    through the caches of individual processors. My understanding is
    that, while viewed from the perspective of an individual core, the
    atomic instructions were slow, the throughput in the contended case
    was significantly higher than with LL/SC or an atomic mechanism
    implemented in the individual CPUs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 5 09:12:03 2026
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    These benchmarks use per-thread storage: They are single-threaded.
    ...
    They might be allocated in the same cache line.

    Given that they are accessed by the same thread, I don't expect that
    to hurt, but I did separate the variables by at least 64 bytes in my
    recent runs just in case.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 5 09:14:29 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/4/2026 2:04 PM, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    On 2026-Jun-03 14:19, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !
    ...
    How is your data organized? Show me the structure?

    Shown above. Or, in today's testing:

    variable x 1 x !
    64 allot \ make sure the variables are in different cache lines
    variable y -1 y !

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 5 10:20:30 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    // padded to a l2 cache line
    struct A
    {
    unsigned word m_data;
    char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
    unsigned word m_data;
    char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...

    Why would alignment to cache-line boundaries be necessary?

    Anyway, let's see if it makes a difference.

    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today). A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).

    F) Word at offset 8 from a cache-line start, 48 bytes padding, another
    word (cache-line-aligned).

    And here are the results (on a Ryzen 8700G):

    The cycles per execution of the relevant word for the
    no-atomic/no-barrier variants are:

       !@   +!@ barr
       2.4  2.4  1.8 A B C
       2.4  2.4  1.9 D E

    For the atomic/barrier variants the cycles are:

       !@   +!@ barr
       9.3  8.3  7.2 A
       9.2  8.3  7.1 B
       9.2  8.3  8.5-11.2 C
       9.3  8.3  9.1-11   D
       9.1  8.3  7.3-11   E

    The variations for the barrier column are small for A and B (in the
    range 6.9-7.2), and quite a bit larger for C-E, and I have no
    explanation for that. The other columns show only small variations.
    In any case the aligning and padding recommended by you is not
    superior to the original code, which just uses two variables.

    Here's the code:

    1 [if]
    variable x 1 x !
    64 allot \ make sure the variables are in different cache lines
    variable y -1 y !

    [else]
    : cache-align here dup 64 naligned >align ;
    cache-align
    here 1 , cache-align here -1 , constant y constant x
    [endif]

    The part before the [else] is A, comment out "64 allot" for B.

    The part after the [else] is D, delete the second CACHE-ALIGN for C,
    and replace it with "64 allot" for E.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Jun 5 13:43:11 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    Andy Valencia <vandys@vsta.org> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
    MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
    SPP. After evaluation, we chose Pentium Pro to build the system
    (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were concerned
    at the time about the scalability of LL/SC. SPARC never made it out
    of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was pretty
    fast for certain worksets and algorithms. RMO mode.

    Both technical and business reasons.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Andy Valencia@vandys@vsta.org to comp.arch on Fri Jun 5 07:07:07 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    I don't recall the details of the MIPS evaluation, but we were concerned
    at the time about the scalability of LL/SC. SPARC never made it out
    of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was pretty
    fast for certain worksets and algorithms. RMO mode.

    Sun came through Cisco as well, I don't recall which generation of
    chips, but I remember their focus was on the interface to memory
    itself, targeting radically reduced latency and much higher bandwidth.
    We weren't sure they would get their design out the door, and we were
    pretty sure indeed that they wouldn't make a good enough embedded
    CPU for our purposes. Too big, too hot, too expensive, and so forth.

    At that time (MANY years ago now) Cisco's core router OS was big endian
    only. That kept us from considering x86.

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    No AI was used in the composition of this message
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 5 15:11:22 2026
    From Newsgroup: comp.arch

    On 6/5/2026 3:20 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    // padded to a l2 cache line
    struct A
    {
    unsigned word m_data;
    char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
    unsigned word m_data;
    char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...

    Why would alignment to cache-line boundaries be necessary?

    Anyway, let's see if it makes a difference.

    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today). A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).

    F) Word at offset 8 from a cache-line start, 48 bytes padding, another
    word (cache-line-aligned).

    And here are the results (on a Ryzen 8700G):

    The cycles per execution of the relevant word for the
    no-atomic/no-barrier variants are:

       !@   +!@ barr
       2.4  2.4  1.8 A B C
       2.4  2.4  1.9 D E

    For the atomic/barrier variants the cycles are:

       !@   +!@ barr
       9.3  8.3  7.2 A
       9.2  8.3  7.1 B
       9.2  8.3  8.5-11.2 C
       9.3  8.3  9.1-11   D
       9.1  8.3  7.3-11   E

    The variations for the barrier column are small for A and B (in the
    range 6.9-7.2), and quite a bit larger for C-E, and I have no
    explanation for that. The other columns show only small variations.
    In any case the aligning and padding recommended by you is not
    superior to the original code, which just uses two variables.

    Well, it's mainly for false sharing in a multi-threaded environment. But
    it does matter a bit. If a variable straddles a cache line boundary, a
    LOCK prefixed instruction on it will trigger a bus lock.

    Single-threaded: avoid straddling cache line boundaries to prevent bus
    locks on LOCK prefixed instructions.

    Multi-threaded: pad and align to prevent false sharing between
    independently accessed variables.

    For instance you don't want a mutex word to false share with say an
    atomic counter that has nothing to do with the mutex. They need to be
    padded and aligned...
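
    The straddling condition can be checked with simple address arithmetic.
    A small C sketch, assuming a 64-byte cache line; straddles_cache_line
    and CACHE_LINE are hypothetical names, not from any library. An object
    straddles a boundary when its first and last bytes fall into different
    lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Returns 1 if the object occupying [addr, addr + size) crosses a
   cache-line boundary, 0 if it lies entirely within one line. */
static int straddles_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first_line = addr / CACHE_LINE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE;
    return first_line != last_line;
}
```

    A naturally aligned 8-byte word never straddles a 64-byte line, which
    is why the word-aligned variables in the benchmark are safe from this
    particular effect even without cache-line alignment.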


    Here's the code:

    1 [if]
    variable x 1 x !
    64 allot \ make sure the variables are in different cache lines
    variable y -1 y !

    [else]
    : cache-align here dup 64 naligned >align ;
    cache-align
    here 1 , cache-align here -1 , constant y constant x
    [endif]

    The part before the [else] is A, comment out "64 allot" for B.

    The part after the [else] is D, delete the second CACHE-ALIGN for C,
    and replace it with "64 allot" for E.



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 5 15:27:04 2026
    From Newsgroup: comp.arch

    On 6/5/2026 12:04 AM, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Paul Clayton <paaronclayton@gmail.com> writes:
    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized.

    I have revised the benchmarks as follows: I have added a test of a
    memory barrier, which is implemented in GNU C as

    __atomic_thread_fence(__ATOMIC_SEQ_CST);

    The barriers separate loads.

    I have increased the loop count by a factor of 10, because I did not
    subtract the startup overhead of Gforth; as a result, the startup
    overhead is reduced from 3.3 cycles per execution of the relevant word
    to 0.33 cycles.

    I have also inserted 64 bytes between the variables, so that they are
    in different cache lines. This should not make a difference, because
    all accesses are in the same thread (i.e., no cache-ping-pong from
    possible false sharing), but just in case.

    What I did not do is to use several threads. The idea here is that programmers will take measures that ensure that contention is rare,
    but you still need to use atomic instructions and barriers to ensure correctness. Ideally in this case the atomic instructions and
    barriers have no extra cost, but in reality, they do have extra cost.

    Indeed.


    [snip results]


    Thanks.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 5 15:40:13 2026
    From Newsgroup: comp.arch

    On 6/5/2026 12:04 AM, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Paul Clayton <paaronclayton@gmail.com> writes:
    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized.

    I have revised the benchmarks as follows: I have added a test of a
    memory barrier, which is implemented in GNU C as

    __atomic_thread_fence(__ATOMIC_SEQ_CST);

    The barriers separate loads.
    [...]

    On x86, well, did it fall back to MFENCE? Or use a dummy LOCK RMW on a
    per-thread stack location? IIRC some compilers would use a dummy. Oh
    shit man, 20+ish years ago I was running all sorts of benchmarks on
    MFENCE vs LOCK RMW. Or MFENCE vs MEMBAR #StoreLoad | #LoadStore |
    #StoreStore | #LoadLoad on the SPARC. I could not really directly test
    LOCK RMW wrt x86 on the SPARC because all of the SPARC's atomic RMWs are
    naked. I would have to manually add the barriers to make it TSO in RMO
    mode.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 5 15:43:14 2026
    From Newsgroup: comp.arch

    On 6/5/2026 3:11 PM, Chris M. Thomasson wrote:
    On 6/5/2026 3:20 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    // padded to a l2 cache line
    struct A
    {
    unsigned word m_data;
    char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
    unsigned word m_data;
    char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...

    Why would alignment to cache-line boundaries be necessary?

    Anyway, let's see if it makes a difference.

    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today). A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).

    F) Word at offset 8 from a cache-line start, 48 bytes padding, another
    word (cache-line-aligned).

    And here are the results (on a Ryzen 8700G):

    The cycles per execution of the relevant word for the
    no-atomic/no-barrier variants are:

       !@   +!@ barr
       2.4  2.4  1.8 A B C
       2.4  2.4  1.9 D E

    For the atomic/barrier variants the cycles are:

       !@   +!@ barr
       9.3  8.3  7.2 A
       9.2  8.3  7.1 B
       9.2  8.3  8.5-11.2 C
       9.3  8.3  9.1-11   D
       9.1  8.3  7.3-11   E

    The variations for the barrier column are small for A and B (in the
    range 6.9-7.2), and quite a bit larger for C-E, and I have no
    explanation for that. The other columns show only small variations.
    In any case the aligning and padding recommended by you is not
    superior to the original code, which just uses two variables.

    Well, it's mainly for false sharing in a multi-threaded environment. But
    it does matter a bit. If a variable straddles a cache line boundary, a
    LOCK prefixed instruction on it will trigger a bus lock.

    Single-threaded: avoid straddling cache line boundaries to prevent bus
    locks on LOCK prefixed instructions.

    Actually try to avoid LOCK prefixed anything on single threaded... Even
    XCHG has that implied LOCK prefix. :^)



    Multi-threaded: pad and align to prevent false sharing between
    independently accessed variables.

    For instance you don't want a mutex word to false share with say an
    atomic counter that has nothing to do with the mutex. They need to be
    padded and aligned...
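
    As a rough illustration of "padded and aligned", here is a minimal C11
    sketch. The 64-byte line size and the names padded_word, shared_state,
    mutex_word and counter are assumptions made up for this example, not
    anything from the benchmark:

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas */
#include <stdatomic.h> /* atomic_ulong */
#include <stddef.h>    /* offsetof */

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE 64

/* One word that owns a whole cache line: aligned to the line start
   and padded out to the line end. */
struct padded_word {
    alignas(CACHE_LINE) atomic_ulong m_data;
    char padding[CACHE_LINE - sizeof(atomic_ulong)];
};

/* Hypothetical example: a mutex word and an unrelated counter. */
struct shared_state {
    struct padded_word mutex_word; /* cache line 0 of the struct */
    struct padded_word counter;    /* cache line 1 of the struct */
};

static_assert(sizeof(struct padded_word) == CACHE_LINE,
              "each word must fill exactly one line");
static_assert(offsetof(struct shared_state, counter) == CACHE_LINE,
              "counter must start its own line");
```

    The static_asserts only check the layout at compile time; whether the
    padding pays off still depends on the two words really being written
    by different cores.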


    Here's the code:

    1 [if]
    variable x 1 x !
    64 allot \ make sure the variables are in different cache lines
    variable y -1 y !

    [else]
    : cache-align here dup 64 naligned >align ;
    cache-align
    here 1 , cache-align here -1 , constant y constant x
    [endif]

    The part before the [else] is A, comment out "64 allot" for B.

    The part after the [else] is D, delete the second CACHE-ALIGN for C,
    and replace it with "64 allot" for E.




    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 6 08:14:17 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/5/2026 12:04 AM, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I have revised the benchmarks as follows: I have added a test of a
    memory barrier, which is implemented in GNU C as

    __atomic_thread_fence(__ATOMIC_SEQ_CST);

    The barriers separate loads.
    [...]

    On x86, well, did it fall back to MFENCE? Or use a dummy LOCK RMW on a
    per thread stack location?

    On AMD64, the latter. The code generated by gcc for the line above
    is:

    lock orq $0x0,(%rsp)

    On ARM A64 gcc generates the following:

    dmb ish
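
    For readers who want to reproduce this, a minimal C sketch of "barriers
    separating loads" could look as follows. The variables x and y and the
    function name are stand-ins invented for this example, and the
    __atomic builtins assume a GCC-compatible compiler; which instruction
    the fence turns into may vary with compiler version and options:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two benchmark variables. */
static int64_t x = 1;
static int64_t y = -1;

/* Two loads, each followed by a sequentially consistent fence. */
int64_t load_pair_with_barriers(void)
{
    int64_t a = __atomic_load_n(&x, __ATOMIC_RELAXED);
    /* Reported above as: lock orq $0x0,(%rsp) on AMD64, dmb ish on A64. */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    int64_t b = __atomic_load_n(&y, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    return a + b;
}
```

    Compiling this with -O2 -S and looking at the assembly should show
    what the fence becomes on a given target.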

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 6 08:30:45 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Anyway, let's see if it makes a difference.

    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today). A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).
    [...]
    And here are the results (on a Ryzen 8700G):

    The cycles per execution of the relevant word for the
    no-atomic/no-barrier variants are:

    !@ +!@ barr
    2.4 2.4 1.8 A B C
    2.4 2.4 1.9 D E

    For the atomic/barrier variants the cycles are:

    !@ +!@ barr
    9.3 8.3 7.2 A
    9.2 8.3 7.1 B
    9.2 8.3 8.5-11.2 C
    9.3 8.3 9.1-11 D
    9.1 8.3 7.3-11 E

    The variations for the barrier column are small for A and B (in the
    range 6.9-7.2), and quite a bit larger for C-E, and I have no
    explanation for that.

    Now I have: It's the placement of the native code. If I compile
    another definition

    : dummy1 swap over 2rot ;

    that is never called before all the others, the result for D becomes:

    !@ +!@ barr
    9.3 8.3 7.2 D

    with little variation. So it seems that the code placement of the
    bench-barrier word ran into some microarchitectural hiccup of Zen4.

    Now that I have that problem worked around, let's see if the data
    placement makes a difference:

    !@ +!@ barr
    9.3 8.3 7.2 A
    9.2 8.3 7.1 B
    9.3 8.3 7.0 C
    9.3 8.3 7.2 D
    9.3 8.3 7.2 E

    Making them adjacent in the same cache line is not a disadvantage as
    long as there is no actual communication going on. Of course, in an
    actual application you want them in different cache lines, because
    then you will have communication; otherwise using atomic accesses or
    barriers would be pointless.

    Code (with the data part set up for E):

    0 [if]
    variable x 1 x !
    64 allot \ make sure the variables are in different cache lines
    variable y -1 y !

    [else]
    : dummy1 swap over 2rot ;
    : cache-align here dup 64 naligned >align ;
    cache-align
    here 1 , ( cache-align ) 64 allot here -1 , constant y constant x
    [endif]

    : bench-!@
    1 50_000_000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 50_000_000 0 do x atomic!@ y atomic!@ loop drop ;

    : bench-+!@
    1 50_000_000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 50_000_000 0 do x atomic+!@ y atomic+!@ loop drop ;

    : bench-nobarrier
    50_000_000 0 do x @ y @ 2drop loop ;

    : bench-barrier
    50_000_000 0 do x @ barrier y @ barrier 2drop loop ;

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 6 08:49:06 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/5/2026 3:20 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    // padded to a l2 cache line
    struct A
    {
    unsigned word m_data;
    char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
    unsigned word m_data;
    char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...

    Why would alignment to cache-line boundaries be necessary?
    [...]
    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today). A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).

    F) Word at offset 8 from a cache-line start, 48 bytes padding, another
    word (cache-line-aligned).
    ...
    Well, it's mainly for false sharing in a multi threading environment. But
    it does matter a bit. If your variables straddle a cache line then it
    will trigger a bus lock.

    All of the data placement variants use word-aligned words and thus do
    not straddle cache lines. But your claim was that one should use only
    the first word in a cache line. Avoiding false sharing is important,
    if there is any sharing, but that's not the case for this benchmark.
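
    The arithmetic behind "word-aligned words do not straddle cache lines"
    can be sketched in C. The 64-byte line size, the 8-byte word size and
    the function names are assumptions for this example:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumption: 64-byte cache lines */

/* True if an access of `size` bytes (size >= 1) starting at `addr`
   touches two different cache lines. */
static bool straddles_line(uintptr_t addr, size_t size)
{
    return addr / CACHE_LINE != (addr + size - 1) / CACHE_LINE;
}

/* True if every naturally aligned 8-byte word inside one cache line
   stays within that line. */
static bool aligned_words_never_straddle(void)
{
    for (uintptr_t off = 0; off < CACHE_LINE; off += 8)
        if (straddles_line(off, 8))
            return false;
    return true;
}
```

    An 8-byte word at offset 56 still ends at byte 63 of the same line;
    only a misaligned word (say, at offset 60) crosses into the next one.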

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Jun 6 11:52:09 2026
    From Newsgroup: comp.arch

    On 6/6/2026 1:49 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/5/2026 3:20 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    // padded to a l2 cache line
    struct A
    {
    unsigned word m_data;
    char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
    unsigned word m_data;
    char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...

    Why would alignment to cache-line boundaries be necessary?
    [...]
    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today). A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).

    F) Word at offset 8 from a cache-line start, 48 bytes padding, another
    word (cache-line-aligned).
    ...
    Well, it's mainly for false sharing in a multi threading environment. But
    it does matter a bit. If your variables straddle a cache line then it
    will trigger a bus lock.

    All of the data placement variants use word-aligned words and thus do
    not straddle cache lines. But your claim was that one should use only
    the first word in a cache line. Avoiding false sharing is important,
    if there is any sharing, but that's not the case for this benchmark.

    Fair enough! :^) For a single-threaded benchmark with no concurrent
    sharing, you are right. The layout variants you described ensure no
    single word straddles a cache-line boundary, which completely avoids
    the split-access or bus-lock penalty on a single core. In that specific
    context, packing things tightly is "superior" because my defensive
    padding would just bloat the working set and cause unnecessary cache
    misses.

    Fwiw, my advice to align and pad so a variable exclusively owns the
    first word of a cache line is a habit born entirely out of
    multi-threaded, lock/wait-free architecture design.

    Actually, there is a fundamental difference in intent:

    1. Word alignment: keeps a single thread from split-access penalties
    (straddling). No word from cache line A bleeding into cache line B.

    2. Cache-line alignment + padding: keeps different threads on different
    cores from causing hardware cache-coherence storms (false sharing). Very
    bad!

    If struct A and struct B live in the exact same cache line, they are
    safe from straddling. But the moment Core 0 writes to A and Core 1
    writes to B, the underlying MESI cache-coherence protocol will violently
    bounce that single cache line back and forth between L1 caches.

    Since your benchmark doesn't have concurrent sharing, you only care
    about #1. I default to engineering for #2 defensively because the moment
    code scales out to multiple threads, a well-aligned but unpadded
    structure can cause performance to fall off a cliff.
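
    To make the difference between the two cases concrete, here is a small
    C11 sketch that checks whether two words land in the same cache line.
    The 64-byte line size and the names same_cache_line, packed_pair and
    padded_pair are assumptions for illustration only:

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumption: 64-byte cache lines */

/* True if both addresses fall into the same cache line. */
static bool same_cache_line(const void *p, const void *q)
{
    return (uintptr_t)p / CACHE_LINE == (uintptr_t)q / CACHE_LINE;
}

/* Aligned but packed: neither word straddles, yet both share one line,
   so writers on two cores would contend for it. */
struct packed_pair {
    alignas(CACHE_LINE) uint64_t a;
    uint64_t b;
};

/* Aligned and padded: each word starts its own line. */
struct padded_pair {
    alignas(CACHE_LINE) uint64_t a;
    alignas(CACHE_LINE) uint64_t b;
};
```

    For a packed_pair object, same_cache_line(&p.a, &p.b) holds; for a
    padded_pair it does not. That only identifies the candidates for false
    sharing. The actual cost shows up once two cores write to them.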

    Actually, do you remember the thread offset fiasco from Intel? I
    remember reading a white paper wrt Hyper-Threading, saying that the
    thread stacks should be offset from each other to avoid false sharing.
    It was a workaround for a design error, iirc?
    --- Synchronet 3.22a-Linux NewsLink 1.2