Forum: Too Lazy BBS

Re: ARM CAS vs LL/SC

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue May 26 12:44:20 2026

From Newsgroup: comp.arch

On 5/24/2026 2:24 PM, Paul Clayton wrote:

On 5/21/26 4:17 PM, Scott Lurndal wrote:

Paul Clayton <paaronclayton@gmail.com> writes:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).

Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.

With a large granule, we need to worry about a single load, say via
false sharing or something... Well, can that cause the SC to fail?

FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
using a hashed lock where address of a target word is used to index into
an array. Something akin to:

https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
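
For readers who cannot follow the link, here is a minimal sketch of the hashed-lock idea, not the linked code itself. The names kLockCount, lock_for and emulated_cas, the table size, and the 64-byte shift are all assumptions made for illustration:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>

// ASSUMPTION: table size is arbitrary; a power of two keeps indexing a mask.
constexpr std::size_t kLockCount = 64;

// Map the address of the target word to one mutex in a fixed table.
// Unrelated words may hash to the same slot; that is safe but serializes them.
inline std::mutex& lock_for(const void* addr) {
    static std::array<std::mutex, kLockCount> locks;
    const auto bits = reinterpret_cast<std::uintptr_t>(addr);
    // ASSUMPTION: drop 6 low bits so words in one 64-byte block share a lock.
    return locks[(bits >> 6) & (kLockCount - 1)];
}

// Emulated compare-and-swap. On failure, `expected` receives the value seen,
// mirroring the std::atomic compare_exchange convention.
inline bool emulated_cas(std::uint64_t* target, std::uint64_t& expected,
                         std::uint64_t desired) {
    std::lock_guard<std::mutex> guard(lock_for(target));
    if (*target == expected) {
        *target = desired;
        return true;
    }
    expected = *target;
    return false;
}
```

This only works if every access to the target word, plain loads and stores included, goes through the same lock table.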

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 26 20:58:52 2026


"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

On 5/24/2026 2:24 PM, Paul Clayton wrote:

On 5/21/26 4:17 PM, Scott Lurndal wrote:

Paul Clayton <paaronclayton@gmail.com> writes:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).

Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.

With a large granule, we need to worry about a single load, say via
false sharing or something... Well, can that cause the SC to fail?

Does this "LL/SC and other core instructions synchronization means" not
fall from "desirable" when one has a complete set of to-memory() atomic
actions {add, sub, and, or, xor, xchg, cmp, cas} which avoid all the
quadratic and cubic interconnect traffic in the system which are the
real point of slow synchronization ??!!?? while being guaranteed to
work without an interference and can be done for both cacheable and
unCacheable memory accesses ??!!??

FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
using a hashed lock where address of a target word is used to index into
an array. Something akin to:

https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ


From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue May 26 14:00:36 2026


On 5/26/2026 1:58 PM, MitchAlsup wrote:

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

On 5/24/2026 2:24 PM, Paul Clayton wrote:

On 5/21/26 4:17 PM, Scott Lurndal wrote:

Paul Clayton <paaronclayton@gmail.com> writes:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).

Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.

With a large granule, we need to worry about a single load, say via
false sharing or something... Well, can that cause the SC to fail?

Does this "LL/SC and other core instructions synchronization means" not
fall from "desirable" when one has a complete set of to-memory() atomic actions {add, sub, and, or, xor, xchg, cmp, cas} which avoid all the quadratic and cubic interconnect traffic in the system which are the
real point of slow synchronization ??!!?? while being guaranteed to
work without an interference and can be done for both cacheable and unCacheable memory accesses ??!!??

Take a look at some S/HTM... A single load can cause a retry, and lead to
live lock?

FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
using a hashed lock where address of a target word is used to index into
an array. Something akin to:

https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 27 14:25:19 2026


Paul Clayton <paaronclayton@gmail.com> writes:

On 5/21/26 4:17 PM, Scott Lurndal wrote:

Paul Clayton <paaronclayton@gmail.com> writes:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).

Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.

The less specifically the size is defined, the less performance-
portable software becomes. One can address this with something
like RISC-V profiles, in which sizes can be more specific and
software that cares will specify a target profile rather than an
Architecture (version).

Since granule size can influence what code is most efficient,
even recompiling is not an excellent option. So for a class of
applications, having a single target seems to make sense.

Being able to test software on a development machine can also be
useful, so desired performance compatibility might be broader
than an application type.

I feel there is relatively little to prevent LL/SC semantics
from being extended to support multiple cache blocks (or, for
small LL/SC code bodies, single words for conflicts with other
atomic operations -- normal loads and stores might still use
cache block granularity to limit complexity and/or network
overhead).

It would be limiting to tie LL/SC to cache lines.

It is not tying the operation to cache lines but to cache
line granules in terms of external interference monitoring
(and, in the case of a modest extension beyond traditional
LL/SC, the scope of the read/write set).

Atomics are independent of the cache, and can be used with
both cacheable and non-cacheable memory as well as
CXL and PCI Express devices.

I am not certain that LL/SC (or an extended form of such)
could not be used with "I/O" addresses. This merely requires
the equivalent of one cache line "cache" (or the largest
guaranteed size of a transaction) and some form of
monitoring ("coherence") of such memory addresses.

In the case of a simple operation, as has been stated before,
the LL/SC sequence can be converted to the equivalent of an
atomic instruction.

If true in the general case (and I'm not sure I see how it
can be), why bother to add the hardware to do so when
atomics are generally superior, scalable, simpler to implement and
higher performance?

For other operations, I am not certain what semantics make
sense. If a read at one address changes the behavior of another
access, does "atomic" behavior mean that the later in program
order access happens before the I/O agent changes the access
behavior or does it mean that the atomic action blocks "ordinary
software agents" but lets side effects caused by the action
occur in program order?

Atomics ensure that the access is atomic with respect to
all other accessors - ensuring that the other accessors
will not see inconsistent data.

Atomics can be used as a basis (e.g. atomic test&set) to
guard a critical section, but they're also useful for
adjusting shared counters et alia.
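
A minimal sketch of both uses, written with standard C++ atomics rather than any particular ISA's instructions. TasLock and bump are hypothetical names chosen for this example:

```cpp
#include <atomic>

// Test-and-set spinlock built on std::atomic_flag.
class TasLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    // test_and_set returns the previous value; we own the lock only if
    // the flag was clear before our call.
    bool try_lock() { return !flag_.test_and_set(std::memory_order_acquire); }
    void lock() { while (!try_lock()) { /* spin */ } }
    void unlock() { flag_.clear(std::memory_order_release); }
};

// Shared counter adjusted with a single atomic read-modify-write.
// Returns the new value; no lock is involved.
inline long bump(std::atomic<long>& counter, long delta) {
    return counter.fetch_add(delta, std::memory_order_relaxed) + delta;
}
```

The lock serializes an arbitrary critical section, while the counter needs only the one RMW, which is the case where a to-memory atomic is the natural fit.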

My perception is that PCI-E atomics are not meant for
non-idempotent storage. (I do not know how ARM atomic
instructions handle such cases.)

See above.

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed May 27 14:08:17 2026


On 5/20/2026 4:47 PM, Paul Clayton wrote:

On 5/14/26 3:58 AM, Chris M. Thomasson wrote:

CAS failures, I have tested this in the past, will hit the bus lock
and still make forward progress... Sigh... A horrible LL/SC thing can
live lock!

LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.

In my opinion, this is not so much a CAS vs. LL/SC issue as a quality of implementation issue.

Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
guarantees. Using LL/SC to emulate them is a different story.

A guarantee of forward progress is not very useful if the progress is glacially (or cosmologically) slow. ("We guarantee that the operation
will complete before the heat death of the universe.")

A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is hyper
important to help the software pad and align to remove any false sharing
on said granule. No? But...

Here's where the deeper problem can rear its ugly head... Vendors often don't document it? Or they document it inconsistently across revisions? So
even if you do everything right in principle, you're tuning against a
number you had to dig out of a forum post or reverse engineer yourself.
Scary! ;^o
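
A minimal sketch of the pad-and-align step, assuming a 64-byte granule. That number is a placeholder, not a documented value for any CPU, and kGranule and PaddedCounter are hypothetical names:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// ASSUMPTION: 64 bytes stands in for the real reservation granule, which
// has to come from the vendor's documentation (or measurement).
constexpr std::size_t kGranule = 64;

// alignas forces the start onto a granule boundary and rounds sizeof up to
// a multiple of kGranule, so neighbouring array elements never share one.
struct alignas(kGranule) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(alignof(PaddedCounter) == kGranule, "start on a granule boundary");
static_assert(sizeof(PaddedCounter) % kGranule == 0, "occupy whole granules");
```

If the real granule turns out larger than the constant, the static_asserts still pass and false sharing quietly returns, which is exactly the documentation problem described above.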

Of course, the temptation toward "good enough" (not so bad that one will
lose too many customers) is a problem. I would expect documented
guarantees of sufficient generality to keep the cognitive load on
software developers acceptable. That such guarantees seem to be very
rare is sad.

How many SC failures on a fetch-and-add are acceptable before you
conclude something's fundamentally broken? For me the answer is: very few.


From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed May 27 14:14:11 2026


On 5/27/2026 2:08 PM, Chris M. Thomasson wrote:
[...]

How many SC failures on a fetch-and-add are acceptable before you
conclude something's fundamentally broken? For me the answer is: very few.

A LOCK XADD can be used for wait free algos, a LOCK XADD emulated with
LL/SC cannot... ?
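
To make the contrast concrete, here is a sketch in portable C++. add_direct and add_retry are hypothetical names; the retry loop uses compare_exchange_weak as a stand-in for an LL/SC sequence, which is an assumption about shape, not a claim about any specific core:

```cpp
#include <atomic>

// Single RMW. On x86 compilers typically emit LOCK XADD for this, so the
// caller finishes in one instruction no matter what other threads do.
inline long add_direct(std::atomic<long>& c, long d) {
    return c.fetch_add(d, std::memory_order_seq_cst);
}

// Retry-loop emulation. `retries` counts failed attempts; nothing in the
// loop itself bounds that count, so the loop is lock-free at best.
inline long add_retry(std::atomic<long>& c, long d, unsigned long& retries) {
    long old = c.load(std::memory_order_relaxed);
    while (!c.compare_exchange_weak(old, old + d,
                                    std::memory_order_seq_cst,
                                    std::memory_order_relaxed)) {
        ++retries;  // `old` now holds the value that beat us; try again
    }
    return old;  // value before our add, same convention as fetch_add
}
```

Counting retries under contention is one way to put a number on "how many SC failures are acceptable" for a given machine.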


From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed May 27 14:24:36 2026


On 5/27/2026 2:14 PM, Chris M. Thomasson wrote:

On 5/27/2026 2:08 PM, Chris M. Thomasson wrote:
[...]

How many SC failures on a fetch-and-add are acceptable before you
conclude something's fundamentally broken? For me the answer is: very
few.

A LOCK XADD can be used for wait free algos, a LOCK XADD emulated with
LL/SC cannot... ?

For x86, it's "easier" for sure... pad _and_ align on an L2 cache line,
and you should be ideal... So do NOT straddle a cache line and execute a
damn LOCK RMW on it. Bus lock for sure.
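
A small sketch of a debug-time check for the straddling case. The 64-byte line size is an assumption, and kLineSize and straddles_line are hypothetical names:

```cpp
#include <cstddef>
#include <cstdint>

// ASSUMPTION: 64-byte cache lines; substitute the documented value.
constexpr std::size_t kLineSize = 64;

// True when the object [p, p + size) crosses a line boundary.
// `size` must be at least 1.
inline bool straddles_line(const void* p, std::size_t size,
                           std::size_t line = kLineSize) {
    const auto first = reinterpret_cast<std::uintptr_t>(p);
    const auto last = first + size - 1;
    return (first / line) != (last / line);
}
```

Asserting !straddles_line(&word, sizeof word) before handing a word to a LOCK RMW catches the misaligned case early, for instance when the word lives inside a packed struct.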

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 28 01:27:36 2026


"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

On 5/20/2026 4:47 PM, Paul Clayton wrote:

On 5/14/26 3:58 AM, Chris M. Thomasson wrote:

CAS failures, I have tested this in the past, will hit the bus lock
and still make forward progress... Sigh... A horrible LL/SC thing can
live lock!

LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.

In my opinion, this is not so much a CAS vs. LL/SC issue as a quality of implementation issue.

Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
guarantees. Using LL/SC to emulate them is a different story.

A guarantee of forward progress is not very useful if the progress is glacially (or cosmologically) slow. ("We guarantee that the operation
will complete before the heat death of the universe.")

A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is hyper important to help the software pad and align to remove any false sharing
on said granule. No? But...

Here's where the deeper problem can rear its ugly head... Vendors often don't document it? Or they document it inconsistently across revisions? So
even if you do everything right in principle, you're tuning against a
number you had to dig out of a forum post or reverse engineer yourself. Scary! ;^o

Of course, the temptation toward "good enough" (not so bad that one will
lose too many customers) is a problem. I would expect documented
guarantees of sufficient generality to keep the cognitive load on
software developers acceptable. That such guarantees seem to be very
rare is sad.

How many SC failures on a fetch-and-add are acceptable before you
conclude something's fundamentally broken? For me the answer is: very few.

Following an "SC failure", My 66000 provides a readable control register
called 'WHY' which contains a number. Negative numbers represent kinds
of failures {resource limit exceeded, time out, ...} while positive
values indicate how far back in line your request is (measured by a
resource which has unique system-wide visibility to ATOMIC-order).

Thus, SW can use WHY to reach deeper into the Queue of pending work and
select a unit that nobody else is going to go after on the next iteration.


From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun May 31 21:32:14 2026


On 5/27/26 5:08 PM, Chris M. Thomasson wrote:

On 5/20/2026 4:47 PM, Paul Clayton wrote:

On 5/14/26 3:58 AM, Chris M. Thomasson wrote:

CAS failures, I have tested this in the past, will hit the
bus lock and still make forward progress... Sigh... A
horrible LL/SC thing can live lock!

LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.

In my opinion, this is not so much a CAS vs. LL/SC issue as a
quality of implementation issue.

Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
guarantees. Using LL/SC to emulate them is a different story.

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations. IBM's constrained
transactions guaranteed success of a transaction if it met
certain criteria. A single-instruction LL/SC body could be
Architecturally guaranteed to perform not only successfully but
with some performance characteristics.

A guarantee of forward progress is not very useful if the
progress is glacially (or cosmologically) slow. ("We guarantee
that the operation will complete before the heat death of the
universe.")

A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is
hyper important to help the software pad and align to remove any
false sharing on said granule. No? But...

I disagree. A guarantee that has a time scale beyond human
civilization much less the lifetime of the hardware seems to
have extremely little use. It may be reasonable to assume
reasonable timescales for such guarantees, but a simple
guarantee of eventual completion (if the system is kept
operating) might be given if the profit motive seems sufficient.

(I am not certain if even x86 XLOCK operations are absolutely
guaranteed to complete in the presence of context switches. A
hardware thread might always be interrupted while it is
performing the operation and if the hardware does not delay
interrupt handling until after the operation completes, then the
operation may never complete. This may be so extraordinarily
improbable that an undetected error in ECC-protected memory
might be more likely, in which case it is not really important.)

I think one really wants the time scale explicitly declared as
well as information about the range of latency and causes. Even
5ms latency can seem like forever.

Here's where the deeper problem can rear its ugly head... Vendors
often don't document it? Or they document it inconsistently
across revisions? So even if you do everything right in
principle, you're tuning against a number you had to dig out of
a forum post or reverse engineer yourself. Scary! ;^o

Ugh!

Architecting a lot of such factors might help with documentation
as Architecture is more stable than microarchitecture, but I do
not think typical companies have the incentives for excellence
in documentation. If the only consequence of mistakes in
Architectural documentation is a few software developers
grumbling, keeping even such stable documentation consistent and
correct (and abiding by the old/existing Architectural contract)
seems unlikely to seem important. In fact, if the inability to
optimize forces people to buy more (or more expensive) hardware,
poor documentation can mean higher profits.

Of course, the temptation toward "good enough" (not so bad
that one will lose too many customers) is a problem. I would
expect documented guarantees of sufficient generality to keep
the cognitive load on software developers acceptable. That
such guarantees seem to be very rare is sad.

How many SC failures on a fetch-and-add are acceptable before
you conclude something's fundamentally broken? For me the answer
is: very few.

Again, I think this is more about "quality of
implementation" (and Architectural guarantees about such) than
about the interface at an instruction level.

From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun May 31 23:26:39 2026


On 5/27/26 10:25 AM, Scott Lurndal wrote:

Paul Clayton <paaronclayton@gmail.com> writes:

[snip]

In the case of a simple operation, as has been stated before,
the LL/SC sequence can be converted to the equivalent of an
atomic instruction.

If true in the general case (and I'm not sure I see how it
can be), why bother to add the hardware to do so when
atomics are generally superior, scalable, simpler to implement and
higher performance?

A more generic interface has some advantages.

I already mentioned that old software that was developed when
there was not an atomic ["expensive" operation] instruction
could benefit from idiom recognition on new hardware. (An
alternative to that would be patching or recompiling the
software. While I prefer a more abstract software distribution
format for its ability to avoid having to move things to
Architecture and even potentially perform microarchitectural
optimizations at non-instruction granularity, such seems
unlikely to be common any time soon.)

Even with atomic instructions, the Architecture generally does
not provide guarantees about scalability. I doubt any
implementation would stop-the-world to perform an atomic
operation (because the performance penalty would be quite
noticeable), but I can easily imagine an implementation
waiting until the atomic operation is not speculative before
starting it.

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized. (System calls
have similarly excessive latency, in my opinion. Some of this may
be from cruft, but I received the impression that a lack of
optimization effort is a significant cause of the higher latency.)

I do not like the code bloat and decode complexity of using
LL/SC for simple atomic operations. Unfortunately, even an
LL-and-SC-after-next-compute instruction (which would allow
arbitrary single compute instruction atomics and might be
extended by function call instructions to microcode) would have
the bloat of redundant register name encoding. Even a diversity
of addressing modes may be excessive for atomic operations, if
simple register-indirect with no offset is sufficiently common.

With destructive operations (like x86), it would be possible to
avoid the register name overhead by having the LL instruction
not include a register name, taking it from the following
compute instruction. For an LL instruction lacking a register
name, if "microcode" calls were to be supported such call
instructions would need to specify a register name (or use a
defined, possibly function-specific ABI). An opcode-only LL
might reasonably have space for hint/directive metadata, which
might be useful.

My objection to specific atomic instructions is mainly that
they are specific. If an operation later becomes a reasonable
target for such an instruction, a new instruction must be
allocated to provide that operation. That new instruction would
only be available to new software.

For other operations, I am not certain what semantics make
sense. If a read at one address changes the behavior of another
access, does "atomic" behavior mean that the later in program
order access happens before the I/O agent changes the access
behavior or does it mean that the atomic action blocks "ordinary
software agents" but lets side effects caused by the action
occur in program order?

Atomics ensure that the access is atomic with respect to
all other accessors - ensuring that the other accessors
will not see inconsistent data.

I think I communicated poorly. I was thinking about what the
appropriate behavior of an atomic add operation (however
encoded) should be when targeting an address with side effects.
The simple choice is "don't do that" (undefined behavior). The
slightly more complex choice is fault on bad behavior.

Yet one might argue that targeting such an address for an atomic
operation could be useful in some particular context. Supporting
such means making a choice of how the side effect is handled.

(I am inclined to just having such fault, but that needs to be
defined as it means that acquiring a lock, performing a read,
operating on the read value, writing the result, and releasing
the lock is not functionally equivalent to an atomic operation.)

Is the read side effect ignored? For side effects limited to the
accessed address, this would seem to be the same as the side
effect happening "between" the read and the write. For side
effects with external effects, those would also be suppressed,
making such different than having the side effect occur
"between" the read and the write.

Is the side effect done "between" the read and the write of the
"atomic" operation? This would presumably overwrite the address-
local side effect while producing other side effects, which
might seem very strange as the side effect would use the old
value for any value-dependent side effects.

Is the side effect performed after the atomic operation? This
could also be confusing.

Even if the side effect does not change the value at the
address, the value before or after the atomic operation might be
used to determine what the side effect is.

Removing side effects places atomics in a special category,
which may be reasonable but is not a choice 100% obvious to
everyone. Consistently and sensibly ordering side effects with
atomics seems challenging.

Such side effects are like atomic operations, which leads to a
conflict. If the non-side effect operation is truly atomic, one
might break the definition of the side effect.

I would guess that each device would choose its supported
behavior, but that would seem to add unnecessary complexity.
Just faulting on such use seems sensible, but then one needs
to distinguish between addresses that fault and addresses that
allow atomic operations.

I just looked it up: Power (version 2.06B), as an example,
restricts Load Reserved to coherent memory: "The storage
location specified by the Load And Reserve and Store Conditional
instructions must be in storage that is Memory Coherence
Required if the location may be modified by another processor or
mechanism. If the specified location is in storage that is Write
Through Required or Caching Inhibited, the system data storage
error handler or the system alignment error handler is invoked
for the Server environment and may be invoked for the Embedded
environment." I therefore suspect that even if such was
extended to support PCI-E atomics, addresses with side effects
would fault.

Atomics can be used as a basis (e.g. atomic test&set) to
guard a critical section, but they're also useful for
adjusting shared counters et alia.

(There seem to be a lot of alia/other uses. Atomic OR seems like
a useful means of supporting multiple "named" read locks; if
implemented aggressively, atomic OR could even be used for
bit-sized locks in combination with atomic AND.)
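
A minimal sketch of the bit-sized lock idea using standard C++ atomics. try_lock_bit and unlock_bit are hypothetical names, and the memory orders are one reasonable choice, not the only one:

```cpp
#include <atomic>
#include <cstdint>

// Try to take the lock represented by bit `bit` (0..31) of `word`.
// fetch_or returns the previous value, so we own the lock only if the
// bit was clear before our OR landed.
inline bool try_lock_bit(std::atomic<std::uint32_t>& word, unsigned bit) {
    const std::uint32_t mask = std::uint32_t{1} << bit;
    return (word.fetch_or(mask, std::memory_order_acquire) & mask) == 0;
}

// Release the lock by clearing only our bit with an atomic AND.
inline void unlock_bit(std::atomic<std::uint32_t>& word, unsigned bit) {
    const std::uint32_t mask = std::uint32_t{1} << bit;
    word.fetch_and(~mask, std::memory_order_release);
}
```

Thirty-two independent locks share one word here, so they also share one cache line; whether that pays off depends on how aggressively the hardware implements the atomic OR and AND, as noted above.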

My perception is that PCI-E atomics are not meant for
non-idempotent storage. (I do not know how ARM atomic
instructions handle such cases.)

See above.

The "above" statement was not clear to me. An I/O device's
read side effect does not play nicely with the concept of
atomic. One could define the atomic not to actually "read"
the device register (no side effect), but I think one
cannot just say the operation is atomic.

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jun 2 01:27:53 2026


Paul Clayton <paaronclayton@gmail.com> posted:

On 5/27/26 10:25 AM, Scott Lurndal wrote:

Paul Clayton <paaronclayton@gmail.com> writes:

[snip]

In the case of a simple operation, as has been stated before,
the LL/SC sequence can be converted to the equivalent of an
atomic instruction.

If true in the general case (and I'm not sure I see how it
can be), why bother to add the hardware to do so when
atomics are generally superior, scalable, simpler to implement and
higher performance?

A more generic interface has some advantages.

I already mentioned that old software that was developed when
there was not an atomic ["expensive" operation] instruction
could benefit from idiom recognition on new hardware. (An
alternative to that would be patching or recompiling the
software. While I prefer a more abstract software distribution
format for its ability to avoid having to move things to
Architecture and even potentially perform microarchitectural
optimizations at non-instruction granularity, such seems
unlikely to be common any time soon.)

Even with atomic instructions, the Architecture generally does
not provide guarantees about scalability. I doubt any
implementation would stop-the-world to perform an atomic
operation (because the performance penalty would be quite
noticeable), but I can easily imagine an implementation
waiting until the atomic operation is not speculative before
starting it.

Understand that LOCK XADD [...] to MMI/O does exactly this !

But note: XADD [...] never causes more than necessary bus traffic
and as an atomic event, never fails, never needs retry, ...

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jun 2 01:38:51 2026


Paul Clayton <paaronclayton@gmail.com> posted:

On 5/27/26 5:08 PM, Chris M. Thomasson wrote:

On 5/20/2026 4:47 PM, Paul Clayton wrote:

On 5/14/26 3:58 AM, Chris M. Thomasson wrote:

CAS failures, I have tested this in the past, will hit the
bus lock and still make forward progress... Sigh... A
horrible LL/SC thing can live lock!

LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.

In my opinion, this is not so much a CAS vs. LL/SC issue as a
quality of implementation issue.

Well, making a LOCK CAS, or say LOCK XADD, has certain inherent guarantees. Using LL/SC to emulate them is a different story.

Academic LL/SC: I can agree with this statement. But neither ASF nor
ESM has problems making stronger guarantees--and I did this over
{7 ASF, 8 ESM} cache lines, not 1 single memory location. These also
impose limitations on instruction order, and SW has to understand
several non-von Neumann properties of the ATOMIC event.

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

That standard academic stuff cannot does not mean it absolutely
cannot be done.

IBM's constrained
transactions guaranteed success of a transaction if it met
certain criteria. A single-instruction LL/SC body could be
Architecturally guaranteed to perform not only successfully but
with some performance characteristics.

A guarantee of forward progress is not very useful if the
progress is glacially (or cosmologically) slow. ("We guarantee
that the operation will complete before the heat death of the
universe.")

A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is
hyper important to help the software pad and align to remove any
false sharing on said granule. No? But...

I disagree. A guarantee that has a time scale beyond human
civilization much less the lifetime of the hardware seems to
have extremely little use. It may be reasonable to assume
reasonable timescales for such guarantees, but a simple
guarantee of eventual completion (if the system is kept
operating) might be given if the profit motive seems sufficient.

(I am not certain if even x86 XLOCK operations are absolutely
guaranteed to complete in the presence of context switches. A
hardware thread might always be interrupted while it is
performing the operation and if the hardware does not delay
interrupt handling until after the operation completes, then the
operation may never complete. This may be so extraordinarily
improbable that an undetected error in ECC-protected memory
might be more likely, in which case it is not really important.)

I think one really wants the time scale explicitly declared as
well as information about the range of latency and causes. Even
5ms latency can seem like forever.

Here's where the deeper problem can rear its ugly head... Vendors
often don't document it? Or they document it inconsistently
across revisions? So even if you do everything right in
principle, you're tuning against a number you had to dig out of
a forum post or reverse engineer yourself. Scary! ;^o

Ugh!

Architecting a lot of such factors might help with documentation
as Architecture is more stable than microarchitecture, but I do
not think typical companies have the incentives for excellence
in documentation. If the only consequence of mistakes in
Architectural documentation is a few software developers
grumbling, keeping even such stable documentation consistent and
correct (and abiding by the old/existing Architectural contract)
seems unlikely to seem important. In fact, if the inability to
optimize forces people to buy more (or more expensive) hardware,
poor documentation can mean higher profits.

It took me more than 35 years to learn how to write µArchitecture
documents such that a malevolent engineer could not misunderstand
what was written and specified. Try it, it is not easy. It is not
something that can be taught, but it is something that diligence
and perseverance can deliver.

Of course, the temptation toward "good enough" (not so bad
that one will lose too many customers) is a problem. I would
expect documented guarantees of sufficient generality to keep
the cognitive load on software developers acceptable. That
such guarantees seem to be very rare is sad.

How many SC failures on a fetch-and-add are acceptable before
you conclude something's fundamentally broken? For me the answer
is: very few.

How many SC failures are acceptable if there are 1024 cores all
going after the same lock ??

Again, I think this is more concerned with "quality of
implementation" (and Architectural guarantees about such) than
with the interface at an instruction level.


From Paul Clayton@paaronclayton@gmail.com to comp.arch on Tue Jun 2 14:42:12 2026

From Newsgroup: comp.arch

On 6/1/26 9:27 PM, MitchAlsup wrote:
[snip]

But note: XADD [...] never causes more than necessary bus traffic

I am skeptical that this is Architecturally guaranteed. It may
fall out of any even semi-sane implementation, in which case
programmers might be willing to take it as guaranteed. Yet I
suspect "sanity" may not be reliable with changing tradeoffs
(including whether protecting a company's reputation has value).

and as an atomic event, never fails, never needs retry, ...

I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
etc.) could provide such guarantees, even extending to multiple
contiguous instructions operating on data within an aligned
64-byte region.

Interestingly, it seems that IBM's z17 is the last
implementation to support constrained transactions. I do wonder
why this feature has been removed from the Architecture.

Constrained transactions had these restrictions (from https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
| - The transaction executes no more than 32 instructions.
| - All instructions within the transaction must be within 256
| contiguous bytes of storage.
| - The only branches you may use are relative branches that
| branch forward (so there can be no loops).
| - All SS and SSE-format instructions may not be used.
| - Additional general instructions may not be used.
| - The transaction's storage operands may not access more than
| four octowords.
| - The transaction may not access storage operands in any 4 K-byte
| blocks that contain the 256 bytes of storage beginning
| with the TBEGINC instruction.
| - Operand references must be within a single doubleword,
| except for some of the "multiple" instructions for which the
| limitation is a single octoword.
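
To make the quoted restrictions concrete, here is a minimal sketch of a checker for a few of them (instruction count, the 256-byte code window, forward-only branches, and the four-octoword operand limit). Everything in it is hypothetical illustration: the `TxInstr` record and `meets_constraints` are made-up names, instruction lengths are ignored, and an octoword is taken to be an aligned 32-byte unit. It is not how any real z/Architecture implementation validates a transaction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical description of one instruction in a candidate transaction.
struct TxInstr {
    std::uint64_t addr;          // address of the instruction itself
    bool is_branch;              // true for a relative branch
    std::uint64_t branch_target; // only meaningful when is_branch is true
    bool has_operand;            // true if the instruction touches storage
    std::uint64_t operand_addr;  // only meaningful when has_operand is true
};

// Checks a subset of the quoted restrictions. Assumption: an octoword is an
// aligned 32-byte unit, and body.front() is the first instruction in storage.
bool meets_constraints(const std::vector<TxInstr>& body) {
    if (body.empty() || body.size() > 32) return false;      // <= 32 instructions
    const std::uint64_t base = body.front().addr;
    std::set<std::uint64_t> octowords;                        // distinct operand octowords
    for (const TxInstr& in : body) {
        if (in.addr < base || in.addr - base >= 256) return false;       // 256-byte window
        if (in.is_branch && in.branch_target <= in.addr) return false;   // forward only
        if (in.has_operand) octowords.insert(in.operand_addr / 32);
    }
    return octowords.size() <= 4;                             // <= four octowords
}
```

A body with a backward branch, or with operands spread over five octowords, is rejected; the restrictions on instruction formats and on operand/code page overlap are left out of this sketch.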

I think I read that the first implementation made an optimistic
attempt and later (I do not remember if multiple optimistic
attempts were made) a hardware lock was used. Perhaps four
addresses cause too much of a slowdown when there is conflict???

I believe that guaranteeing completion would be substantially
easier with only one aligned 64-byte region. (As I think I
wrote before, adding a single "word" exportable atomic operation
in a different "cache block" _might_ be practical to implement
though I did not have an idea for how software would express such.
I may be wrong that appending such an exportable operation would
not make ensuring completion significantly more difficult.)

I think such guaranteed atomic sequences would require a
distinct instruction not only to allow what IBM did (making such
an illegal/faulting instruction) but also to fault when the
instruction is misused since no fallback path is provided.

There also seem to be other operations that would not (I think)
be exceptionally difficult to guarantee. E.g., swapping cache
blocks might not be much more difficult to guarantee than quick
operations within a single cache block, though I do not know
how useful such an unconditional swap would be. Atomic cache
block copy would seem to be easier (it is similar to a block
zeroing instruction except that the value is taken from a block
that is not writeable by other agents being in exclusive or
shared state). Guaranteeing atomicity for a copy into a cache
block (where two contiguous cache blocks might be in the read
set and the write is only to part of a cache block) seems a
little more complicated.

With conventional cache coherence, partial writes seem likely to
be complex. If masked cache block updates were possible as an
exportable atomic operation, it might be practical to lock (NAK-
guard) a limited read set and push the update to the owner. I do
not know if such an update independent of previous values in the
written cache block would be useful.

I am certainly not comfortable thinking about the visibility/
ordering constraints, so my guesses are very wrong about what is
practical to guarantee as atomic.

Even if an operation can practically be guaranteed, it may not
be worthwhile to provide an interface that allows requesting
such a guaranteed atomic operation.

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jun 2 19:36:06 2026

From Newsgroup: comp.arch

Paul Clayton <paaronclayton@gmail.com> posted:

On 6/1/26 9:27 PM, MitchAlsup wrote:
[snip]

But note: XADD [...] never causes more than necessary bus traffic

I am skeptical that this is Architecturally guaranteed. It may
fall out of any even semi-sane implementation, in which case
programmers might be willing to take it as guaranteed. Yet I
suspect "sanity" may not be reliable with changing tradeoffs
(including whether protecting a company's reputation has value).

The core is going to package this instruction up and ship it
across the interconnect as a fire-and-forget transaction.

The interconnect is going to route the package towards either a
cache having write permission or a control register.

The cache or control register will perform the packaged calculation
and optionally send back the previous value.

The core receives the optional previous value and the memory-atomic
is complete:: 2 interconnect messages, both smaller than a cache line,
no cache lines are moved, and the calculation cannot fail. The only
failure mode is if the interconnect message fails ECC check in either
direction.
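
The message flow described above can be pictured with a toy software model. This is only a sketch under stated assumptions: `AtomicRequest`, `HomeNode`, and `Core` are hypothetical names, the "interconnect" is an ordinary function call, and nothing here models real coherence traffic. It just shows that the requester sends one small request, gets one small reply, and has no retry loop.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

enum class AtomicOp { Add, Swap };

// Request message: address, opcode, operand. Much smaller than a cache line.
struct AtomicRequest {
    std::uint64_t addr;
    AtomicOp op;
    std::uint64_t operand;
};

// Toy stand-in for the cache (or control register) that owns the data.
class HomeNode {
public:
    // Performs the packaged calculation in place and returns the previous value.
    std::uint64_t apply(const AtomicRequest& req) {
        std::uint64_t& cell = memory_[req.addr]; // absent cells start at 0
        const std::uint64_t previous = cell;
        cell = (req.op == AtomicOp::Add) ? previous + req.operand : req.operand;
        return previous;
    }
private:
    std::unordered_map<std::uint64_t, std::uint64_t> memory_;
};

// Toy core: one message out, one reply back, no retry path.
struct Core {
    int messages = 0;
    std::uint64_t fetch_add(HomeNode& home, std::uint64_t addr, std::uint64_t v) {
        ++messages;                                          // request message
        const std::uint64_t old = home.apply({addr, AtomicOp::Add, v});
        ++messages;                                          // reply with old value
        return old;
    }
};
```

Two fetch-and-adds cost four messages in this model, and the data never leaves the home node.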

and as an atomic event, never fails, never needs retry, ...

I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
etc.) could provide such guarantees,

If so, you will be surprised when you implement one.

even extending to multiple
contiguous instructions operating on data within an aligned
64-byte region.

Where it becomes cubically harder.

Interestingly, it seems that IBM's z17 is the last
implementation to support constrained transactions. I do wonder
why this feature has been removed from the Architecture.

SW TM wants the TM model to support an unbounded number of memory
elements in the single transaction. HW does not do unbounded.

Constrained transactions had these restrictions (from https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
| - The transaction executes no more than 32 instructions.

I used a timer--to the same ends.

| - All instructions within the transaction must be within 256
| contiguous bytes of storage.

I allow calls to subroutines in the event.

| - The only branches you may use are relative branches that
| branch forward (so there can be no loops).

Loops are OK as long as the timer does not go off.

| - All SS and SSE-format instructions may not be used.

Agreed.

| - Additional general instructions may not be used.

I see no reason to limit general calculations and memory access.

| - The transaction's storage operands may not access more than
| four octowords.

8 cache lines participate, an unbounded number of cache lines
can be accessed as long as participants is no larger than 8.

| - The transaction may not access storage operands in any 4 K-byte
| blocks that contain the 256 bytes of storage beginning
| with the TBEGINC instruction.

Interesting.

| - Operand references must be within a single doubleword,
| except for some of the "multiple" instructions for which the
| limitation is a single octoword.

Any normal memory references to the participating lines.

I think I read that the first implementation made an optimistic
attempt and later (I do not remember if multiple optimistic
attempts were made) a hardware lock was used. Perhaps four
addresses cause too much of a slowdown when there is conflict???

I believe that guaranteeing completion would be substantially
easier with only one aligned 64-byte region. (As I think I
wrote before, adding a single "word" exportable atomic operation
in a different "cache block" _might_ be practical to implement
though I did not have an idea for how software would express such.
I may be wrong that appending such an exportable operation would
not make ensuring completion significantly more difficult.)

If you take the necessary 6 months to slug through all issues
you can find solutions for the disjoint participants to be at
least as large as the outstanding Miss Buffer size (or MB-1).

I think such guaranteed atomic sequences would require a
distinct instruction not only to allow what IBM did (making such
an illegal/faulting instruction) but also to fault when the
instruction is misused since no fallback path is provided.

If you do it right, your architecture sets up failure paths,
so that if failure happens, IP reverts to the failure point
without executing a branch instruction. I have an instruction
that samples 'interference' and changes the failure point as
a necessary addition. Any interrupt or exception transfers
control to failure point before performing exception control
transfer.

There also seem to be other operations that would not (I think)
be exceptionally difficult to guarantee. E.g., swapping cache
blocks might not be much more difficult to guarantee than quick
operations within a single cache block, though I do not know
how useful such an unconditional swap would be. Atomic cache
block copy would seem to be easier (it is similar to a block
zeroing instruction except that the value is taken from a block
that is not writeable by other agents being in exclusive or
shared state). Guaranteeing atomicity for a copy into a cache
block (where two contiguous cache blocks might be in the read
set and the write is only to part of a cache block) seems a
little more complicated.

The thing that makes this so difficult is that most µArchitectures
cannot guarantee that 2 cache lines are ever simultaneously present
in the cache. ASF and ESM have means to do this which greatly
strengthens the guarantee of forward progress.

My 66000 includes priority in memory transactions, and this enables
the cache with write permission to determine to allow the request
or to fail the request (request is at equal or lower priority) thus
allowing the higher priority ATOMIC event to make forward progress
at the expense of the lower priority event.
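
As a rough illustration of the arbitration rule just described (and only that rule), here is a minimal sketch. The names `LineOwner`, `Response`, and `arbitrate` are hypothetical, and the convention that a larger integer means higher priority is an assumption of this sketch, not something taken from the My 66000 specification.

```cpp
#include <cassert>

enum class Response { Grant, Nak };

// State of the cache that currently holds write permission for a line.
struct LineOwner {
    bool in_atomic_event; // owner is inside an ATOMIC event using this line
    int priority;         // priority of that event (assumed: larger = higher)
};

// A request at equal or lower priority than the owner's event is failed;
// only a strictly higher-priority request is allowed through.
Response arbitrate(const LineOwner& owner, int requester_priority) {
    if (!owner.in_atomic_event) return Response::Grant; // nothing to protect
    return (requester_priority > owner.priority) ? Response::Grant
                                                 : Response::Nak;
}
```

With an owner at priority 5, requests at priority 5 or 3 are NaKed and a request at priority 7 is granted, so the higher-priority event is the one that keeps making progress.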

At certain times the core may be in a position where it can finish
an event if the cache lines can be guaranteed. During this period,
a core can NaK a request so that the event is guaranteed to finish.

With conventional cache coherence, partial writes seem likely to
be complex. If masked cache block updates were possible as an
exportable atomic operation, it might be practical to lock (NAK-
guard) a limited read set and push the update to the owner. I do
not know if such an update independent of previous values in the
written cache block would be useful.

It is much worse than that in practice. The interconnect protocol and
the cache coherence model HAVE to HAVE ATOMIC event forward progress
fully integrated. MESI and MOESI are insufficient here; most directory coherence protocols are also insufficient.

I am certainly not comfortable thinking about the visibility/
ordering constraints, so my guesses are very wrong about what is
practical to guarantee as atomic.

See Lamport...

Even if an operation can practically be guaranteed, it may not
be worthwhile to provide an interface that allows requesting
such a guaranteed atomic operation.

...

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Jun 2 13:52:39 2026

From Newsgroup: comp.arch

On 6/1/2026 6:38 PM, MitchAlsup wrote:

Paul Clayton <paaronclayton@gmail.com> posted:

On 5/27/26 5:08 PM, Chris M. Thomasson wrote:

On 5/20/2026 4:47 PM, Paul Clayton wrote:

On 5/14/26 3:58 AM, Chris M. Thomasson wrote:

CAS failures, I have tested this in the past, will hit the
bus lock and still make forward progress... Sigh... A
horrible LL/SC thing can live lock!

LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.

In my opinion, this is not so much a CAS vs. LL/SC issue as a
quality of implementation issue.

Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
guarantees. Using LL/SC to emulate them is a different story.
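
In portable C++ terms the difference looks roughly like the sketch below. `fetch_add_native` and `fetch_add_emulated` are hypothetical helper names; what instructions they compile to depends on the target and compiler. `compare_exchange_weak` is allowed to fail spuriously, which is how a lost LL/SC reservation surfaces in portable code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Single-operation style: on x86 this typically becomes LOCK XADD.
// There is no retry loop visible to software.
inline std::uint64_t fetch_add_native(std::atomic<std::uint64_t>& a,
                                      std::uint64_t v) {
    return a.fetch_add(v, std::memory_order_seq_cst);
}

// Emulated style: a software retry loop. Each failed attempt reloads the
// current value into `expected`; `retries` counts how often that happened.
inline std::uint64_t fetch_add_emulated(std::atomic<std::uint64_t>& a,
                                        std::uint64_t v, unsigned& retries) {
    std::uint64_t expected = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(expected, expected + v,
                                    std::memory_order_seq_cst,
                                    std::memory_order_relaxed)) {
        ++retries; // contention or a spurious failure; try again
    }
    return expected; // value observed just before the successful update
}
```

The emulated version has no bound on `retries` unless the hardware (or the architecture document) provides one, which is the point under discussion.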

Academic LL/SC: I can agree with this statement. But neither ASF nor
ESM has problems making stronger guarantees--and I did this over
{7 ASF, 8 ESM} cache lines not 1 single memory location. These also
impose limitations on instruction order and SW has to understand
several non-von Neumann properties of the ATOMIC event.

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

That standard academic stuff cannot, does not mean it absolutely
cannot be done.

IBM's constrained
transactions guaranteed success of a transaction if it met
certain criteria. A single-instruction LL/SC body could be
Architecturally guaranteed to perform not only successfully but
with some performance characteristics.

A guarantee of forward progress is not very useful if the
progress is glacially (or cosmologically) slow. ("We guarantee
that the operation will complete before the heat death of the
universe")

A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is
hyper important to help the software pad and align to remove any
false sharing on said granule. No? But...
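
A minimal sketch of that padding and alignment, assuming a 64-byte granule. The constant `kAssumedGranule` and the names `PaddedWord` and `same_granule` are hypothetical; the real granule size is implementation specific and has to come from the vendor documentation (or measurement).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed reservation granule size; NOT a documented value for any real part.
constexpr std::size_t kAssumedGranule = 64;

// One atomic word per granule: alignas pads the struct out to a full granule.
struct alignas(kAssumedGranule) PaddedWord {
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedWord) == kAssumedGranule,
              "each word should occupy exactly one assumed granule");
static_assert(alignof(PaddedWord) == kAssumedGranule,
              "each word should start on an assumed granule boundary");

// True when two addresses fall inside the same aligned granule.
inline bool same_granule(const void* p, const void* q) {
    const auto a = reinterpret_cast<std::uintptr_t>(p) / kAssumedGranule;
    const auto b = reinterpret_cast<std::uintptr_t>(q) / kAssumedGranule;
    return a == b;
}
```

If the real granule turns out to be larger than the assumed value, neighbouring `PaddedWord` objects can still share a reservation, which is why the number needs to be documented.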

I disagree. A guarantee that has a time scale beyond human
civilization much less the lifetime of the hardware seems to
have extremely little use. It may be reasonable to assume
reasonable timescales for such guarantees, but a simple
guarantee of eventual completion (if the system is kept
operating) might be given if the profit motive seems sufficient.

(I am not certain if even x86 XLOCK operations are absolutely
guaranteed to complete in the presence of context switches. A
hardware thread might always be interrupted while it is
performing the operation and if the hardware does not delay
interrupt handling until after the operation completes, then the
operation may never complete. This may be so extraordinarily
improbable that an undetected error in ECC-protected memory
might be more likely, in which case it is not really important.)

I think one really wants the time scale explicitly declared as
well as information about the range of latency and causes. Even
5ms latency can seem like forever.

Here's where the deeper problem can rear its ugly head... Vendors
often don't document it? Or they document it inconsistently
across revisions? So even if you do everything right in
principle, you're tuning against a number you had to dig out of
a forum post or reverse engineer yourself. Scary! ;^o

Ugh!

Architecting a lot of such factors might help with documentation
as Architecture is more stable than microarchitecture, but I do
not think typical companies have the incentives for excellence
in documentation. If the only consequence of mistakes in
Architectural documentation is a few software developers
grumbling, keeping even such stable documentation consistent and
correct (and abiding by the old/existing Architectural contract)
seems unlikely to seem important. In fact, if the inability to
optimize forces people to buy more (or more expensive) hardware,
poor documentation can mean higher profits.

It took me more than 35 years to learn how to write µArchitecture
documents such that a malevolent engineer could not misunderstand
what was written and specified. Try it, it is not easy. It is not
something that can be taught, but it is something that diligence
and perseverance can deliver.

Of course, the temptation toward "good enough" (not so bad
that one will lose too many customers) is a problem. I would
expect documented guarantees of sufficient generality to have
the cognitive load for software developers be acceptable. That
such guarantees seem to be very rare is sad.

How many SC failures on a fetch-and-add are acceptable before
you conclude something's fundamentally broken? For me the answer
is: very few.

How many SC failures are acceptable if there are 1024 cores all
going after the same lock ??

Again, I think this is more concerned with "quality of
implementation" (and Architectural guarantees about such) than
with the interface at an instruction level.

Simple... Do NOT allow 1024 cores to hammer a single location!


From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Jun 2 14:20:44 2026

From Newsgroup: comp.arch

On 6/2/2026 2:15 PM, Chris M. Thomasson wrote:

On 6/2/2026 12:36 PM, MitchAlsup wrote:

Paul Clayton <paaronclayton@gmail.com> posted:

On 6/1/26 9:27 PM, MitchAlsup wrote:
[snip]

But note: XADD [...] never causes more than necessary bus traffic

I am skeptical that this is Architecturally guaranteed. It may
fall out of any even semi-sane implementation, in which case
programmers might be willing to take it as guaranteed. Yet I
suspect "sanity" may not be reliable with changing tradeoffs
(including whether protecting a company's reputation has value).

The core is going to package this instruction up and ship it
across the interconnect as a fire-and-forget transaction.

The interconnect is going to route the package towards either a
cache having write permission or a control register.

The cache or control register will perform the packaged calculation
and optionally send back the previous value.

The core receives the optional previous value and the memory-atomic
is complete:: 2 interconnect messages, both smaller than a cache line,
no cache lines are moved, and the calculation cannot fail. The only
failure mode is if the interconnect message fails ECC check in either
direction.

and as an atomic event, never fails, never needs retry, ...

I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
etc.) could provide such guarantees,

If so, you will be surprised when you implement one.

even extending to multiple
contiguous instructions operating on data within an aligned
64-byte region.

Where it becomes cubically harder.

Interestingly, it seems that IBM's z17 is the last
implementation to support constrained transactions. I do wonder
why this feature has been removed from the Architecture.

SW TM wants the TM model to support an unbounded number of memory
elements in the single transaction. HW does not do unbounded.

Constrained transactions had these restrictions (from https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
| - The transaction executes no more than 32 instructions.

I used a timer--to the same ends.

| - All instructions within the transaction must be within 256
| contiguous bytes of storage.

I allow calls to subroutines in the event.

| - The only branches you may use are relative branches that
| branch forward (so there can be no loops).

Loops are OK as long as the timer does not go off.

| - All SS and SSE-format instructions may not be used.

Agreed.

| - Additional general instructions may not be used.

I see no reason to limit general calculations and memory access.

| - The transaction's storage operands may not access more than
| four octowords.

8 cache lines participate, an unbounded number of cache lines
can be accessed as long as participants is no larger than 8.

| - The transaction may not access storage operands in any 4 K-byte
| blocks that contain the 256 bytes of storage beginning
| with the TBEGINC instruction.

Interesting.

| - Operand references must be within a single doubleword,
| except for some of the "multiple" instructions for which the
| limitation is a single octoword.

Any normal memory references to the participating lines.

I think I read that the first implementation made an optimistic
attempt and later (I do not remember if multiple optimistic
attempts were made) a hardware lock was used. Perhaps four
addresses cause too much of a slowdown when there is conflict???

I believe that guaranteeing completion would be substantially
easier with only one aligned 64-byte region. (As I think I
wrote before, adding a single "word" exportable atomic operation
in a different "cache block" _might_ be practical to implement
though I did not have an idea for how software would express such.
I may be wrong that appending such an exportable operation would
not make ensuring completion significantly more difficult.)

If you take the necessary 6 months to slug through all issues
you can find solutions for the disjoint participants to be at
least as large as the outstanding Miss Buffer size (or MB-1).

I think such guaranteed atomic sequences would require a
distinct instruction not only to allow what IBM did (making such
an illegal/faulting instruction) but also to fault when the
instruction is misused since no fallback path is provided.

If you do it right, your architecture sets up failure paths,
so that if failure happens, IP reverts to the failure point
without executing a branch instruction. I have an instruction
that samples 'interference' and changes the failure point as
a necessary addition. Any interrupt or exception transfers
control to failure point before performing exception control
transfer.

There also seem to be other operations that would not (I think)
be exceptionally difficult to guarantee. E.g., swapping cache
blocks might not be much more difficult to guarantee than quick
operations within a single cache block, though I do not know
how useful such an unconditional swap would be. Atomic cache
block copy would seem to be easier (it is similar to a block
zeroing instruction except that the value is taken from a block
that is not writeable by other agents being in exclusive or
shared state). Guaranteeing atomicity for a copy into a cache
block (where two contiguous cache blocks might be in the read
set and the write is only to part of a cache block) seems a
little more complicated.

The thing that makes this so difficult is that most µArchitectures
cannot guarantee that 2 cache lines are ever simultaneously present
in the cache. ASF and ESM have means to do this which greatly
strengthens the guarantee of forward progress.

My 66000 includes priority in memory transactions, and this enables
the cache with write permission to determine to allow the request
or to fail the request (request is at equal or lower priority) thus
allowing the higher priority ATOMIC event to make forward progress
at the expense of the lower priority event.

At certain times the core may be in a position where it can finish
an event if the cache lines can be guaranteed. During this period,
a core can NaK a request so that the event is guaranteed to finish.

With conventional cache coherence, partial writes seem likely to
be complex. If masked cache block updates were possible as an
exportable atomic operation, it might be practical to lock (NAK-
guard) a limited read set and push the update to the owner. I do
not know if such an update independent of previous values in the
written cache block would be useful.

It is much worse than that in practice. The interconnect protocol and
the cache coherence model HAVE to HAVE ATOMIC event forward progress
fully integrated. MESI and MOESI are insufficient here; most directory
coherence protocols are also insufficient.

I am certainly not comfortable thinking about the visibility/
ordering constraints, so my guesses are very wrong about what is
practical to guarantee as atomic.

See Lamport...

Even if an operation can practically be guaranteed, it may not
be worthwhile to provide an interface that allows requesting
such a guaranteed atomic operation.

...

Well, we can do something... we know that lock cmpxchg8b on a 32 bit
system can handle two adjacent cache lines. So, we can try to hold more
than that, but! it's not ideal. For instance my multex can do it and
emulate it. Read all:
https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
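
For anyone who has not seen the hashed-lock trick, here is a minimal sketch of emulating a CAS by hashing the target address into a table of mutexes. This is NOT the multex code from the link; `lock_for`, `emulated_cas`, the table size, and the hash are all made up for illustration. It is only atomic with respect to other accesses that go through the same table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

constexpr std::size_t kLockCount = 64; // arbitrary table size for this sketch

// Maps the address of a target word to one mutex in a fixed table.
inline std::mutex& lock_for(const void* addr) {
    static std::array<std::mutex, kLockCount> table;
    const auto bits = reinterpret_cast<std::uintptr_t>(addr);
    return table[(bits >> 3) % kLockCount]; // drop the low bits of 8-byte words
}

// Emulated compare-and-swap. On failure, `expected` is updated with the value
// that was found, mirroring the usual CAS convention.
inline bool emulated_cas(std::uint64_t* target, std::uint64_t& expected,
                         std::uint64_t desired) {
    std::lock_guard<std::mutex> guard(lock_for(target));
    if (*target == expected) {
        *target = desired;
        return true;
    }
    expected = *target;
    return false;
}
```

Two unrelated words can hash to the same mutex, so this trades some false contention for a bounded amount of lock storage.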

I think that is why AMD allowed for LOCK RMW along with LL/SC?!

From Andy Valencia@vandys@vsta.org to comp.arch on Tue Jun 2 17:11:11 2026

From Newsgroup: comp.arch

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

Now, there was no thought of hundreds (or thousands) of CPU's. But
some of the pessimistic assumptions you might make of LL/SC (at least
as available in MIPS CPU's of that era) might need to be
revisited. Our best analysis said it would scale to very large
(for that time) database workloads.

Finances and other management things cancelled the program. Sequent
eventually went with their NUMA, ultimately being acquired by IBM. We
never found out how that system would've done in the real world.

I seem to remember its code name was "Model R" (RISC).

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
No AI was used in the composition of this message

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Jun 3 18:19:28 2026

From Newsgroup: comp.arch

Paul Clayton <paaronclayton@gmail.com> writes:

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.

Let's see:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@ (fetch-and-add) costs the following numbers of cycles (including
overhead):

   !@   +!@
  7.5   7.3  not atomic
 14.2  13.2  atomic

On a Xeon E-2388G (Rocket Lake):

   !@   +!@
  8.5   7.1  not atomic
 25.8  26.6  atomic

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Jun 3 12:57:42 2026

From Newsgroup: comp.arch

On 6/3/2026 11:19 AM, Anton Ertl wrote:

Paul Clayton <paaronclayton@gmail.com> writes:

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.

Let's see:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@ (fetch-and-add) costs the following numbers of cycles (including
overhead):

!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic

Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW. It's up to the
programmer to make sure that is amortized, distributed in clever ways.
For instance, why use a single atomic counter, vs say using a per thread
counter and summing them when we need to observe the actual count?

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Jun 3 20:53:49 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/3/2026 11:19 AM, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic

Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW.

It's two locations in these benchmarks: X and Y.

It's up to the
programmer to make sure that is amortized, distributed in clever ways.
For instance, why use a single atomic counter, vs say using a per thread
counter and summing them when we need to observe the actual count?

These benchmarks use per-thread storage: They are single-threaded.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Jun 3 15:15:53 2026

From Newsgroup: comp.arch

On 6/3/2026 1:53 PM, Anton Ertl wrote:

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/3/2026 11:19 AM, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic

Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW.

It's two locations in these benchmarks: X and Y.

It's up to the
programmer to make sure that is amortized, distributed in clever ways.
For instance, why use a single atomic counter, vs say using a per thread
counter and summing them when we need to observe the actual count?

These benchmarks use per-thread storage: They are single-threaded.

Humm... I missed that. Anyway, you need to test them multi threaded...
Say our counters are per thread so an increment adds to its per-thread
counter instead of using a LOCK RMW. Then when the counter needs to be
sampled we can start summing up the per thread counts...
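
Something like the following minimal sketch, assuming C++17 and a 64-byte padding size. `ShardedCounter`, `increment`, `sample`, and `run_demo` are hypothetical names, and the scheme assumes each shard has exactly one writer thread. The sum is a snapshot that can be slightly stale while writers are still running.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

class ShardedCounter {
public:
    explicit ShardedCounter(std::size_t shards) : slots_(shards) {}

    // Only the owning thread writes its slot, so a relaxed load plus a
    // relaxed store is enough: no LOCK RMW and no LL/SC retry loop.
    void increment(std::size_t shard) {
        auto& v = slots_[shard].value;
        v.store(v.load(std::memory_order_relaxed) + 1,
                std::memory_order_relaxed);
    }

    // Sums all per-thread counts when the total needs to be observed.
    std::uint64_t sample() const {
        std::uint64_t sum = 0;
        for (const Slot& s : slots_)
            sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }

private:
    // Padded so that two shards do not share an (assumed) 64-byte line.
    struct alignas(64) Slot {
        std::atomic<std::uint64_t> value{0};
    };
    std::vector<Slot> slots_;
};

// Runs `threads` writers, each doing `per_thread` increments on its own
// shard, then samples the total after all writers have joined.
inline std::uint64_t run_demo(std::size_t threads, std::uint64_t per_thread) {
    ShardedCounter counter(threads);
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t)
        pool.emplace_back([&counter, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) counter.increment(t);
        });
    for (auto& th : pool) th.join();
    return counter.sample();
}
```

The increments never contend with each other; the cost moves to `sample()`, which has to touch one line per thread.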


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Jun 4 14:21:16 2026

From Newsgroup: comp.arch

Andy Valencia <vandys@vsta.org> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
SPP. After evaluation, we chose Pentium Pro to build the system
(using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC. SPARC never made it out
of the first evaluation round.

From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Jun 4 10:23:36 2026

From Newsgroup: comp.arch

On 2026-Jun-03 14:19, Anton Ertl wrote:

Paul Clayton <paaronclayton@gmail.com> writes:

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.

Let's see:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic

- anton

On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.
CMPXCHG does not do this - to be atomic it must have a LOCK prefix.


From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Jun 4 10:25:06 2026

From Newsgroup: comp.arch

On 2026-Jun-03 16:53, Anton Ertl wrote:

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/3/2026 11:19 AM, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic

Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW.

It's two locations in these benchmarks: X and Y.

It's up to the
programmer to make sure that is amortized, distributed in clever ways.
For instance, why use a single atomic counter, vs say using a per thread
counter and summing them when we need to observe the actual count?

These benchmarks use per-thread storage: They are single-threaded.

- anton

They might be allocated in the same cache line.


From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Jun 4 21:04:28 2026

From Newsgroup: comp.arch

EricP <ThatWouldBeTelling@thevillage.com> writes:

On 2026-Jun-03 14:19, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

...

On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.

The code for "x !@" is:

mov 0x8(%rbx),%r15
mov %r13,%rax
mov (%r15),%r13
mov %rax,(%r15)

while the code for "x atomic!@" is:

mov %r13,(%r10)
sub $0x8,%r10
mov 0x8(%rbx),%r13
mov 0x8(%r10),%rax
add $0x8,%r10
xchg %rax,0x0(%r13)
mov %rax,%r13

As you can see, there is no XCHG in the !@ code.
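
For readers more at home in C than in Forth, the difference between the two sequences can be sketched as follows. This is only an illustration with hypothetical function names (`plain_exchange`, `atomic_exchange_word`), not the Gforth implementation. The first function is a separate load and store, like the four-MOV sequence; the second uses C11 `atomic_exchange`, for which x86-64 compilers typically emit an XCHG with a memory operand.

```c
#include <stdatomic.h>

/* Non-atomic exchange: a load followed by a store. If another thread
 * stored to *addr between the two accesses, that store would be lost. */
static inline long plain_exchange(long *addr, long newval)
{
    long old = *addr;
    *addr = newval;
    return old;
}

/* Atomic exchange: the read and the write happen as one indivisible
 * operation (sequentially consistent by default in C11). */
static inline long atomic_exchange_word(atomic_long *addr, long newval)
{
    return atomic_exchange(addr, newval);
}
```

In a single-threaded run both return the same values; only the cost and the multi-threaded guarantees differ.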

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Jun 4 18:28:43 2026

From Newsgroup: comp.arch

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

Andy Valencia <vandys@vsta.org> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
SPP. After evaluation, we chose Pentium Pro to build the system
(using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC. SPARC never made it out
of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was pretty
fast for certain worksets and algorithms. RMO mode.

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Jun 4 18:33:41 2026

From Newsgroup: comp.arch

On 6/4/2026 2:04 PM, Anton Ertl wrote:

EricP <ThatWouldBeTelling@thevillage.com> writes:

On 2026-Jun-03 14:19, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

...

On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.

The code for "x !@" is:

mov 0x8(%rbx),%r15
mov %r13,%rax
mov (%r15),%r13
mov %rax,(%r15)

while the code for "x atomic!@" is:

mov %r13,(%r10)
sub $0x8,%r10
mov 0x8(%rbx),%r13
mov 0x8(%r10),%rax
add $0x8,%r10
xchg %rax,0x0(%r13)
mov %rax,%r13

As you can see, there is no XCHG in the !@ code.

How is your data organized? Show me the structure?

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Jun 4 21:20:20 2026

From Newsgroup: comp.arch

On 6/4/2026 2:04 PM, Anton Ertl wrote:

EricP <ThatWouldBeTelling@thevillage.com> writes:

On 2026-Jun-03 14:19, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

...

On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.

The code for "x !@" is:

mov 0x8(%rbx),%r15
mov %r13,%rax
mov (%r15),%r13
mov %rax,(%r15)

while the code for "x atomic!@" is:

mov %r13,(%r10)
sub $0x8,%r10
mov 0x8(%rbx),%r13
mov 0x8(%r10),%rax
add $0x8,%r10
xchg %rax,0x0(%r13)
mov %rax,%r13

As you can see, there is no XCHG in the !@ code.

XCHG does have the implied LOCK as EricP mentioned.

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Jun 4 22:56:47 2026

From Newsgroup: comp.arch

On 6/4/2026 6:33 PM, Chris M. Thomasson wrote:

On 6/4/2026 2:04 PM, Anton Ertl wrote:

EricP <ThatWouldBeTelling@thevillage.com> writes:

On 2026-Jun-03 14:19, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

...

On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.

The code for "x !@" is:

mov 0x8(%rbx),%r15
mov %r13,%rax
mov (%r15),%r13
mov %rax,(%r15)

while the code for "x atomic!@" is:

mov %r13,(%r10)
sub $0x8,%r10
mov 0x8(%rbx),%r13
mov 0x8(%r10),%rax
add $0x8,%r10
xchg %rax,0x0(%r13)
mov %rax,%r13

As you can see, there is no XCHG in the !@ code.

How is your data organized? Show me the structure?

// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};

// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};

Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...
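
A compilable C11 sketch of "pad _and_ align", assuming a 64-byte cache line. The padding size is elided in the pseudocode above, so `CACHE_LINE`, `struct padded_word`, `word_a`, `word_b` and `same_cache_line` are illustrative choices, not the poster's definitions. `alignas` forces the alignment, the explicit padding fills the rest of the line, and the static assertions check both properties at compile time.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* A word that is both aligned to and padded out to one cache line. */
struct padded_word {
    alignas(CACHE_LINE) unsigned long m_data;
    char padding[CACHE_LINE - sizeof(unsigned long)];
};

static_assert(sizeof(struct padded_word) == CACHE_LINE,
              "padded to exactly one cache line");
static_assert(alignof(struct padded_word) == CACHE_LINE,
              "aligned to a cache line boundary");

/* Two independently accessed words that cannot share a line. */
static struct padded_word word_a, word_b;

/* Returns nonzero if both addresses fall in the same cache line. */
static inline int same_cache_line(const void *p, const void *q)
{
    return ((uintptr_t)p / CACHE_LINE) == ((uintptr_t)q / CACHE_LINE);
}
```

Padding alone only guarantees spacing; without the alignment a struct could still start mid-line and share a line with whatever precedes it.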

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 5 09:04:51 2026

From Newsgroup: comp.arch

scott@slp53.sl.home (Scott Lurndal) writes:

I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC.

I remember listening to a presentation by a student of a colleague
about implementing garbage collection for IIRC big SGI machines. In
addition to LL/SC, they had atomic stuff such as fetch-and-add
implemented in the memory subsystem, not in the processor, and that
apparently was needed for contended cases to avoid the round-trip time
through the caches of individual processors. My understanding is
that, while viewed from the perspective of an individual core, the
atomic instructions were slow, the throughput in the contended case
was significantly higher than with LL/SC or an atomic mechanism
implemented in the individual CPUs.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 5 09:12:03 2026

From Newsgroup: comp.arch

EricP <ThatWouldBeTelling@thevillage.com> writes:

These benchmarks use per-thread storage: They are single-threaded.

...

They might be allocated in the same cache line.

Given that they are accessed by the same thread, I don't expect that
to hurt, but I did separate the variables by at least 64 bytes in my
recent runs just in case.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 5 09:14:29 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/4/2026 2:04 PM, Anton Ertl wrote:

EricP <ThatWouldBeTelling@thevillage.com> writes:

On 2026-Jun-03 14:19, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

...

How is your data organized? Show me the structure?

Shown above. Or, in today's testing:

variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 5 10:20:30 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};

// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};

Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...

Why would alignment to cache-line boundaries be necessary?

Anyway, let's see if it makes a difference.

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).

And here are the results (on a Ryzen 8700G):

The cycles per execution of the relevant word for the
no-atomic/no-barrier variants are:

 !@   +!@  barr
 2.4  2.4  1.8   A B C
 2.4  2.4  1.9   D E

For the atomic/barrier variants the cycles are:

 !@   +!@  barr
 9.3  8.3  7.2       A
 9.2  8.3  7.1       B
 9.2  8.3  8.5-11.2  C
 9.3  8.3  9.1-11    D
 9.1  8.3  7.3-11    E

The variations for the barrier column are small for A and B (in the
range 6.9-7.2), and quite a bit larger for C-E, and I have no
explanation for that. The other columns show only small variations.
In any case the aligning and padding recommended by you is not
superior to the original code, which just uses two variables.

Here's the code:

1 [if]
variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !

[else]
: cache-align here dup 64 naligned >align ;
cache-align
here 1 , cache-align here -1 , constant y constant x
[endif]

The part before the [else] is A, comment out "64 allot" for B.

The part after the [else] is D, delete the second CACHE-ALIGN for C,
and replace it with "64 allot" for E.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Jun 5 13:43:11 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

Andy Valencia <vandys@vsta.org> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
SPP. After evaluation, we chose Pentium Pro to build the system
(using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC. SPARC never made it out
of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was pretty
fast for certain worksets and algorithms. RMO mode.

Both technical and business reasons.


From Andy Valencia@vandys@vsta.org to comp.arch on Fri Jun 5 07:07:07 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC. SPARC never made it out
of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was pretty
fast for certain worksets and algorithms. RMO mode.

Sun came through Cisco as well, I don't recall which generation of
chips, but I remember their focus was on the interface to memory
itself, targeting radically reduced latency and much higher bandwidth.
We weren't sure they would get their design out the door, and we were
pretty sure indeed that they wouldn't make a good enough embedded
CPU for our purposes. Too big, too hot, too expensive, and so forth.

At that time (MANY years ago now) Cisco's core router OS was big endian
only. That kept us from considering x86.

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
No AI was used in the composition of this message

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 5 15:11:22 2026

From Newsgroup: comp.arch

On 6/5/2026 3:20 AM, Anton Ertl wrote:

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};

// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};

Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...

Why would alignment to cache-line boundaries be necessary?

Anyway, let's see if it makes a difference.

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).

And here are the results (on a Ryzen 8700G):

The cycles per execution of the relevant word for the
no-atomic/no-barrier variants are:

!@ +!@ barr
2.4 2.4 1.8 A B C
2.4 2.4 1.9 D E

For the atomic/barrier variants the cycles are:

!@ +!@ barr
9.3 8.3 7.2 A
9.2 8.3 7.1 B
9.2 8.3 8.5-11.2 C
9.3 8.3 9.1-11 D
9.1 8.3 7.3-11 E

The variations for the barrier column are small for A and B (in the
range 6.9-7.2), and quite a bit larger for C-E, and I have no
explanation for that. The other columns show only small variations.
In any case the aligning and padding recommended by you is not
superior to the original code, which just uses two variables.

Well, it's mainly for false sharing in a multi-threaded environment. But
it does matter a bit. If your variables straddle a cache line then a
LOCK-prefixed access to them will trigger a bus lock.

Single-threaded: avoid straddling cache line boundaries to prevent bus
locks on LOCK-prefixed instructions.

Multi-threaded: pad and align to prevent false sharing between
independently accessed variables.

For instance you don't want a mutex word to false share with say an
atomic counter that has nothing to do with the mutex. They need to be
padded and aligned...

Here's the code:

1 [if]
variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !

[else]
: cache-align here dup 64 naligned >align ;
cache-align
here 1 , cache-align here -1 , constant y constant x
[endif]

The part before the [else] is A, comment out "64 allot" for B.

The part after the [else] is D, delete the second CACHE-ALIGN for C,
and replace it with "64 allot" for E.


From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 5 15:27:04 2026

From Newsgroup: comp.arch

On 6/5/2026 12:04 AM, Anton Ertl wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

Paul Clayton <paaronclayton@gmail.com> writes:

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.

I have revised the benchmarks as follows: I have added a test of a
memory barrier, which is implemented in GNU C as

__atomic_thread_fence(__ATOMIC_SEQ_CST);

The barriers separate loads.

I have increased the loop count by a factor of 10, because I did not
subtract the startup overhead of Gforth; as a result, the startup
overhead is reduced from 3.3 cycles per execution of the relevant word
to 0.33 cycles.

I have also inserted 64 bytes between the variables, so that they are
in different cache lines. This should not make a difference, because
all accesses are in the same thread (i.e., no cache-ping-pong from
possible false sharing), but just in case.

What I did not do is to use several threads. The idea here is that
programmers will take measures that ensure that contention is rare,
but you still need to use atomic instructions and barriers to ensure
correctness. Ideally in this case the atomic instructions and
barriers have no extra cost, but in reality, they do have extra cost.

Indeed.

[snip results]

Thanks.

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 5 15:40:13 2026

From Newsgroup: comp.arch

On 6/5/2026 12:04 AM, Anton Ertl wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

Paul Clayton <paaronclayton@gmail.com> writes:

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.

I have revised the benchmarks as follows: I have added a test of a
memory barrier, which is implemented in GNU C as

__atomic_thread_fence(__ATOMIC_SEQ_CST);

The barriers separate loads.

[...]

On x86, well, did it fall back to MFENCE? Or use a dummy LOCK RMW on a
per thread stack location? Iirc some compilers would use a dummy. Oh
shit man, 20+ish years ago I was running all sorts of benchmarks on
MFENCE vs LOCK RMW. Or MFENCE vs MEMBAR #StoreLoad | #LoadStore |
#StoreStore | #LoadLoad on the SPARC. I could not really directly test
LOCK RMW wrt x86 on the SPARC because all of the SPARC's atomic RMWs are
naked. I would have to manually add the barriers to make it TSO in RMO
mode.

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 5 15:43:14 2026

From Newsgroup: comp.arch

On 6/5/2026 3:11 PM, Chris M. Thomasson wrote:

On 6/5/2026 3:20 AM, Anton Ertl wrote:

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};

// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};

Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...

Why would alignment to cache-line boundaries be necessary?

Anyway, let's see if it makes a difference.

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).

And here are the results (on a Ryzen 8700G):

The cycles per execution of the relevant word for the
no-atomic/no-barrier variants are:

 !@   +!@  barr
 2.4  2.4  1.8   A B C
 2.4  2.4  1.9   D E

For the atomic/barrier variants the cycles are:

 !@   +!@  barr
 9.3  8.3  7.2       A
 9.2  8.3  7.1       B
 9.2  8.3  8.5-11.2  C
 9.3  8.3  9.1-11    D
 9.1  8.3  7.3-11    E

The variations for the barrier column are small for A and B (in the
range 6.9-7.2), and quite a bit larger for C-E, and I have no
explanation for that. The other columns show only small variations.
In any case the aligning and padding recommended by you is not
superior to the original code, which just uses two variables.

Well, it's mainly for false sharing in a multi-threaded environment. But
it does matter a bit. If your variables straddle a cache line then a
LOCK-prefixed access to them will trigger a bus lock.

Single-threaded: avoid straddling cache line boundaries to prevent bus
locks on LOCK-prefixed instructions.

Actually try to avoid LOCK prefixed anything on single threaded... Even
XCHG has that implied LOCK prefix. :^)

Multi-threaded pad and align to prevent false sharing between
independently accessed variables.

For instance you don't want a mutex word to false share with say an
atomic counter that has nothing to do with the mutex. They need to be
padded and aligned...

Here's the code:

1 [if]
variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !

[else]
: cache-align here dup 64 naligned >align ;
cache-align
here 1 , cache-align here -1 , constant y constant x
[endif]

The part before the [else] is A, comment out "64 allot" for B.

The part after the [else] is D, delete the second CACHE-ALIGN for C,
and replace it with "64 allot" for E.


From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 6 08:14:17 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/5/2026 12:04 AM, Anton Ertl wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I have revised the benchmarks as follows: I have added a test of a
memory barrier, which is implemented in GNU C as

__atomic_thread_fence(__ATOMIC_SEQ_CST);

The barriers separate loads.

[...]

On x86, well, did it fall back to MFENCE? Or use a dummy LOCK RMW on a
per thread stack location?

On AMD64, the latter. The code generated by gcc for the line above
is:

lock orq $0x0,(%rsp)

On ARM A64 gcc generates the following:

dmb ish
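
For context, here is a small C11 sketch of the kind of place such a fence is used: a store that must become visible before a later load of a different location (the StoreLoad case that x86's TSO does not order by itself). The variable and function names (`flag_self`, `flag_other`, `announce_then_check`) are hypothetical. With GCC, `atomic_thread_fence(memory_order_seq_cst)` from `<stdatomic.h>` is the standard spelling of the `__atomic_thread_fence(__ATOMIC_SEQ_CST)` builtin quoted above.

```c
#include <stdatomic.h>

/* Two flags as in a Dekker-style handshake: each thread sets its own
 * flag, then checks the other thread's flag. */
static atomic_int flag_self;
static atomic_int flag_other;

/* Publish our flag, then look at the other one. The full fence keeps
 * the load from being satisfied before the store is globally visible. */
static inline int announce_then_check(void)
{
    atomic_store_explicit(&flag_self, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag_other, memory_order_relaxed);
}
```

According to the post above, gcc turns this fence into `lock orq $0x0,(%rsp)` on AMD64 and `dmb ish` on ARM A64; other compilers or versions may choose differently.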

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 6 08:30:45 2026

From Newsgroup: comp.arch

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

Anyway, let's see if it makes a difference.

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

[...]

And here are the results (on a Ryzen 8700G):

The cycles per execution of the relevant word for the
no-atomic/no-barrier variants are:

!@ +!@ barr
2.4 2.4 1.8 A B C
2.4 2.4 1.9 D E

For the atomic/barrier variants the cycles are:

!@ +!@ barr
9.3 8.3 7.2 A
9.2 8.3 7.1 B
9.2 8.3 8.5-11.2 C
9.3 8.3 9.1-11 D
9.1 8.3 7.3-11 E

The variations for the barrier column are small for A and B (in the
range 6.9-7.2), and quite a bit larger for C-E, and I have no
explanation for that.

Now I have: It's the placement of the native code. If I compile
another definition

: dummy1 swap over 2rot ;

that is never called before all the others, the result for D becomes:

!@ +!@ barr
9.3 8.3 7.2 D

with little variation. So it seems that the code placement of the
bench-barrier word ran into some microarchitectural hiccup of Zen4.

Now that I have that problem worked around, let's see if the data
placement makes a difference:

 !@   +!@  barr
 9.3  8.3  7.2  A
 9.2  8.3  7.1  B
 9.3  8.3  7.0  C
 9.3  8.3  7.2  D
 9.3  8.3  7.2  E

Making them adjacent in the same cache line is not a disadvantage as
long as there is no actual communication going on. Of course, in an
actual application you want them in different cache lines, because
then you will have communication; otherwise using atomic accesses or
barriers would be pointless.

Code (with the data part set up for E):

0 [if]
variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !

[else]
: dummy1 swap over 2rot ;
: cache-align here dup 64 naligned >align ;
cache-align
here 1 , ( cache-align ) 64 allot here -1 , constant y constant x
[endif]

: bench-!@
1 50_000_000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 50_000_000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 50_000_000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 50_000_000 0 do x atomic+!@ y atomic+!@ loop drop ;

: bench-nobarrier
50_000_000 0 do x @ y @ 2drop loop ;

: bench-barrier
50_000_000 0 do x @ barrier y @ barrier 2drop loop ;

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 6 08:49:06 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/5/2026 3:20 AM, Anton Ertl wrote:

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};

// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};

Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...

Why would alignment to cache-line boundaries be necessary?

[...]

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).

...

Well, it's mainly for false sharing in a multi threading environment. But
it does matter a bit. If your variables straddle a cache line then it
will trigger a bus lock.

All of the data placement variants use word-aligned words and thus do
not straddle cache lines. But your claim was that one should use only
the first word in a cache line. Avoiding false sharing is important,
if there is any sharing, but that's not the case for this benchmark.
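
The straddling argument can be checked with a few lines of C. This is a sketch assuming 64-byte cache lines; `CACHE_LINE` and `straddles_cache_line` are hypothetical names. An access straddles a line when its first and last byte fall into different lines, which cannot happen for a naturally aligned 8-byte word because 64 is a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Returns nonzero if an access of `size` bytes (size > 0) starting at
 * `addr` touches more than one cache line. */
static inline int straddles_cache_line(uintptr_t addr, size_t size)
{
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}
```

For example, an 8-byte word at offset 56 occupies bytes 56 to 63 and stays within one line, while the same word at offset 60 would spill into the next line.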

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Jun 6 11:52:09 2026

From Newsgroup: comp.arch

On 6/6/2026 1:49 AM, Anton Ertl wrote:

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/5/2026 3:20 AM, Anton Ertl wrote:

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};

// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};

Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...

Why would alignment to cache-line boundaries be necessary?

[...]

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).

...

Well, it's mainly for false sharing in a multithreaded environment. But
it does matter a bit. If your variables straddle a cache line then it
will trigger a bus lock.

All of the data placement variants use word-aligned words and thus do
not straddle cache lines. But your claim was that one should use only
the first word in a cache line. Avoiding false sharing is important,
if there is any sharing, but that's not the case for this benchmark.

Fair enough! :^) For a single-threaded benchmark with no concurrent
sharing, you are right. The layout variants you described ensure no
single word straddles a cache-line boundary, which completely avoids the
split-access or bus-lock penalty on a single core. In that specific
context, packing things tightly is "superior" because my defensive
padding would just bloat the working set and cause unnecessary cache
misses.

Fwiw, my advice to align and pad so a variable exclusively owns the
first word of a cache line is a habit born entirely out of
multi-threaded, lock/wait-free architecture design.
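
A minimal sketch of that pad-and-align habit in C++, assuming a 64-byte line (the real figure is implementation-specific). `std::uintptr_t` stands in for the `unsigned word` of the structs quoted above, and `kCacheLine` and `Both` are made-up names for illustration:

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines; adjust for the target.
constexpr std::size_t kCacheLine = 64;

// Align: the object starts on a line boundary.
// Pad: the object fills the rest of the line, so nothing else shares it.
struct alignas(kCacheLine) A {
    std::uintptr_t m_data;
    char padding[kCacheLine - sizeof(std::uintptr_t)];
};

struct alignas(kCacheLine) B {
    std::uintptr_t m_data;
    char padding[kCacheLine - sizeof(std::uintptr_t)];
};

// Neighbours inside one object still land in separate lines.
struct Both {
    A a;
    B b;
};

static_assert(sizeof(A) == kCacheLine, "A fills exactly one line");
static_assert(alignof(A) == kCacheLine, "A starts on a line boundary");
static_assert(offsetof(Both, b) == kCacheLine, "b starts in the next line");
```

Strictly speaking, `alignas` already rounds `sizeof` up to a multiple of the alignment, so the explicit array is belt and braces. Padding without alignment, on the other hand, guarantees nothing about where the word sits within a line.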

Actually, there is a fundamental difference in intent:

#1 Word Alignment: Keeps a single access from straddling two cache
lines (split-access penalties). No word bleeds from cache line A into
cache line B.

#2 Cache-Line Alignment + Padding: Keeps different threads on different
cores from causing hardware cache-coherence storms (false sharing). Very
bad!

If struct A and struct B live in the exact same cache line, they are
safe from straddling. But the moment Core 0 writes to A and Core 1
writes to B, the underlying MESI cache-coherence protocol will violently
bounce that single cache line back and forth between L1 caches.
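
To make the ping-pong concrete, here is a hedged sketch (64-byte line assumed, all names invented for illustration). Both layouts produce the same totals; any difference would only show up in timing, and no numbers are claimed here:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

constexpr std::size_t kLineBytes = 64;  // assumption, not queried from hardware

// Adjacent counters: they will typically share one cache line.
struct PackedPair {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter starts its own cache line.
struct PaddedPair {
    alignas(kLineBytes) std::atomic<std::uint64_t> a{0};
    alignas(kLineBytes) std::atomic<std::uint64_t> b{0};
};

// Thread 0 writes only `a`, thread 1 writes only `b`: no logical sharing.
// Expects freshly zeroed counters; returns the sum of both.
template <class Pair>
std::uint64_t hammer(Pair& p, std::uint64_t iterations) {
    auto bump = [iterations](std::atomic<std::uint64_t>& counter) {
        for (std::uint64_t i = 0; i < iterations; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t0([&] { bump(p.a); });
    std::thread t1([&] { bump(p.b); });
    t0.join();
    t1.join();
    // Each thread touched only its own counter.
    assert(p.a.load() == iterations && p.b.load() == iterations);
    return p.a.load() + p.b.load();
}
```

Calling `hammer` on a `PackedPair` and on a `PaddedPair` with the same iteration count and timing both calls is one way to see whether the line bounces on a given machine. With `PackedPair` the two cores may keep invalidating each other's copy of the line even though neither ever reads the other's counter.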

Since your benchmark doesn't have concurrent sharing, you only care
about #1. I default to engineering for #2 defensively because the moment
code scales out to multiple threads, a well-aligned but unpadded
structure can cause performance to fall off a cliff.

Actually, do you remember the thread offset fiasco from Intel? I
remember reading a white paper wrt Hyper-Threading, saying that the
thread stacks should be offset from each other to avoid false sharing.
It was a workaround for a design error, iirc?
--- Synchronet 3.22a-Linux NewsLink 1.2
