Of the possible issues LL/SC might have, did ARM mention the specific
reason they add CAS to the architecture?

jseigh <jseigh_es00@xemaps.com> writes:
> Of the possible issues LL/SC might have, did ARM mention the specific
> reason they add CAS to the architecture?

Scalability.
Moving the contention detection to the cache
is much more bandwidth efficient than swapping cache lines
between a hundred cores.

scott@slp53.sl.home (Scott Lurndal) posted:
> Scalability.
> Moving the contention detection to the cache
> is much more bandwidth efficient than swapping cache lines
> between a hundred cores.

CASs touch the modifiable data fields without write permission,
allowing other cores to touch that data, too. Then, whoever
gets to CAS first {and then gets their CAS addresses to LLC/DRC
first} wins. But you still have the property that only 1 CAS
{in a conflicting group} succeeds.

I think this second point is dependent on your cache coherence
protocol.

With this in mind, My 66000 CCP has the ability to request write
permission on a cache line request, but the other end of the
transaction can refuse to send write permission. So, LL requests
write permission, but the 'system' can send the line read-only.
A core can refuse to pass write permission when it has performed
one or more LLs without having run into the SC.

Given that, one can make LL/SC with the same scaling properties
as CASs.
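
The "only 1 CAS in a conflicting group succeeds" property can be seen from software, independent of where the hardware resolves the conflict. The following is a minimal sketch, assuming standard C++ atomics; the function name `count_cas_winners` and the thread count are made up for illustration and say nothing about any particular coherence protocol.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Launch `contenders` threads that each attempt a single CAS from the same
// expected value (0) to their own nonzero id. Returns how many CASes succeeded.
int count_cas_winners(int contenders, std::atomic<int>& word) {
    assert(contenders > 0);
    std::atomic<int> winners{0};
    std::atomic<bool> go{false};
    std::vector<std::thread> threads;
    for (int id = 1; id <= contenders; ++id) {
        threads.emplace_back([&, id] {
            // Spin until all threads are released together, to provoke contention.
            while (!go.load(std::memory_order_acquire)) {
            }
            int expected = 0;
            if (word.compare_exchange_strong(expected, id)) {
                winners.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    go.store(true, std::memory_order_release);
    for (auto& t : threads) {
        t.join();
    }
    return winners.load();
}
```

Called with a word that starts at 0, the function should report one winner regardless of the thread count, and the word ends up holding the winner's id. The losers see the new value in `expected` and would have to retry in a real algorithm.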

MitchAlsup <user5857@newsgrouper.org.invalid> writes:
> CASs touch the modifiable data fields without write permission,
> allowing other cores to touch that data, too. Then, whoever
> gets to CAS first {and then gets their CAS addresses to LLC/DRC
> first} wins. But you still have the property that only 1 CAS
> {in a conflicting group} succeeds.
> I think this second point is dependent on your cache coherence
> protocol.
> With this in mind, My 66000 CCP has the ability to request write
> permission on a cache line request, but the other end of the
> transaction can refuse to send write permission. So, LL requests
> write permission, but the 'system' can send the line read-only.
> A core can refuse to pass write permission when it has performed
> one or more LLs without having run into the SC.
> Given that, one can make LL/SC with the same scaling properties
> as CASs.

It's still far less convenient to actually use (particularly
when CAS is paired with atomic fetch-and-add, bit-set, bit-clear, et alia instructions).
And why implement both atomics and LL/SC in a new architecture?

On 5/5/2026 7:03 PM, Scott Lurndal wrote:
> It's still far less convenient to actually use (particularly
> when CAS is paired with atomic fetch-and-add, bit-set, bit-clear, et alia
> instructions).
> And why implement both atomics and LL/SC in a new architecture?

I think there is an argument for both, though I am not sure how valid it
is. LL/SC provides a very flexible "framework" for implementing whatever
atomic operation seems right for a particular application, but the atomic
operations are more efficient if you want to do exactly what they do.

Think back to the time when the only atomic operation supported was
essentially test and set, i.e. before CPUs had atomic fetch and add
instructions. If you wanted the functionality of atomic fetch and add,
you would have had to do a TS instruction, followed by a load, then an
add, then a store and finally a clear test and set - five instructions.
Now think about the same thing if you had LL/SC. It would be LL, add,
SC - three instructions. Of course, if you had the atomics, it would be
one instruction. Of course, a similar argument applies to a later
generation of systems with respect to CAS/DCAS.

So, having both gives you better efficiency for the cases where your
requirements are met by the atomic instructions, but the LL/SC gives you
better efficiency when they are not.

Of course, YMMV, and whether it is worth the hardware and design cost of
having both is a separate discussion.
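
The five / three / one instruction comparison can be sketched in portable C++. This is an illustration under assumptions, not a statement about any ISA: standard C++ has no LL/SC, so `compare_exchange_weak` (which is allowed to fail spuriously, much as SC can) stands in for the LL/add/SC loop, and the names `TasCounter`, `fetch_add_retry` and `fetch_add_native` are invented here.

```cpp
#include <atomic>
#include <cassert>

// 1) Test-and-set era: take a lock bit, load, add, store, clear the bit.
struct TasCounter {
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    long value = 0;

    long fetch_add(long delta) {
        while (lock.test_and_set(std::memory_order_acquire)) {
            // spin until the test-and-set bit is ours
        }
        long old = value;                       // load
        value = old + delta;                    // add, then store
        lock.clear(std::memory_order_release);  // clear test and set
        return old;
    }
};

// 2) LL/SC-style retry loop, approximated with a weak CAS.
inline long fetch_add_retry(std::atomic<long>& word, long delta) {
    long old = word.load(std::memory_order_relaxed);         // plays the role of LL
    while (!word.compare_exchange_weak(old, old + delta)) {  // add, then "SC"
        // On failure `old` now holds the current value; just try again.
    }
    return old;
}

// 3) Dedicated atomic: a single operation with no software retry loop.
inline long fetch_add_native(std::atomic<long>& word, long delta) {
    return word.fetch_add(delta);
}
```

All three return the previous value and leave the counter incremented by `delta`. What differs is how much of the work is visible to software, which is the efficiency point made above.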

On 5/5/26 21:38, Stephen Fuld wrote:
> I think there is an argument for both, though I am not sure how valid it
> is. LL/SC provides a very flexible "framework" for implementing whatever
> atomic operation seems right for a particular application, but the atomic
> operations are more efficient if you want to do exactly what they do.
> Think back to the time when the only atomic operation supported was
> essentially test and set, i.e. before CPUs had atomic fetch and add
> instructions. If you wanted the functionality of atomic fetch and add,
> you would have had to do a TS instruction, followed by a load, then an
> add, then a store and finally a clear test and set - five instructions.
> Now think about the same thing if you had LL/SC. It would be LL, add,
> SC - three instructions. Of course, if you had the atomics, it would be
> one instruction. Of course, a similar argument applies to a later
> generation of systems with respect to CAS/DCAS.
> So, having both gives you better efficiency for the cases where your
> requirements are met by the atomic instructions, but the LL/SC gives you
> better efficiency when they are not.
> Of course, YMMV, and whether it is worth the hardware and design cost of
> having both is a separate discussion.

Cavium cnMIPS (OCTEONII/III) implement both. The LL/SC has lower
latency for the uncontested path. The CAS hit the L2. I'd guess the
reasoning was the same, CAS wins for higher core counts and guaranteed
progress?

Kevin Bowling wrote:
> Cavium cnMIPS (OCTEONII/III) implement both. The LL/SC has lower
> latency for the uncontested path. The CAS hit the L2. I'd guess the
> reasoning was the same, CAS wins for higher core counts and guaranteed
> progress?

XADD is the better version of CAS, imho.

It allows actual progress for multiple contending threads, with just the
minimum-possible delay for transferring ownership.

To improve on it, I think you need a distributed arbiter that can handle
multiple incoming requests and send back the correct response to all of
them, with zero actual memory transfers.

I.e. all such semaphore memory addresses would actually end up inside the
arbiter, so that it could respond as if it was just RAM but with far
lower latency.

The question is who makes that kind of memory controller?
Terje
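
The progress argument for XADD can be illustrated with a ticket dispenser: every contender gets a distinct result from one fetch-and-add, with no retry loop. This is only a sketch under assumptions, using `std::atomic::fetch_add` as the portable counterpart of XADD; the function name `tickets_are_unique` and the sizes are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each of `threads` workers draws `per_thread` tickets with fetch_add.
// Returns true if every ticket in [0, threads * per_thread) was handed out
// exactly once, i.e. no update was lost and no contender had to retry.
bool tickets_are_unique(int threads, int per_thread) {
    assert(threads > 0 && per_thread > 0);
    std::atomic<unsigned> next_ticket{0};
    std::vector<std::vector<unsigned>> drawn(static_cast<std::size_t>(threads));
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (int i = 0; i < per_thread; ++i) {
                // One atomic operation per ticket; it cannot "fail".
                drawn[t].push_back(next_ticket.fetch_add(1, std::memory_order_relaxed));
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }

    // Check that the tickets form a permutation of 0 .. total-1.
    std::vector<bool> seen(static_cast<std::size_t>(threads) * per_thread, false);
    for (const auto& mine : drawn) {
        for (unsigned ticket : mine) {
            if (ticket >= seen.size() || seen[ticket]) {
                return false;
            }
            seen[ticket] = true;
        }
    }
    return true;
}
```

A CAS-based dispenser would produce the same set of tickets, but each losing contender would loop and reissue its request, which is the extra traffic the XADD form avoids.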

Terje Mathisen <terje.mathisen@tmsw.no> posted:
> XADD is the better version of CAS, imho.
> It allows actual progress for multiple contending threads, with just the
> minimum-possible delay for transferring ownership.

Under heavy contention, yes;
under light contention, no.
XADD has DRAM+coherence latency minimum.

MitchAlsup <user5857@newsgrouper.org.invalid> writes:
> Under heavy contention, yes;
> under light contention, no.
> XADD has DRAM+coherence latency minimum.

Surely the XADD can be handled by the L2
or L3 (PoC - Point of Coherency) that owns
the line.
Only when caching is disabled for a
particular address, will the XADD need to
be done at the DRAM controller.

On 5/6/26 10:19, Scott Lurndal wrote:
> The cnMIPS-based CN7800 supported up to 48 cores, as did the ARMv8
> CN8800. Both supported cache coherency across multiple sockets
> (up to 4 for the CN7800). The CN8800 implemented the new ARMv8.1
> Large Systems Extension (i.e. atomic instructions) from the start
> as it was realized that the load/store exclusive paradigm in
> ARMv8.0 was a performance limitation with multisocket CN8800
> processors.
>
> Subsequent ARMv8/9 processors in the Octeon family have only supported
> single socket implementations, with large on-chip core counts.
>
> CXL seems to be the wave of the future, and supports the standard PCI
> Express atomic operations, so atomic CPU instructions are definitely a
> better choice than LDREX/STREX on ARM-based (and for that matter Intel)
> processor chips.

I was messing about with lock-free queue implementations and realized
you can do a Michael-Scott lock-free queue implementation using LL/SC,
which is an advantage since you don't need deferred reclamation that
CAS implementations require and which can slow things down. Plus if
reclamation stops because of a stalled thread, is your queue still
lock free?

On 5/6/2026 3:30 PM, jseigh wrote:
> I was messing about with lock-free queue implementations and realized
> you can do a Michael-Scott lock-free queue implementation using LL/SC,
> which is an advantage since you don't need deferred reclamation that
> CAS implementations require and which can slow things down.

How? Say using it raw with dynamic nodes. The user deletes a node. How
does that work without hitting deleted memory? I know that Microsoft
uses SEH in its SLIST impl.

On 5/7/26 00:50, Chris M. Thomasson wrote:
> How? Say using it raw with dynamic nodes. The user deletes a node. How
> does that work without hitting deleted memory? I know that Microsoft
> uses SEH in its SLIST impl.

Dequeuing from head is straightforward. I don't think I need to
describe that.

Enqueuing uses the old Double Tap trick.
1) load locked from tail node next pointer.
2) if null and tail still (double tap) points to node,
   store conditional new node address into the next pointer.

Update tail by doing a load locked on it,
loading the next pointer from tail node, and
updating tail.

It's ok to access deleted memory as long as you don't
actually use it. The old lock-free stack using
DCAS did that when unsuccessfully popping the stack.
It may have had an invalid value but the DCAS will fail.
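
A rough model of those enqueue steps, for readers who want to trace them. Everything here is an assumption made for illustration: standard C++ has no LL/SC, so `LlscCell` emulates one LL/SC-capable word with a mutex and a version counter (real hardware tracks a per-core reservation instead), and `Node`, `Queue`, `enqueue` and `advance_tail` are invented names. Nodes are never freed, so the sketch says nothing about whether touching reclaimed memory is safe.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Emulated LL/SC word: SC succeeds only if no store happened since the LL.
template <typename T>
class LlscCell {
public:
    explicit LlscCell(T initial = T{}) : value_(initial) {}

    T load_linked(std::uint64_t& token) {
        std::lock_guard<std::mutex> guard(m_);
        token = version_;  // remember which version we observed
        return value_;
    }
    bool store_conditional(std::uint64_t token, T desired) {
        std::lock_guard<std::mutex> guard(m_);
        if (token != version_) return false;  // "reservation" was lost
        value_ = desired;
        ++version_;
        return true;
    }
    T load() {
        std::lock_guard<std::mutex> guard(m_);
        return value_;
    }

private:
    std::mutex m_;
    T value_;
    std::uint64_t version_ = 0;
};

struct Node {
    int payload;
    LlscCell<Node*> next{nullptr};
    explicit Node(int p) : payload(p) {}
};

struct Queue {
    LlscCell<Node*> tail;
    explicit Queue(Node* dummy) : tail(dummy) {}
};

// LL tail, read the tail node's next pointer, SC tail forward if there is one.
void advance_tail(Queue& q) {
    std::uint64_t token;
    Node* node = q.tail.load_linked(token);
    Node* next = node->next.load();
    if (next != nullptr) q.tail.store_conditional(token, next);
}

void enqueue(Queue& q, Node* fresh) {
    for (;;) {
        Node* node = q.tail.load();
        std::uint64_t token;
        Node* next = node->next.load_linked(token);      // 1) LL tail->next
        if (next == nullptr && q.tail.load() == node &&  // 2) null + double tap
            node->next.store_conditional(token, fresh)) {
            break;                                       //    SC linked the node
        }
        advance_tail(q);  // tail lagged or the SC lost: help, then retry
    }
    advance_tail(q);
}
```

Starting from a queue holding only a dummy node, two calls to `enqueue` leave the nodes linked in order with tail pointing at the last one. The interesting cases (a node removed between the LL and the double tap) are exactly the ones this emulation does not exercise.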

On 5/7/2026 2:34 PM, jseigh wrote:
> Dequeuing from head is straightforward. I don't think I need to
> describe that.
> Enqueuing uses the old Double Tap trick.
> 1) load locked from tail node next pointer.
> 2) if null and tail still (double tap) points to node,
>    store conditional new node address into the next pointer.
> Updating tail by doing a load locked on it,
> loading the next pointer from tail node, and
> updating tail.
> It's ok to access deleted memory as long as you don't
> actually use it. The old lock-free stack using
> DCAS did that when unsuccessfully popping the stack.
> It may have had an invalid value but the DCAS will fail.

You mean DWCAS? For the Michael-Scott lock-free queue; it did not need
DCAS. I try to separate those in my mind. The only way the Microsoft
SLIST can get away with accessing deleted memory is SEH. What about accessing the next node from deleted memory? LL/SC does not need ABA counters, last time I checked, but it can have an issue with deleted
memory? What am I missing here Joe? I guess it depends on what the underlying allocator actually does with the deleted node...
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
> You mean DWCAS? For the Michael-Scott lock-free queue; it did not need
> DCAS. I try to separate those in my mind. The only way the Microsoft
> SLIST can get away with accessing deleted memory is SEH. What about
> accessing the next node from deleted memory? LL/SC does not need ABA
> counters, last time I checked, but it can have an issue with deleted
> memory? What am I missing here Joe? I guess it depends on what the
> underlying allocator actually does with the deleted node...

a) I can't seem to find a definition of SEH instruction
b) LL/SC is not subject to ABA when SC fails on any control transfer
out of thread {between LL and SC}; and this is one primary reason
that LL/SC can fail spuriously.
c) if deleted memory has been freed--all bets are off.
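
For contrast with point b), here is what the ABA problem looks like with plain CAS, and how a packed modification counter (the "ABA counter" mentioned above) catches it. This is a single-threaded sketch under assumptions: the interleaving is written out by hand, and the names `cas_misses_aba`, `pack`, `tagged_store` and `tagged_cas_detects_aba` are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Plain CAS compares values only, so an A -> B -> A sequence is invisible.
bool cas_misses_aba() {
    std::atomic<std::uint32_t> word{0xA};
    std::uint32_t snapshot = word.load();  // this thread observes A
    word.store(0xB);                       // "another thread": A -> B
    word.store(0xA);                       // ... and back:     B -> A
    return word.compare_exchange_strong(snapshot, 0xC);  // succeeds anyway
}

// Put a version counter in the upper half of a 64-bit word.
std::uint64_t pack(std::uint32_t value, std::uint32_t version) {
    return (static_cast<std::uint64_t>(version) << 32) | value;
}

// Every store bumps the version, so the word never repeats a bit pattern
// (until the 32-bit counter wraps).
void tagged_store(std::atomic<std::uint64_t>& word, std::uint32_t value) {
    std::uint64_t old = word.load();
    while (!word.compare_exchange_weak(
        old, pack(value, static_cast<std::uint32_t>(old >> 32) + 1))) {
    }
}

bool tagged_cas_detects_aba() {
    std::atomic<std::uint64_t> word{pack(0xA, 0)};
    std::uint64_t snapshot = word.load();  // value A, version 0
    tagged_store(word, 0xB);               // value B, version 1
    tagged_store(word, 0xA);               // value A, version 2
    // The value matches but the version does not, so the CAS fails.
    return !word.compare_exchange_strong(snapshot, pack(0xC, 1));
}
```

Both functions return true. With LL/SC the intervening stores would have cleared the reservation, so no counter is needed, though as point c) says, none of this helps once the memory itself has been freed.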
On 5/6/26 10:19 AM, Scott Lurndal wrote:
[snip]
> CXL seems to be the wave of the future, and supports the standard PCI
> Express atomic operations, so atomic CPU instructions are definitely a
> better choice than LDREX/STREX on ARM-based (and for that matter Intel)
> processor chips.

Since idiom recognition allows a software LL/op/SC to be
translated into an internal atomic operation, I do not see
specific atomic instructions as "definitely a better choice".
One could argue that code density, decode simplicity, and
simpler detection of non-support (illegal instruction exception)
give a significant advantage to atomic instructions, but I
do not perceive atomic instructions as obviously better.
Non-support of idiom recognition could be handled either by
architectural requirement (for a version if not so defined
initially) or software probing for the feature. That may be
a sufficient solution for that issue.

Paul Clayton <paaronclayton@gmail.com> writes:
> Since idiom recognition allows a software LL/op/SC to be
> translated into an internal atomic operation, I do not see
> specific atomic instructions as "definitely a better choice".

Please elaborate. There are few restrictions on the instructions
that lie between the LL and SC instructions - I don't see how
any CPU could translate an arbitrary sequence of instructions
between the LL and SC into an atomic bus operation efficiently.

> One could argue that code density, decode simplicity, and
> simpler detection of non-support (illegal instruction exception)
> give a significant advantage to atomic instructions, but I
> do not perceive atomic instructions as obviously better.
> Non-support of idiom recognition could be handled either by
> architectural requirement (for a version if not so defined
> initially) or software probing for the feature. That may be
> a sufficient solution for that issue.

Atomic instructions are already available for all existing
mainstream CPUs, so your idea only applies to new CPU designs.
IMO, LL/SC is an obsolete artifact of the past.

On 5/11/26 10:38 AM, Scott Lurndal wrote:
> Please elaborate. There are few restrictions on the instructions
> that lie between the LL and SC instructions - I don't see how
> any CPU could translate an arbitrary sequence of instructions
> between the LL and SC into an atomic bus operation efficiently.

The hardware implementation can choose which LL/SC-guarded
operations to export,
which to optimize into a fast path within
the processor, and which to treat conventionally. Even in a more
conventional implementation, NAKs or deferred responses might be
used to promote forward progress.
This does require software developers to monitor what
optimizations are implemented, at least if there are
alternatives with possibly more desired performance
characteristics.
In general, if an ISA provides an atomic
operation and LL/SC idioms, they should have very similar
behavior outside of the front end. (There may be an argument
for different behavior such that software might have more
options that may be meaningful in some circumstances.)
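
To make the comparison concrete, here is a minimal C11 sketch of the two
forms being discussed. C has no portable LL/SC, so the retry loop built on
compare_exchange_weak stands in for a software LL/op/SC sequence; the
function names are hypothetical, and how a given compiler lowers either
form (an exclusive load/store loop, a CAS, or a single atomic add) is an
assumption that depends on the target and its flags.

```c
#include <stdatomic.h>

/* Hypothetical helper: fetch-and-add written as a load / op /
 * conditional-store retry loop, the portable stand-in for LL/op/SC. */
static unsigned fetch_add_retry_loop(atomic_uint *p, unsigned n)
{
    unsigned old = atomic_load_explicit(p, memory_order_relaxed);

    /* On failure, compare_exchange_weak writes the current value of *p
     * into `old`, so every retry recomputes old + n from fresh data.
     * The weak form may also fail spuriously, much like an SC can. */
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + n,
               memory_order_seq_cst, memory_order_relaxed)) {
        /* nothing to do: `old` was refreshed by the failed exchange */
    }
    return old; /* value observed just before the successful update */
}

/* Hypothetical helper: the same operation as one atomic read-modify-write,
 * leaving the choice of mechanism entirely to the implementation. */
static unsigned fetch_add_direct(atomic_uint *p, unsigned n)
{
    return atomic_fetch_add_explicit(p, n, memory_order_seq_cst);
}
```

Both functions return the previous value and leave the same result in
memory; what differs is how much of the operation's shape is visible to
the hardware at decode time, which is the point under dispute here.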
One could argue that code density, decode simplicity, and
simpler detection of non-support (illegal instruction exception)
give a significant advantage to atomic instructions, but I
do not perceive atomic instructions as obviously better.
Non-support of idiom recognition could be handled either by
architectural requirement (for a version if not so defined
initially) or software probing for the feature. That may be
a sufficient solution for that issue.
Atomic instructions are already available for all existing
mainstream CPUs, so your idea only applies to new CPU designs.
Even with atomic instructions, I get the impression that the
explicit implementation (performance/scaling) is not
architecturally defined.
An atomic instruction might be
implemented with LL/SC with a guarantee of eventual success
(which would hopefully not be as bad as some x86 global lock for
cache block crossing LOCKed instructions).
(AArch64's STADD does not guarantee that the addition will be
done in the cache hierarchy even on a cache miss. The
architecture merely guarantees that the operation will be
atomic. An implementation could optimistically use an LL/SC-
based mechanism and fall back to locking rather than just
monitoring the reservation to ensure forward progress. With
out-of-order execution, the actual store to shared memory has
to be delayed until it is no longer speculative anyway,
replaying an atomic operation can be faster than a branch
misprediction, and even a branch misprediction can be fast
compared to communication between caches.)
IMO, LL/SC is an obsolete artifact of the past.
I disagree. I _feel_ LL/SC is a nice abstract interface that
not only allows high-performance implementations of simple
atomics without requiring new software but can also (in theory)
be extended to multiple reservations (like My 66000's ESM) and
even to very general transactional memory. (I think a better
interface is possible with easier decode, better code density,
and the opportunity for hints and/or directives, but such would
introduce other costs.)
I see specific atomic operations as somewhat attractive (idiom
recognition is nice but it is not free), but potentially
susceptible to an excessive expansion of instructions. (SIMD
has similar tradeoffs. I like SIMD, but it has issues.)