Chris M. Thomasson wrote:
On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
Perhaps sometime tonight. It seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
LL/SC vs cmpxchg8b?
Arm A64 has LDXP Load Exclusive Pair of registers and
STXP Store Exclusive Pair of registers looks like it can be
equivalent to cmpxchg16b (aka double-wide compare and swap).
On 11/8/2024 6:19 AM, Scott Lurndal wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
A real world example from the linux kernel:
static __always_inline s64
__ll_sc_atomic64_dec_if_positive(atomic64_t *v)
{
s64 result;
unsigned long tmp;
asm volatile("// atomic64_dec_if_positive\n"
" prfm pstl1strm, %2\n"
"1: ldxr %0, %2\n"
" subs %0, %0, #1\n"
" b.lt 2f\n"
" stlxr %w1, %0, %2\n"
" cbnz %w1, 1b\n"
" dmb ish\n"
"dmb ish" is interesting to me for some reason...
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 11/8/2024 6:19 AM, Scott Lurndal wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
A real world example from the linux kernel:
static __always_inline s64
__ll_sc_atomic64_dec_if_positive(atomic64_t *v)
{
s64 result;
unsigned long tmp;
asm volatile("// atomic64_dec_if_positive\n"
" prfm pstl1strm, %2\n"
"1: ldxr %0, %2\n"
" subs %0, %0, #1\n"
" b.lt 2f\n"
" stlxr %w1, %0, %2\n"
" cbnz %w1, 1b\n"
" dmb ish\n"
"dmb ish" is interesting to me for some reason...
Data Memory Barrier - inner shareable coherency domain
It reads better without explanation ...
On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:
It reads better without explanation ...
Reminds me of the “EIEIO” instruction from IBM POWER (or was it only PowerPC).
Can anybody find any other example of any IBM engineer ever having a
sense of humour?
Ever?
Anybody?
EricP <ThatWouldBeTelling@thevillage.com> writes:
Chris M. Thomasson wrote:
On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
Perhaps sometime tonight. It seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
LL/SC vs cmpxchg8b?
Arm A64 has LDXP Load Exclusive Pair of registers and
STXP Store Exclusive Pair of registers looks like it can be
equivalent to cmpxchg16b (aka double-wide compare and swap).
Aarch64 also has CASP, a 128-bit atomic compare and swap
instruction.
Scott Lurndal wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Chris M. Thomasson wrote:
On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
Perhaps sometime tonight. It seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
LL/SC vs cmpxchg8b?
Arm A64 has LDXP Load Exclusive Pair of registers and
STXP Store Exclusive Pair of registers looks like it can be
equivalent to cmpxchg16b (aka double-wide compare and swap).
Aarch64 also has CASP, a 128-bit atomic compare and swap
instruction.
Thanks, I missed that.
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction?
Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
plus all the variations for byte, half, word, and pair,
instead of just the standard LDx and a general data fence instruction?
The execution time of each is the same, and the main cost is the fence synchronizing the Load Store Queue with the cache, flushing the cache
comms queue and waiting for all outstanding cache ops to finish.
On Sun, 10 Nov 2024 21:00:23 +0000, EricP wrote:
Scott Lurndal wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Chris M. Thomasson wrote:
On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
Perhaps sometime tonight. It seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
LL/SC vs cmpxchg8b?
Arm A64 has LDXP Load Exclusive Pair of registers and
STXP Store Exclusive Pair of registers looks like it can be
equivalent to cmpxchg16b (aka double-wide compare and swap).
Aarch64 also has CASP, a 128-bit atomic compare and swap
instruction.
Thanks, I missed that.
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction?
The advantage is consuming OpCode space at breathtaking speed.
Oh wait...
Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
Because the memory model was not built with the notion of memory order
and because not all ATOMIC events start or end with a recognizable
instruction. Having ATOMICs announce their beginning and ending eliminates
the need for fencing; even if you keep a <relatively> relaxed memory
order model.
There are fully atomic instructions, the load/store exclusives are
generally there for backward compatibility with armv7; the full set
of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
ARMv8.1.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Scott Lurndal wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Chris M. Thomasson wrote:
On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
Perhaps sometime tonight. It seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
LL/SC vs cmpxchg8b?
Arm A64 has LDXP Load Exclusive Pair of registers and
STXP Store Exclusive Pair of registers looks like it can be
equivalent to cmpxchg16b (aka double-wide compare and swap).
Aarch64 also has CASP, a 128-bit atomic compare and swap
instruction.
Thanks, I missed that.
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction? Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
plus all the variations for byte, half, word, and pair,
instead of just the standard LDx and a general data fence instruction?
The execution time of each is the same, and the main cost is the fence
synchronizing the Load Store Queue with the cache, flushing the cache
comms queue and waiting for all outstanding cache ops to finish.
"Limited ordering regions allow large systems to perform
special Load-Acquire and Store-Release instructions that
provide order between the memory accesses to a region of
the PA map as observed by a limited set of observers."
Scott Lurndal wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Scott Lurndal wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Chris M. Thomasson wrote:
On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
Perhaps sometime tonight. It seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
LL/SC vs cmpxchg8b?
Arm A64 has LDXP Load Exclusive Pair of registers and
STXP Store Exclusive Pair of registers looks like it can be
equivalent to cmpxchg16b (aka double-wide compare and swap).
Aarch64 also has CASP, a 128-bit atomic compare and swap
instruction.
Thanks, I missed that.
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction? Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
plus all the variations for byte, half, word, and pair,
instead of just the standard LDx and a general data fence instruction?
The execution time of each is the same, and the main cost is the fence
synchronizing the Load Store Queue with the cache, flushing the cache
comms queue and waiting for all outstanding cache ops to finish.
"Limited ordering regions allow large systems to perform
special Load-Acquire and Store-Release instructions that
provide order between the memory accesses to a region of
the PA map as observed by a limited set of observers."
Ok, so that explains LoadLOAcquire, StoreLORelease as they are
functionally different: it needs to associate the fence with specific
load and store addresses so it can determine a physical LORegion,
if any, and thereby limit the scope of the fence actions to that LOR.
But that doesn't explain Load-Acquire, Load-AcquirePC, and Store-Release.
Why attach a specific kind of fence action to the general LD or ST?
They do the same thing in the atomic instructions, eg:
On 11/11/24 08:59, Scott Lurndal wrote:
There are fully atomic instructions, the load/store exclusives are
generally there for backward compatibility with armv7; the full set
of atomics (SWP, CAS, Atomic Arithmetic Ops, etc) arrived with
ARMv8.1.
They added the atomics for scalability allegedly. ARM never
stated what the actual issue was. I suspect they couldn't
guarantee a memory lock size small enough to eliminate
destructive interference. Like cache line size instead
of word size.
aph@littlepinkcloud.invalid writes:
Kent Dickey <kegs@provalid.com> wrote:
So it seems. I think everything in DDI0487J was meant to be there in
DDI0487K, but it looks like it's all been macro-expanded and some
things fell off the page, because reasons.
Between DDI0487G and DDI0487H, they completely rewrote the ARM
using a requirements based description rather than the straightforward
prose in prior editions.
On 2024-11-10, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
On Sun, 10 Nov 2024 01:26:22 +0000, MitchAlsup1 wrote:
It reads better without explanation ...
Reminds me of the “EIEIO” instruction from IBM POWER (or was it only
PowerPC).
Can anybody find any other example of any IBM engineer ever having a sense
of humour? Ever?
One of the resource types in JES2, the batch subsystem for z/OS, is
BERT ("Block Extension Reuse Table") and needs some sizing/tuning by
the sysprog. Not too noticeable as humorous, but for low-level use
from Assembler some of the macros which manipulate them allow you to
(1) copy one into memory, i.e. "Deliver Or Get" a BERT
(2) define a hook to get control when a BERT is released, i.e.
"Do It Later" for a BERT release.
(3) generate a control block for a related data area, i.e. a
"Collector Attribute Table" for BERTs.
These macros are
(1) $DOGBERT
(2) $DILBERT
(3) $CATBERT
On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:
One other thing to be aware of is that the StoreLoad barrier needed
for sequential consistency is logically part of an LDAR, not part of a
STLR. This is an optimization, because the purpose of a StoreLoad in
that situation is to prevent you from seeing your own stores to a
location before everyone else sees them.
Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
store followed by a load to another location to hold. LoadStore is
not strong enough. The SMR algorithm needs that. Iirc, Peterson's
algorithm needs it as well.
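A small C11 sketch of the "store, then load another location" pattern may help here. It is a simplified Dekker-style try-enter, not Peterson's algorithm or SMR itself, and the names flag, try_enter and leave are made up for the example. The seq_cst fence between the store and the load is the StoreLoad point; with only release/acquire ordering, both threads could read the other's flag as 0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int flag[2];   /* flag[i] != 0: thread i wants to enter */

/* Hypothetical helper: announce intent, then look at the other side.
 * Returns true if thread 'me' may enter; false means it backed off. */
static bool try_enter(int me)
{
    assert(me == 0 || me == 1);
    atomic_store_explicit(&flag[me], 1, memory_order_relaxed);
    /* StoreLoad: the store above must be visible before the load below. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&flag[1 - me], memory_order_acquire) != 0) {
        atomic_store_explicit(&flag[me], 0, memory_order_release);
        return false;
    }
    return true;
}

static void leave(int me)
{
    assert(me == 0 || me == 1);
    atomic_store_explicit(&flag[me], 0, memory_order_release);
}
```

With the fence in place, two threads calling try_enter(0) and try_enter(1) concurrently cannot both get true, although both may get false and have to retry. On x86 the fence typically costs an MFENCE or a locked instruction, which is the point being made above.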
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
On 11/12/2024 4:14 AM, aph@littlepinkcloud.invalid wrote:
One other thing to be aware of is that the StoreLoad barrier needed
for sequential consistency is logically part of an LDAR, not part of a
STLR. This is an optimization, because the purpose of a StoreLoad in
that situation is to prevent you from seeing your own stores to a
location before everyone else sees them.
Fwiw, even x86/x64 needs StoreLoad when an algorithm depends on a
store followed by a load to another location to hold. LoadStore is
not strong enough. The SMR algorithm needs that. Iirc, Peterson's
algorithm needs it as well.
That's right, but my point about LDAR on AArch64 is that you can get
sequential consistency without needing a StoreLoad. LDAR can peek
inside the store buffer and, much of the time, determine that it isn't
necessary to do a flush. I don't know if Arm were the first to do
this, but I don't recall seeing it before. It is a brilliant idea.
Andrew.
aph@littlepinkcloud.invalid wrote:
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
That's right, but my point about LDAR on AArch64 is that you can get
sequential consistency without needing a StoreLoad. LDAR can peek
inside the store buffer and, much of the time, determine that it isn't
necessary to do a flush. I don't know if Arm were the first to do
this, but I don't recall seeing it before. It is a brilliant idea.
Isn't this just reusing the normal forwarding network?
If not found, you do as usual and start a regular load operation, but
now you also know that you can skip the flushing of the same?
PS. I do agree that it is a good idea (even patent-worthy?), but not
brilliant since it is so very obvious in hindsight.
To me brilliant is something that still isn't obvious after learning
about it.
On 11/12/24 18:02, aph@littlepinkcloud.invalid wrote:
That's right, but my point about LDAR on AArch64 is that you can get
sequential consistency without needing a StoreLoad. LDAR can peek
inside the store buffer and, much of the time, determine that it isn't
necessary to do a flush. I don't know if Arm were the first to do
this, but I don't recall seeing it before. It is a brilliant idea.
Does ARM use acquire and release differently than everyone else?
I'm not sure where StoreLoad fits in with those.
Do read B2.3 Definition of the Arm memory model. It's only 32 pages,
and very clearly defines the memory model.
Even better, let's look at the actual words for Pick Basic Dependency:
---
Pick Basic Dependency:
There is a Pick Basic dependency from an effect E1 to an effect
E2 if one of the following applies:
1) One of the following applies:
a) E1 is an Explicit Memory Read effect
b) E1 is a Register Read effect
2) One of the following applies:
a) There is a Pick dependency through registers and memory
from E1 to E2
b) E1 and E2 are the same effect
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
aph@littlepinkcloud.invalid wrote:
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
That's right, but my point about LDAR on AArch64 is that you can get
sequential consistency without needing a StoreLoad. LDAR can peek
inside the store buffer and, much of the time, determine that it isn't
necessary to do a flush. I don't know if Arm were the first to do
this, but I don't recall seeing it before. It is a brilliant idea.
Isn't this just reusing the normal forwarding network?
If not found, you do as usual and start a regular load operation, but
now you also know that you can skip the flushing of the same?
Yes. As long as the data in the store buffer doesn't overlap with what
you're about to write, you can skip the flushing.
PS. I do agree that it is a good idea (even patent-worthy?), but not
brilliant since it is so very obvious in hindsight.
LOL! :-)
To me brilliant is something that still isn't obvious after learning
about it.
You have very high standards.
Kent Dickey <kegs@provalid.com> wrote:
Even better, let's look at the actual words for Pick Basic Dependency:
---
Pick Basic Dependency:
There is a Pick Basic dependency from an effect E1 to an effect
E2 if one of the following applies:
1) One of the following applies:
a) E1 is an Explicit Memory Read effect
b) E1 is a Register Read effect
2) One of the following applies:
a) There is a Pick dependency through registers and memory
from E1 to E2
b) E1 and E2 are the same effect
I don't understand this. However, here are the actual words:
Pick Basic dependency
A Pick Basic dependency from a read Register effect or read Memory
effect R1 to a Register effect or Memory effect E2 exists if one
of the following applies:
• There is a Dependency through registers and memory from R1 to E2.
• There is an Intrinsic Control dependency from R1 to E2.
• There is a Pick Basic dependency from R1 to an Effect E3 and
there is a Pick Basic dependency from E3 to E2.
Seems reasonable enough in context, no? It's either a data dependency,
a control dependency, or any transitive combination of them.
Andrew.
So if we were to implement a spinlock using the above instructions
something along the lines of
.L0:
ldaxr -- load lockword exclusive w/ acquire membar
cmp -- compare to zero
bne .L0 -- loop if currently locked
stxr -- store 1
cbnz .L0 -- retry if stxr failed
The "lock" operation has memory order acquire semantics and
we see that in part in the ldaxr but the store isn't part
of that. We could append an additional acquire memory barrier
but would that be necessary?
Loads from the locked critical region could move forward of
the stxr but there's a control dependency from cbnz branch
instruction so they would be speculative loads until the
loop exited.
You'd still potentially have loads before the store of
the lockword but in this case that's not a problem
since it's known the lockword was 0 and no stores
from prior locked code could occur.
This should be analogous to rmw atomics like CAS but
I've no idea what the internal hardware implementations
are. Though on platforms without CAS the C11 atomics
are implemented with LL/SC logic.
Is this sort of what's going on or is the explicit
acquire memory barrier still needed?
Joe Seigh
On Mon, 28 Oct 2024 19:13:03 +0000, jseigh wrote:
My guess is that so few of us understand ARM fence
mechanics that we cannot address the asked question.
On 11/14/2024 1:23 AM, aph@littlepinkcloud.invalid wrote:
Seems reasonable enough in context, no? It's either a data dependency,
a control dependency, or any transitive combination of them.
Data dependencies, as in stronger than a DEC Alpha, which does not honor
data-dependent loads?
In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>, <aph@littlepinkcloud.invalid> wrote:
Kent Dickey <kegs@provalid.com> wrote:
Even better, let's look at the actual words for Pick Basic Dependency:
---
Pick Basic Dependency:
There is a Pick Basic dependency from an effect E1 to an effect
E2 if one of the following applies:
1) One of the following applies:
a) E1 is an Explicit Memory Read effect
b) E1 is a Register Read effect
2) One of the following applies:
a) There is a Pick dependency through registers and memory
from E1 to E2
b) E1 and E2 are the same effect
I don't understand this. However, here are the actual words:
Pick Basic dependency
A Pick Basic dependency from a read Register effect or read Memory
effect R1 to a Register effect or Memory effect E2 exists if one
of the following applies:
. There is a Dependency through registers and memory from R1 to E2.
. There is an Intrinsic Control dependency from R1 to E2.
. There is a Pick Basic dependency from R1 to an Effect E3 and
there is a Pick Basic dependency from E3 to E2.
Seems reasonable enough in context, no? It's either a data dependency,
a control dependency, or any transitive combination of them.
Where did you get that from? I cannot find it in the current Arm document
DDI0487K_a_a-profile-architecture_reference_manual.pdf. Get it from
https://developer.arm.com/documentation/ddi0487/ka/?lang=en
My text for Pick Basic dependency is a quote (where I label the lines
1a, 1b, etc., where it's just bullets in the Arm document) from page
B2-239, middle of the page.
That sort of "summary" was exactly what I was asking for, but I don't see
it, so can you please name the page?
I'm pretty sure there are confusing typos all through this section
(E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
was a doozy.
It's likely the wording was better in an earlier document, I've noticed
this section getting more opaque over time.
On Mon, 28 Oct 2024 15:13:03 -0400, jseigh wrote:
So if we were to implement a spinlock using the above instructions
something along the lines of
.L0:
ldaxr -- load lockword exclusive w/ acquire membar
cmp -- compare to zero
bne .L0 -- loop if currently locked
stxr -- store 1
cbnz .L0 -- retry if stxr failed
The closest I could find to this was on page 8367
of DDI0487G_a_armv8_arm.pdf from infocenter.arm.com:
Loop
LDAXR W5, [X1] ; read lock with acquire
CBNZ W5, Loop ; check if 0
STXR W5, W0, [X1] ; attempt to store new value
CBNZ W5, Loop ; test if store succeeded and retry if not
On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
jseigh <jseigh_es00@xemaps.com> wrote:
So if were to implement a spinlock using the above instructions
something along the lines of
Fwiw, I am basically asking if the "store" stxr has implied acquire
semantics wrt the "load" ldaxr? I am guessing that it does... This would
imply that the acquire membar (#LoadStore | #LoadLoad) would be
respected by the store at stxr wrt its "attached?" load wrt ldaxr?
Is this basically right? Or, what am I missing here? Thanks.
The membar logic wrt acquire needs to occur _after_ the atomic logic
that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
needs to occur _before_ the atomic logic that unlocks said spinlock.
Am I missing anything wrt ARM? ;^o
On 11/8/2024 2:56 PM, Chris M. Thomasson wrote:
Perhaps sometime tonight. It seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
LL/SC vs cmpxchg8b?
On 11/8/2024 2:45 PM, Scott Lurndal wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 11/2/2024 12:10 PM, Chris M. Thomasson wrote:
On 11/1/2024 9:17 AM, aph@littlepinkcloud.invalid wrote:
jseigh <jseigh_es00@xemaps.com> wrote:
So if were to implement a spinlock using the above instructions
something along the lines of
Fwiw, I am basically asking if the "store" stxr has implied acquire
semantics wrt the "load" ldaxr? I am guessing that it does... This would
imply that the acquire membar (#LoadStore | #LoadLoad) would be
respected by the store at stxr wrt its "attached?" load wrt ldaxr?
Is this basically right? Or, what am I missing here? Thanks.
The membar logic wrt acquire needs to occur _after_ the atomic logic
that locks the spinlock. A release barrier (#LoadStore | #StoreStore)
needs to occur _before_ the atomic logic that unlocks said spinlock.
Am I missing anything wrt ARM? ;^o
Did you read the extensive description of memory semantics
in the ARMv8 ARM? See page 275 in DDI0487K_a.
https://developer.arm.com/documentation/ddi0487/ka/?lang=en
I did not! So I am flying mostly blind here. I don't really have any
experience with how ARM handles these types of things. Just guessing
that the store would honor the acquire of the load? Or, does the store
need a membar and the load does not need acquire at all? I know that the
membar should be after the final store that actually locks the spinlock
wrt Joe's example.
I just need to RTFM!!!!
Sorry about that Scott. ;^o
Perhaps sometime tonight. It seems like optimistic LL/SC instead of
pessimistic CAS RMW type of logic?
Kent Dickey <kegs@provalid.com> wrote:
In article <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>,
<aph@littlepinkcloud.invalid> wrote:
Kent Dickey <kegs@provalid.com> wrote:
Even better, let's look at the actual words for Pick Basic Dependency:
That sort of "summary" was exactly what I was asking for, but I don't see it,
so can you please name the page?
B2-174 in DDI0487J
I'm pretty sure there are confusing typos all through this section
(E2 and E3 getting mixed up, for example), but that Pick Basic Dependency
was a doozy.
It's likely the wording was better in an earlier document, I've noticed
this section getting more opaque over time.
So it seems. I think everything in DDI0487J was meant to be there in
DDI0487K, but it looks like it's all been macro-expanded and some
things fell off the page, because reasons.