On 2025-03-15 6:28 p.m., MitchAlsup1 wrote:
On Sat, 15 Mar 2025 20:46:41 +0000, Robert Finch wrote:
On 2025-03-15 1:46 p.m., MitchAlsup1 wrote:
On Sat, 15 Mar 2025 2:10:48 +0000, Robert Finch wrote:
The SoC system is limited to 64 CPU cores by the use of a 6-bit core
number. So, a 64-bit vector containing which cores to send the interrupt
to could be used. A find-first-one responding could be used to mask
things so that only one CPU sees the interrupt.
Consider that you have a pool of 4 cores set up to receive interrupts.
Those 4 cores are running at differing priorities and the interrupt
is at a still different priority. You want the core operating at the
lowest priority (with the right software stack) to accept the
interrupt !!
Okay, I wrote a naive hardware filter for this which should work okay
for small numbers of CPUs but does not scale up very well.
What happens if there is no core ready? Place the IRQ back into the
queue (regenerate the IRQ as an IRQ miss IRQ)? Or just wait for an
available core. I do not like the idea of waiting as it could stall the system.
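A rough C model of the selection being described, purely illustrative (the pool mask, per-core priority array, and function name are assumptions, not any actual controller interface): scan the pool for the core running at the lowest priority below the interrupt's priority, and report a miss when no core qualifies so the IRQ can be requeued.

#include <stdint.h>

#define NO_CORE  -1

/* Illustrative model only: pick the pool core running at the lowest
   priority, provided it is below the interrupt's priority.
   pool      - 64-bit mask of candidate cores
   core_prio - current operating priority of each core
   Returns the chosen core number, or NO_CORE on an "IRQ miss". */
static int pick_core(uint64_t pool, const uint8_t core_prio[64],
                     uint8_t irq_prio)
{
    int best = NO_CORE;
    for (int c = 0; c < 64; c++) {
        if (!(pool & (1ULL << c)))
            continue;                       /* not in the pool          */
        if (core_prio[c] >= irq_prio)
            continue;                       /* already busy at >= prio  */
        if (best == NO_CORE || core_prio[c] < core_prio[best])
            best = c;                       /* lowest priority so far   */
    }
    return best;                            /* NO_CORE => requeue IRQ   */
}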
To identify the CPU pool, with a max of 63 CPUs per interrupt controller,
the pool could be represented directly as a 64-bit word in the interrupt
vector table. I have been thinking of using a 256-bit vector entry
instead of 128 bits. I allocated 96 bits for an instruction / address
field.
MitchAlsup1 wrote:
On Thu, 20 Mar 2025 12:50:09 +0000, Dan Cross wrote:
In article <19ad12aadf9a22b760487418c871b3c6@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
But let me turn it around, what real-world problem does this
interrupt prioritization scheme solve coupled with the inability
to disable interrupts solve?
The OS does not have to "walk" its schedule queues in order to
context switch.
The SVR (SuperVisor Return) instruction checks the priority of
the thread that had been interrupted, and if it is of lower
priority than what is pending in the interrupt table, the
one on the interrupt table receives control while the lower
priority thread remains interrupted.
Thus, the return path from an ISR, just does an SVR. If a
DPC/softIRQ has been setup, or another interrupt has become
pending, HW chooses the path--acting as if HW returned control
to the interrupted thread, and then immediately took the inter-
rupt of highest priority still pending--without having executed
a single instruction in the interrupted thread.
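A C-flavored sketch of the decision SVR is described as making; this is not the actual My 66000 mechanism, and the context layout and pick_highest_pending() helper are invented for illustration only:

/* Hypothetical sketch only, not the real hardware. */
struct context { int priority; /* saved register image, thread state, ... */ };

struct context *svr(struct context *interrupted,
                    struct context *(*pick_highest_pending)(void))
{
    /* highest-priority pending entry in the interrupt table, or NULL */
    struct context *pending = pick_highest_pending();

    if (pending && pending->priority > interrupted->priority)
        return pending;     /* pending work runs; thread stays interrupted */
    return interrupted;     /* otherwise resume the interrupted thread     */
}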
Is anyone complaining that the
ability to disable interrupts is somehow a serious burden?
This thread started with the notion of MSI interrupts and the
inherent need to DI upon arrival at ISR dispatcher. This sub-
thread is investigating whether one can perform ISRs without
disabling interrupts. Thus, making IDisablement an unnecessary
part of interrupt servicing itself.
Your (and EricP)'s contribution is to annotate those instances
where one can arrive at an ISR IE and still want/need to DI and EI
independent of the control transfer from and back to interrupted
thread (even if control arrives reentrantly from interrupted thread).
Supervisor return, whatever it is called - sysret, rei, svr -
is traditionally another point where a number of OS race conditions
can force it to briefly disable interrupts.
The classic case is when a DPC/SoftIrq may be posted by an Interrupt
Service Routine (ISR) and causes the First Level Interrupt Handler
(FLIH),
which is the ISR nested first on the stack of priority interrupts,
to exit into the OS rather than return to its prior thread.
The race condition occurs between the point when the FLIH ISR epilogue
checks for a pending DPC and, not finding one, decides to return
to the prior thread. Between that check and the SVR instruction a higher
priority interrupt occurs and posts a DPC.
Another race condition is how an epilogue knows whether it is the FLIH
or a nested one. A counter for each core, incremented by the ISR prologue
and checked by the epilogue, would do, but what if the nested interrupt
catches a lower level prologue just before it increments the counter?
There are a number of non-interruptible checks and control flag changes
that supervisor return must perform.
To do this I have a single instruction, CSVR (Conditional Supervisor
Return), which handles all interrupt, exception, and SysCall returns.
It performs a model-specific test on an interrupt control register and,
if True, changes the ICR state and returns to the prior context. If
False, it makes other ICR state changes and falls through the CSVR
instruction, where code can determine what just changed. This would
typically be a jump into the OS Dispatcher to flush any DPC/SoftIrq
and check for thread rescheduling.
This whole return sequence is performed without having to disable
interrupts.
Robert Finch <robfi680@gmail.com> writes:
<please trim posts>
Consider that you have a pool of 4 cores setup to receive
interrupts. That those 4 cores are running at differing priorities
and the interrupt is at a still different priority. You want the
core operating at the lowest
priority (with the right software stack) to accept the interrupt
!!
Okay, I wrote a naive hardware filter for this which should work
okay for small numbers of CPUs but does not scale up very well.
What happens if there is no core ready? Place the IRQ back into the
queue (regenerate the IRQ as an IRQ miss IRQ)? Or just wait for an
available core. I do not like the idea of waiting as it could stall
the system.
Use a FIFO to hold up to N pending IRQs. Define a signal that asserts
when the FIFO is non-empty. The CPU can mask the signal to prevent
interruption; when the signal is unmasked, the CPU pops the
first IRQ from the FIFO. Or use a bitmap in flops or SRAM
(prioritization happens in the logic that asserts the fact
that an interrupt is pending to the CPU).
Choose whether you want a FIFO/bitmap per target CPU, or global
pending data with the logic to select the highest priority pending
IRQ when the CPU signals that interrupts are not masked.
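A minimal software model of the FIFO variant described above, with assumed names and sizes; the "pending" signal is simply head != tail, and a core only pops while the signal is unmasked:

#include <stdbool.h>
#include <stdint.h>

#define QDEPTH 16                     /* N pending IRQs (assumed)       */

struct irq_fifo {
    uint16_t q[QDEPTH];               /* pending IRQ numbers            */
    unsigned head, tail;              /* tail == head => empty          */
};

/* "Signal" that asserts while the FIFO is non-empty. */
static bool irq_pending(const struct irq_fifo *f)
{
    return f->head != f->tail;
}

/* Pop the first pending IRQ, only when the CPU has the signal unmasked.
   (Pop side only; the push/producer side is omitted for brevity.)     */
static bool irq_pop(struct irq_fifo *f, bool masked, uint16_t *irq)
{
    if (masked || !irq_pending(f))
        return false;
    *irq = f->q[f->head];
    f->head = (f->head + 1) % QDEPTH;
    return true;
}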
On Mon, 17 Mar 2025 13:38:12 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Robert Finch <robfi680@gmail.com> writes:
<please trim posts>
Consider that you have a pool of 4 cores setup to receive
interrupts. That those 4 cores are running at differing priorities
and the interrupt is at a still different priority. You want the
core operating at the lowest
priority (with the right software stack) to accept the interrupt
!!
Okay, I wrote a naive hardware filter for this which should work
okay for small numbers of CPUs but does not scale up very well.
What happens if there is no core ready? Place the IRQ back into the
queue (regenerate the IRQ as an IRQ miss IRQ)? Or just wait for an
available core. I do not like the idea of waiting as it could stall
the system.
Use a FIFO to hold up to N pending IRQs. Define a signal that asserts
when the FIFO is non-empty. The CPU can mask the signal to prevent
interruption; when the signal is unmasked, the CPU pops the
first IRQ from the FIFO. Or use a bitmap in flops or SRAM
(prioritization happens in the logic that asserts the fact
that an interrupt is pending to the CPU).
Choose whether you want a FIFO/bitmap per target CPU, or global
pending data with the logic to select the highest priority pending
IRQ when the CPU signals that interrupts are not masked.
The problem Robert is talking about arises when there are many
interrupt sources and many target CPUs.
The required routing/prioritization/acknowledgment logic (at least the
naive logic I have in mind) would be either non-scalable or
relatively complicated.
On 2025-03-17 9:38 a.m., Scott Lurndal wrote:
Robert Finch <robfi680@gmail.com> writes:
Been reading up on the RISCV AIA.
For RISCV the CPU core handling the interrupt is indirectly specified by
the I/O address set in the device, as there is an IMSIC for each RISCV
hart. So only a single IMSIC would be responding at that address. It
would be up to software to set the I/O address to pick a particular CPU
core. It seems to me that MSI interrupt handling using AIA would be
heavy-weight. Having a controller for every hart could be a lot of logic
if there are a lot of harts. There may be more than one hart per CPU.
Reading that the I/O addresses need to be translated by an IOMMU. In Q+,
since an interrupt controller number is being used instead of an I/O
address, there is no need to translate the address. Software needs to
determine which group of cores will be handling the interrupt, and
select an interrupt controller appropriately.
The Q+ controller looks a bit like an IMSIC controller as it has pending
and enable bits. I note that there is an enable bit in the vector table
and an enable bit array. So, I assume both bits must be enabled for an
interrupt to occur.
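If that reading is correct, delivery reduces to ANDing the pending bit with both enable bits. A tiny sketch, with field names invented for illustration (not the actual Q+ register layout):

#include <stdbool.h>
#include <stdint.h>

struct vte { unsigned enable : 1; /* other vector-table fields omitted */ };

/* Hypothetical Q+-style check: the vector-table enable bit AND the
   enable-array bit must both be set for a pending interrupt to fire. */
static bool can_deliver(const struct vte *vt, const uint64_t *enable_array,
                        const uint64_t *pending, unsigned irq)
{
    bool en_vec   = vt[irq].enable;
    bool en_array = (enable_array[irq / 64] >> (irq % 64)) & 1;
    bool pend     = (pending[irq / 64]      >> (irq % 64)) & 1;
    return pend && en_vec && en_array;
}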
On Mon, 17 Mar 2025 13:38:12 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Robert Finch <robfi680@gmail.com> writes:
<please trim posts>
Consider that you have a pool of 4 cores setup to receive
interrupts. That those 4 cores are running at differing priorities
and the interrupt is at a still different priority. You want the
core operating at the lowest
priority (with the right software stack) to accept the interrupt
!!
Okay, I wrote a naive hardware filter for this which should work
okay for small numbers of CPUs but does not scale up very well.
What happens if there is no core ready? Place the IRQ back into the
queue (regenerate the IRQ as an IRQ miss IRQ)? Or just wait for an
available core. I do not like the idea of waiting as it could stall
the system.
Use a FIFO to hold up to N pending IRQs. Define a signal that asserts
when the FIFO is non-empty. The CPU can mask the signal to prevent
interruption; when the signal is unmasked, the CPU pops the
first IRQ from the FIFO. Or use a bitmap in flops or SRAM
(prioritization happens in the logic that asserts the fact
that an interrupt is pending to the CPU).
Choose whether you want a FIFO/bitmap per target CPU, or global
pending data with the logic to select the highest priority pending
IRQ when the CPU signals that interrupts are not masked.
The problem Robert is talking about arises when there are many
interrupt sources and many target CPUs.
The required routing/prioritization/acknowledgment logic (at least the
naive logic I have in mind) would be either non-scalable or
relatively complicated. Process of selection for the second case will
take multiple cycles (I am thinking about ring).
Robert Finch <robfi680@gmail.com> writes:
Here's a copy of a slide I did a decade ago for ARM64 (GICv3)
interrupts, not much has changed since.
Interrupt Gates
There are a number of gates that must be open to pass an interrupt
from the source to the destination Core
Michael S wrote:
On Mon, 17 Mar 2025 13:38:12 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
The problem Robert is talking about arises when there are many
interrupt sources and many target CPUs.
The required routing/prioritization/acknowledgment logic (at least the
naive logic I have in mind) would be either non-scalable or
relatively complicated. Process of selection for the second case will
take multiple cycles (I am thinking about ring).
Another problem is what does the core do with the in flight
instructions.
Method 1 is simplest, it injects the interrupt request at Retire
as that's where the state of everything is synchronized.
The consequence is that, like exceptions, the in flight instructions all
get purged, and we save the committed RIP, RSP and interrupt control
word.
While that might be acceptable for a 5 stage in-order pipeline,
it could be pretty expensive for an OoO 200+ instruction queue
potentially tossing hundreds of cycles of near finished work.
Method 2 pipelines the switch by injecting the interrupt request at
Fetch.
Decode converts the request to a special uOp that travels down the IQ
to Retire and allows all the older work to complete.
This is more complex as it requires a two phase hand-off from the
Interrupt Control Unit (ICU) to the core as a branch mispredict in the
in flight instructions might cause a tentative interrupt acceptance to
later be withdrawn.
The ICU believes the core is in a state to accept a higher priority
interrupt. It sends a request to the core, which checks its current
state and sends back either an immediate INT_ACK if it _might_ accept,
stalling Fetch, or a NAK.
When the special uOp reaches Retire, it sends a signal to Fetch which
then sends an INT_ACCEPT signal to ICU to complete the handoff.
If a branch mispredict occurs that causes interrupts to be disabled,
then Fetch sends an INT_REJECT to ICU, and unstalls its fetching.
(Yes that is not optimal - make it work first, make it work well
second.)
This also raises a question about what the ICU is doing during this
long latency handoff. One wouldn't want ICU to sit idle so it might
have to manage the handoff of multiple interrupts to multiple cores
at the same time, each as its own little state machine.
One should see that this decision on how the core handles the
handoff has a large impact on the design complexity of the ICU.
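A toy per-offer state machine for the two-phase handoff sketched above; the state and message names follow the description (INT_ACK, INT_ACCEPT, INT_REJECT), everything else is an assumption:

#include <stdbool.h>

/* Per-offer handoff state, one instance per interrupt the ICU has
   offered to a core. */
enum handoff_state {
    OFFER_IDLE,       /* nothing outstanding                            */
    OFFER_SENT,       /* request sent to core, awaiting ACK/NAK         */
    OFFER_ACKED,      /* core said it _might_ accept; Fetch stalled     */
    OFFER_DONE        /* INT_ACCEPT received, interrupt handed off      */
};

enum core_msg { MSG_INT_ACK, MSG_INT_NAK, MSG_INT_ACCEPT, MSG_INT_REJECT };

/* Advance one offer's state machine on a message from the core.
   Returns true when the ICU should retry this interrupt elsewhere.    */
static bool icu_step(enum handoff_state *s, enum core_msg m)
{
    switch (*s) {
    case OFFER_SENT:
        if (m == MSG_INT_ACK)    { *s = OFFER_ACKED; return false; }
        if (m == MSG_INT_NAK)    { *s = OFFER_IDLE;  return true;  }
        break;
    case OFFER_ACKED:
        if (m == MSG_INT_ACCEPT) { *s = OFFER_DONE;  return false; }
        if (m == MSG_INT_REJECT) { *s = OFFER_IDLE;  return true;  }
        break;
    default:
        break;
    }
    return false;
}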
On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:
Michael S wrote:
On Mon, 17 Mar 2025 13:38:12 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
The problem Robert is talking about arises when there are many
interrupt source and many target CPUs.
The required routing/prioritization/acknowledgment logic (at least a
naive logic I am having in mind) would be either non-scalable or
relatively complicated. Process of selection for the second case will
take multiple cycles (I am thinking about ring).
Another problem is what does the core do with the in flight
instructions.
Method 1 is simplest, it injects the interrupt request at Retire
as that's where the state of everything is synchronized.
The consequence is that, like exceptions, the in flight instructions all
get purged, and we save the committed RIP, RSP and interrupt control
word.
While that might be acceptable for a 5 stage in-order pipeline,
it could be pretty expensive for an OoO 200+ instruction queue
potentially tossing hundreds of cycles of near finished work.
Lowest interrupt Latency
Highest waste of power (i.e., work)
Method 2 pipelines the switch by injecting the interrupt request at
Fetch.
Decode converts the request to a special uOp that travels down the IQ
to Retire and allows all the older work to complete.
This is more complex as it requires a two phase hand-off from the
Interrupt Control Unit (ICU) to the core as a branch mispredict in the
in flight instructions might cause a tentative interrupt acceptance to
later be withdrawn.
Interrupt latency is dependent on executing instructions,
Lowest waste of power
But note: In most cases, it already took the interrupt ~150 nanoseconds
to arrive at the interrupt service port. 1 trip from device to DRAM
(possibly serviced by L3), 1 trip from DRAM back to device, 1 trip from
device to interrupt service port; and 4 DRAM (or L3) accesses to log the
interrupt into the table.
On Thu, 20 Mar 2025 20:25:59 +0000, Dan Cross wrote:
In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
In article <af2d54a7c6c694bf18bcca6e6876a72b@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:
In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
So that you can have some mechanism for building critical
sections that span multiple instructions, where just having
access to atomics and spin locks isn't enough.
You can do this with priority, too.
In the limit, disabling interrupts is just setting the priority
mask to infinity, so in some sense this is de facto true. But I
don't see how having a prioritization scheme is all that useful.
Consider that when you're starting the system up, you need to
do things like set up an interrupt handler vector, and so on;
The Interrupt vector is completely software. HW delivers the MSI-X
message and the service for which this message is for, and SW decides
what to do with this message to that service provider. No HW vectoring
but HW does deliver all the bits needed to guide SW to its goal.
Maybe I'm missing the context here, and apologies if I am. But
which interrupt disable are we talking about here, and where? I
We are talking about the core's Interrupt Disablement and nothing
of the gates farther out in the system that control interrupt
deliveries (or not).
am coming at this from the perspective of systems software
running on some CPU core somewhere; if I want that software to
enter a critical section that cannot be interrupted, then it
seems to me that MSI-X delivery is just a detail at that point.
I might want to prevent IPIs, or delivery or local interrupts
attached directly to the CPU (or something like an IOAPIC or
GIC distributor).
All of those flow through the same mechanism; even a locally
generated timer interrupt.
[snip/]
Sometimes you really don't want to be interrupted.
And sometimes you don't want to be interrupted unless the
"house is on fire"; I cannot see a time when "the house is
on fire" that you would not want to take the interrupt.
Is there one ?!?
Consider a thread that takes a spinlock; suppose some
high-priority interrupt comes in while the thread is holding
that lock. In response to the interrupt, software decides to
suspend the thread and switch some other thread; that thread
wants to lock the spin lock that the now-descheduled thread is
holding: a classic deadlock scenario.
If we can s/taken/attempts/ in the first line of that paragraph::
My architecture has a mechanism to perform ATOMIC stuff over multiple
instruction time frames that has the property that when a higher priority
thread interferes with a lower priority thread, the HPT wins
and the LPT fails its ATOMIC event. It is for exactly this reason
that I drag priority through the memory hierarchy--so that real-time
things remain closer to real-time (no or limited priority inversions).
Mandating to SW that typical SW must use this mechanism will not fly,
at least in the first several iterations of an OS for the system.
So the std. methods must remain available.
AND to a large extent this sub-thread is attempting to get a
usefully large number of them annotated.
A valid response here might be, "don't context switch from the
interrupt handler; use a DPC instead". That may be valid, but
it puts a constraint on the software designer that may be onerous
in practice: suppose the interrupt happens in the scheduler,
while examining a run queue or something. A DPC object must be
available, etc.
This seems to be onerous on SW mainly because of too many unknowns
to track down and deal with.
Further, software must now consider the complexity of
potentially interruptible critical sections. From the
standpoint of reasoning about already-complex concurrency issues
it's simpler to be able to assert that (almost) all interrupt
delivery can be cheaply disabled entirely, save for very
special, specific, scenarios like NMIs. Potentially switching
away from a thread holding a spinlock sort of defeats the
purpose of a spinlock in the first place, which is a mutex
primitive designed to avoid the overhead of switching.
Agreed.
In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
My architecture has a mechanism to perform ATOMIC stuff over multiple
instruction time frames that has the property that a higher priority
thread which interferes with a lower priority thread, the HPT wins
and the LPT fails its ATOMIC event. It is for exactly this reason
that I drag priority through the memory hierarchy--so that real-time
things remain closer to real-time (no or limited priority inversions).
Without being able to see this in practice, it's difficult to
speculate as to how well it will actually work in real-world
scenarios. What is the scope of what's covered by this atomic
thing?
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
In this case, I'm probably going to want to take a lock on the
queue, then lock the list, then pop the first element off of the
queue by taking a pointer to the head.
On Sun, 23 Mar 2025 16:16:16 +0000, EricP wrote:
MitchAlsup1 wrote:
On Thu, 20 Mar 2025 12:50:09 +0000, Dan Cross wrote:
In article <19ad12aadf9a22b760487418c871b3c6@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
But let me turn it around, what real-world problem does this
interrupt prioritization scheme solve coupled with the inability
to disable interrupts solve?
The OS does not have to "walk" its schedule queues in order to
context switch.
The SVR (SuperVisor Return) instruction checks the priority of
the thread that had been interrupted, and if it is of lower
priority than what is pending in the interrupt table, the
one on the interrupt table receives control while the lower
priority thread remains interrupted.
Thus, the return path from an ISR, just does an SVR. If a
DPC/softIRQ has been setup, or another interrupt has become
pending, HW chooses the path--acting as if HW returned control
to the interrupted thread, and then immediately took the inter-
rupt of highest priority still pending--without having executed
a single instruction in the interrupted thread.
Is anyone complaining that the
ability to disable interrupts is somehow a serious burden?
This thread started with the notion of MSI interrupts and the
inherent need to DI upon arrival at ISR dispatcher. This sub-
thread is investigating whether one can perform ISRs without
disabling interrupts. Thus, making IDisablement an unnecessary
part of interrupt servicing itself.
Your (and EricP)'s contribution is to annotate those instances
where one can arrive at an ISR IE and still want/need to DI and EI
independent of the control transfer from and back to interrupted
thread (even if control arrives reentrantly from interrupted thread).
Supervisor return, whatever it is called - sysret, rei, svr -
is traditionally another point where a number of OS race conditions
can force it to briefly disable interrupts.
The classic case is when a DPC/SoftIrq may be posted by an Interrupt
Service Routine (ISR) and causes the First Level Interrupt Handler
(FLIH),
which is the ISR nested first on the stack of priority interrupts,
to exit into the OS rather than return to its prior thread.
The race condition occurs between the point when the FLIH ISR epilogue
checks for a pending DPC and, not finding one, decides to return
to the prior thread. Between that check and the SVR instruction a higher
priority interrupt occurs and posts a DPC.
I remove that race condition by removing the need for the FLIH (which I
call the dispatcher) to check. All the checks are done AT* the SVR
instruction. (*) and during
Another race condition is how an epilogue knows whether it is the FLIH
or a nested one. A counter for each core, incremented by the ISR prologue
and checked by the epilogue, would do, but what if the nested interrupt
catches a lower level prologue just before it increments the counter?
How to know:: It COULD look at the value in its stack pointer. An
empty stack implies it is leaving, a non-empty stack implies it
has more to do.
But in practice, it does not need to know as interrupts are properly
nested and serviced in priority order.
There are a number of non-interruptible checks and control flag changes
that supervisor return must perform.
To do this I have a single instruction, CSVR (Conditional Supervisor
Return), which handles all interrupt, exception, and SysCall returns.
It performs a model-specific test on an interrupt control register and,
if True, changes the ICR state and returns to the prior context. If
False, it makes other ICR state changes and falls through the CSVR
instruction, where code can determine what just changed. This would
typically be a jump into the OS Dispatcher to flush any DPC/SoftIrq
and check for thread rescheduling.
This whole return sequence is performed without having to disable
interrupts.
I, too, have a single instruction (SVR) that delivers control to the
highest priority still pending thread, regardless of whether it is
a new interrupt, a scheduled DPC/softIRQ, a local timer, an IPI, or
the thread the interrupt stack was placed upon.
SVR looks at the priority of the thread to receive control, and the
priorities of pending 'interrupts' (of which DPC/softIRQ and IPIs
are part) and transfers control to the highest priority pending
thread--occasionally that thread is the interrupted one.
Should a higher priority thread become pending while SVR is in progress,
the higher priority thread will receive control before the first
instruction of the mis-receiving thread is performed.
As far as you have explained these things to me, I see no race
conditions
in the architected means.
On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:[...]
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
I have almost that in the documentation for my ATOMIC stuff::
The following code moves a doubly linked element from one place
in a concurrent data structure to another without any interested
3rd party being able to see that the element is now within the
CDS at any time::
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
My architecture has a mechanism to perform ATOMIC stuff over multiple
instruction time frames that has the property that a higher priority
thread which interferes with a lower priority thread, the HPT wins
and the LPT fails its ATOMIC event. It is for exactly this reason
that I drag priority through the memory hierarchy--so that real-time
things remain closer to real-time (no or limited priority inversions).
Without being able to see this in practice, it's difficult to
speculate as to how well it will actually work in real-world
scenarios. What is the scope of what's covered by this atomic
thing?
Sounds a lot like transactional memory. Something that has
yet to prove to be usable in the general case.
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
In this case, I'm probably going to want to take a lock on the
queue, then lock the list, then pop the first element off of the
queue by taking a pointer to the head.
Which can be done with compare and exchange, atomically;
a lock may not be necessary.
The insert into an ordered list will likely require a
spin lock (or a reader-writer lock of some form).
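A sketch of the compare-and-exchange pop being suggested, using C11 atomics; the node layout is assumed and the usual ABA/reclamation caveats are ignored for brevity:

#include <stdatomic.h>
#include <stddef.h>

struct node { struct node *next; /* payload omitted */ };

/* Pop the head of a Treiber-style queue/stack with compare-exchange.
   NOTE: ignores ABA and memory-reclamation issues for brevity.       */
static struct node *pop(struct node *_Atomic *head)
{
    struct node *h = atomic_load(head);
    while (h != NULL &&
           !atomic_compare_exchange_weak(head, &h, h->next))
        ;                       /* h is refreshed on failure; retry    */
    return h;                   /* NULL if the queue was empty         */
}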
In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
Sometimes you really don't want to be interrupted.
And sometimes you don't want to be interrupted unless the
"house is on fire"; I cannot see a time when "the house is
on fire" that you would not want to take the interrupt.
Is there one ?!?
Consider a thread that takes a spinlock; suppose some
high-priority interrupt comes in while the thread is holding
that lock. In response to the interrupt, software decides to
suspend the thread and switch some other thread; that thread
wants to lock the spin lock that the now-descheduled thread is
holding: a classic deadlock scenario.
A valid response here might be, "don't context switch from the
interrupt handler; use a DPC instead". That may be valid, but
it puts a constraint on the software designer that may be onerous
in practice: suppose the interrupt happens in the scheduler,
while examining a run queue or something. A DPC object must be
available, etc.
Further, software must now consider the complexity of
potentially interruptible critical sections. From the
standpoint of reasoning about already-complex concurrency issues
it's simpler to be able to assert that (almost) all interrupt
delivery can be cheaply disabled entirely, save for very
special, specific, scenarios like NMIs. Potentially switching
away from a thread holding a spinlock sort of defeats the
purpose of a spinlock in the first place, which is a mutex
primitive designed to avoid the overhead of switching.
- Dan C.
In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
My architecture has a mechanism to perform ATOMIC stuff over multiple
instruction time frames that has the property that a higher priority
thread which interferes with a lower priority thread, the HPT wins
and the LPT fails its ATOMIC event. It is for exactly this reason
that I drag priority through the memory hierarchy--so that real-time
things remain closer to real-time (no or limited priority inversions).
Without being able to see this in practice, it's difficult to
speculate as to how well it will actually work in real-world
scenarios. What is the scope of what's covered by this atomic
thing?
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
In this case, I'm probably going to want to take a lock on the
queue, then lock the list, then pop the first element off of the
queue by taking a pointer to the head. Then I'll set the head
to the next element of that item, then walk the list, finding
the place to insert (make sure I'm keeping track of the "prev"
pointer somehow), then do the dance to insert the item: set the
item's next to the thing after where I'm inserting, set the
prev's next to the item if it's not nil, or set the list head
to the item. Then I unlock the list and then the queue (suppose
for some reason that it's important that I hold the lock on the
queue until the element is fully added to the list).
This isn't complicated, and for a relatively small list it's not
expensive. Depending on the circumstances, I'd feel comfortable
doing this with only spinlocks protecting both the queue and
the list. But it's more than a two or three instructions, and
doesn't feel like something that would fit well into the "may
fail atomic" model.
Mandating to SW that typ SW must use this mechanism will not fly,
at least in the first several iterations of an OS for the system.
So the std. methods must remain available.
AND to a large extent this sub-thread is attempting to get a
usefully large number of them annotated.
It sounds like the whole model is fairly different, and would
require software writers to consider interrupts occurring in
places that wouldn't have to in a more conventional design. I
suspect you'd have a fairly serious porting issue with existing
systems, but maybe I misunderstand.
A valid response here might be, "don't context switch from the
interrupt handler; use a DPC instead". That may be valid, but
it puts a constraint on the software designer that may be onerous
in practice: suppose the interrupt happens in the scheduler,
while examining a run queue or something. A DPC object must be
available, etc.
This seems to be onerous on SW mainly because of too many unknowns
to track down and deal with.
Yes.
Further, software must now consider the complexity of
potentially interruptible critical sections. From the
standpoint of reasoning about already-complex concurrency issues
it's simpler to be able to assert that (almost) all interrupt
delivery can be cheaply disabled entirely, save for very
special, specific, scenarios like NMIs. Potentially switching
away from a thread holding a spinlock sort of defeats the
purpose of a spinlock in the first place, which is a mutex
primitive designed to avoid the overhead of switching.
Agreed.
The idea of priority as described above seems to prize latency
above other considerations, but it's not clear to me that that
is the right tradeoff. Exactly what problem is being solved?
- Dan C.
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
My architecture has a mechanism to perform ATOMIC stuff over multiple
instruction time frames that has the property that a higher priority
thread which interferes with a lower priority thread, the HPT wins
and the LPT fails its ATOMIC event. It is for exactly this reason
that I drag priority through the memory hierarchy--so that real-time
things remain closer to real-time (no or limited priority inversions).
Without being able to see this in practice, it's difficult to
speculate as to how well it will actually work in real-world
scenarios. What is the scope of what's covered by this atomic
thing?
Sounds a lot like transactional memory. Something that has
yet to prove to be usable in the general case.
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
In this case, I'm probably going to want to take a lock on the
queue, then lock the list, then pop the first element off of the
queue by taking a pointer to the head.
Which can be done with compare and exchange, atomically;
a lock may not be necessary.
The insert into an ordered list will likely require a
spin lock (or a reader-writer lock of some form).
On Thu, 20 Mar 2025 12:50:09 +0000, Dan Cross wrote:
In article <19ad12aadf9a22b760487418c871b3c6@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
But let me turn it around, what real-world problem does this
interrupt prioritization scheme solve coupled with the inability
to disable interrupts solve?
The OS does not have to "walk" its schedule queues in order to
context switch.
The SVR (SuperVisor Return) instruction checks the priority of
the thread that had been interrupted, and if it is of lower
priority than what is pending in the interrupt table, the
one on the interrupt table receives control while the lower
priority thread remains interrupted.
Thus, the return path from an ISR, just does an SVR. If a
DPC/softIRQ has been setup, or another interrupt has become
pending, HW chooses the path--acting as if HW returned control
to the interrupted thread, and then immediately took the inter-
rupt of highest priority still pending--without having executed
a single instruction in the interrupted thread.
Is anyone complaining that the
ability to disable interrupts is somehow a serious burden?
This thread started with the notion of MSI interrupts and the
inherent need to DI upon arrival at ISR dispatcher. This sub-
thread is investigating whether one can perform ISRs without
disabling interrupts. Thus, making IDisablement an unnecessary
part of interrupt servicing itself.
Your (and EricP)'s contribution is to annotate those instances
where one can arrive at an ISR IE and still want/need to DI and EI
independent of the control transfer from and back to interrupted
thread (even if control arrives reentrantly from interrupted thread).
Many computer designers copy/modify their interrupt architectures
from machines architected prior to the virtualization of Operating
Systems, virtualization of interrupts, HyperVisors, Paranoid threads
(like banking applications) that have survived into the present.
I went the other way, trying to develop a system whereby the path
across context switching domains is cheap and easy*. Not needing
to DI at entry to ISR or exception handlers (more or less) fell
out for free due to the way "the rest of the system" evolved.
(*) cheap for HW, and easy expressly for SW
Not being employed, and not having an OS guy to bounce ideas off of
and discuss the consequences of going this way or that way with, I
lean on a few people here to "beat me up" when my ideas lead me astray.
18 months ago I had a real-time version of the architecture,
with the property that at any instant in time, each core was
executing the highest priority thread, cores had a means to
message back and forth as if it were a remote procedure call
which could return (synchronous; or not--asynchronous) without
an OS mediating these things. I abandoned it not because it
was not good for real time, but because there was no way to
make it Linux Friendly.
Where I am now, is my attempt at being Linux friendly.
On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:
In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
My architecture has a mechanism to perform ATOMIC stuff over multiple
instruction time frames that has the property that a higher priority
thread which interferes with a lower priority thread, the HPT wins
and the LPT fails its ATOMIC event. It is for exactly this reason
that I drag priority through the memory hierarchy--so that real-time
things remain closer to real-time (no or limited priority inversions).
Without being able to see this in practice, it's difficult to
speculate as to how well it will actually work in real-world
scenarios. What is the scope of what's covered by this atomic
thing?
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
In this case, I'm probably going to want to take a lock on the
queue, then lock the list, then pop the first element off of the
queue by taking a pointer to the head. Then I'll set the head
to the next element of that item, then walk the list, finding
the place to insert (make sure I'm keeping track of the "prev"
pointer somehow), then do the dance to insert the item: set the
item's next to the thing after where I'm inserting, set the
prev's next to the item if it's not nil, or set the list head
to the item. Then I unlock the list and then the queue (suppose
for some reason that it's important that I hold the lock on the
queue until the element is fully added to the list).
I have almost that in the documentation for my ATOMIC stuff::
The following code moves a doubly linked element from one place
in a concurrent data structure to another without any interested
3rd party being able to see that the element is now within the
CDS at any time::
BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}
The ESM macros simply tell the compiler to set a bit in the LD/ST
instructions. esmINTERFERENCE is a query on the state of the event
and whether a 2nd party has performed a memory action that kills
the event.
It is required to touch every cache line participating in the event
prior to attempting the ATOMIC update--this is what esmLOCKprefetch
does. One can use up to 8 cache lines in a single ATOMIC event.
Either everyone in the system sees the update and at the same
instant in time, or no one sees any update at all.
Technically, there is no one holding a lock ...
This isn't complicated, and for a relatively small list it's not
expensive. Depending on the circumstances, I'd feel comfortable
doing this with only spinlocks protecting both the queue and
the list. But it's more than a two or three instructions, and
doesn't feel like something that would fit well into the "may
fail atomic" model.
Should be obvious how to change the insertion into a singly linked
list in a different data structure.
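For instance, an insertion into a singly linked list might look roughly like the following. This is an illustrative guess written in the same style as the example above, not code from the My 66000 documentation; it assumes the insertion point (prev) was already located outside the event:

/* Hypothetical variant: insert elem after prev in a singly linked list,
   as one small ATOMIC event, assuming prev was found beforehand.      */
BOOLEAN InsertAfter( Element *prev, Element *elem )
{
    Element *next = esmLOCKload( prev->next );
    esmLOCKprefetch( elem );
    if( !esmINTERFERENCE() )
    {
        elem->next = next;
        esmLOCKstore( prev->next, elem );   /* publish the link last   */
        return TRUE;
    }
    return FALSE;                           /* interference: retry/re-walk */
}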
Mandating to SW that typ SW must use this mechanism will not fly,
at least in the first several iterations of an OS for the system.
So the std. methods must remain available.
AND to a large extent this sub-thread is attempting to get a
usefully large number of them annotated.
It sounds like the whole model is fairly different, and would
require software writers to consider interrupts occurring in
places that wouldn't have to in a more conventional design. I
suspect you'd have a fairly serious porting issue with existing
systems, but maybe I misunderstand.
Which is why ID is available.
A valid response here might be, "don't context switch from the
interrupt handler; use a DPC instead". That may be valid, but
it puts a constraint on the software designer that may be onerous
in practice: suppose the interrupt happens in the scheduler,
while examining a run queue or something. A DPC object must be
available, etc.
This seems to be onerous on SW mainly because of too many unknowns
to track down and deal with.
Yes.
Further, software must now consider the complexity of
potentially interruptible critical sections. From the
standpoint of reasoning about already-complex concurrency issues
it's simpler to be able to assert that (almost) all interrupt
delivery can be cheaply disabled entirely, save for very
special, specific, scenarios like NMIs. Potentially switching
away from a thread holding a spinlock sort of defeats the
purpose of a spinlock in the first place, which is a mutex
primitive designed to avoid the overhead of switching.
Agreed.
The idea of priority as described above seems to prize latency
above other considerations, but it's not clear to me that that
is the right tradeoff. Exactly what problem is being solved?
The instruction path into and out of handlers is shorter. Basically
nobody has to look at the schedule queues because HW will do it for
you. For example: GuestOS interrupt dispatcher is approximately::
-- +-------------+-----+------------------------+
R0 | ----------- | ISR | --- MSI-X message ---- |
-- +-------------+-----+------------------------+
Dispatcher:
// at this point we are reentrant
SR R7,R0,<39:32>
CMP R2,R7,MaxTable
PLO R2,TTTF
SR R1,R0,<32:0>
CALX [IP,R7<<3,Table-.]
SVR
// if we get here R0<37:32>
// was out of bounds
This does all of the register file saving and restoring, all of
the DPC/softIRQ stuff, all of the checking for additional ISRs,
IPIs, ... Changes priority, privilege, ROOT pointers and associated
ASIDs from and back to.
MitchAlsup1 wrote:
On Sun, 23 Mar 2025 16:16:16 +0000, EricP wrote:
But all of this adds to the return from interrupt code path length.
------------------
I, too, have a single instruction (SVR) that delivers control to the
highest priority still pending thread, regardless of whether it is
a new interrupt, a scheduled DPC/softIRQ, a local timer, an IPI, or
the thread the interrupt stack was placed upon.
SVR looks at the priority of the thread to receive control, and the
priorities of pending 'interrupts' (of which DPC/softIRQ and IPIs
are part) and transfers control to the highest priority pending
thread--occasionally that thread is the interrupted one.
The new idea I'm proposing is CSVR, *Conditional* Supervisor Return,
which can fall through in the unlikely event something disturbs the
non-interruptible return sequence.
This can be used to optimize the return path by avoiding returning from
an interrupt only to immediately take a priority-1 software interrupt.
Should a higher priority thread become pending while SVR is in progress,
the higher priority thread will receive control before the first
instruction of the mis-receiving thread is performed.
As far as you have explained these things to me, I see no race
conditions
in the architected means.
Perhaps not for your approach but I'm looking at other optimizations
that might introduce them. One was optimizing the common return path:
Before a supervisor return starts to reload the saved register set,
it tests if a DPC is pending and about to generate a software interrupt,
and if so jumps back into the Dispatcher. This saves reloading the
registers only to immediately take another interrupt, a pipeline drain
and fill, and resaving the same registers.
To do this optimization, before it starts to reload the registers it
tests the Interrupt Control Register (ICR) to check if it is about to
trigger a SWI, and if so adjusts the ICR and jumps into the Dispatcher.
Otherwise it restores registers and does a Conditional Supervisor
Return, which re-performs the same test and can fall through if the
test fails.
This also requires that kernel/interrupt stacks be switched a certain
way,
which just so happens to be the same as the x86 way. That is, hardware
only
switches stacks if interrupted mode is User. This leaves it up to OS
software to decide which stack to use at what points inside the kernel.
Also I'd save the interrupt context, the old RIP, old RSP and old
interrupt
status in hardware registers, not on the stack, and the interrupt vector
RIP address is calculated based on the priority. This eliminates all
memory
access from the exception and interrupt entry and exit transition
sequences,
and thereby eliminates page faults and memory error possibility and
cache access latency.
MitchAlsup1 [2025-03-24 17:36:28] wrote:
On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:[...]
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
I have almost that in the documentation for my ATOMIC stuff::
The following code moves a doubly linked element from one place
in a concurrent data structure to another without any interested
3rd party being able to see that the element is now within the
CDS at any time::
Would your "ATOMIC stuff" really handle Dan's example, tho?
A key element in Dan's example is the "ordered singly-linked list",
i.e. you need to walk the list an unbounded number of steps until you
can find the insertion point, so I don't think you can do the whole
thing inside your ATOMIC stuff.
IOW, you'd have to somehow move the list-walk outside of the atomic
section.
E.g. if you know elements are only ever removed from the tail of the
list, you can presumably walk the list outside of the ATOMIC stuff and
then do the "remove from the queue and insert into the list" atomically
since it's a small&finite number of steps (with a test to loop back to
the list-walk if someone else inserted or removed anything at that point
in the mean time).
Stefan
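One way to phrase that retry structure in ordinary C11-atomics terms, where try_splice() stands in for the small bounded atomic step (an esm event or a compare-exchange); the node layout, sentinel head, and the neglect of removals/ABA are all simplifying assumptions:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node { int key; struct node *_Atomic next; };

/* One small, bounded step: link item after prev, or report interference.
   Removals and ABA hazards are ignored to keep the sketch short.        */
static bool try_splice(struct node *prev, struct node *item)
{
    struct node *next = prev->next;            /* snapshot of prev's link */
    if (next && next->key < item->key)
        return false;                          /* someone inserted here   */
    item->next = next;
    return atomic_compare_exchange_strong(&prev->next, &next, item);
}

/* Walk outside the atomic step (head is a dummy/sentinel node), then
   attempt the bounded splice; re-walk whenever the splice is refused.   */
void insert_ordered(struct node *head, struct node *item)
{
    for (;;) {
        struct node *prev = head;
        while (prev->next && prev->next->key < item->key)
            prev = prev->next;
        if (try_splice(prev, item))
            return;
    }
}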
On 3/24/2025 11:49 AM, Dan Cross wrote:
In article <pzdEP.126362$f5K3.26821@fx36.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
My architecture has a mechanism to perform ATOMIC stuff over multiple
instruction time frames that has the property that a higher priority
thread which interferes with a lower priority thread, the HPT wins
and the LPT fails its ATOMIC event. It is for exactly this reason
that I drag priority through the memory hierarchy--so that real-time
things remain closer to real-time (no or limited priority inversions).
Without being able to see this in practice, it's difficult to
speculate as to how well it will actually work in real-world
scenarios. What is the scope of what's covered by this atomic
thing?
Sounds a lot like transactional memory. Something that has
yet to prove to be usable in the general case.
I understand that some mallocs have done it; McRT-Malloc from
Intel, if I'm not mistaken.
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
In this case, I'm probably going to want to take a lock on the
queue, then lock the list, then pop the first element off of the
queue by taking a pointer to the head.
Which can be done with compare and exchange, atomically;
a lock may not be necessary.
The insert into an ordered list will likely require a
spin lock (or a reader-writer lock of some form).
I'm sure one could do a lockless pop using a cmp exchange, but I
wanted a scenario where one would want to hold onto both
locks throughout the critical section for some reason.
I just don't see how this works in the proposed scenario.
You want to hold lock A while lock B is being used? That can be tricky.
Well, there is a locking order.
Dan Cross wrote:
In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
Sometimes you really don't want to be interrupted.
And sometimes you don't want to be interrupted unless the
"house is on fire"; I cannot see a time when "the house is
on fire" that you would not want to take the interrupt.
Is there one ?!?
Consider a thread that takes a spinlock; suppose some
high-priority interrupt comes in while the thread is holding
that lock. In response to the interrupt, software decides to
suspend the thread and switch some other thread; that thread
wants to lock the spin lock that the now-descheduled thread is
holding: a classic deadlock scenario.
Terminology: mutexes coordinate mutual exclusion between threads,
spinlocks coordinate mutual exclusion between cpu cores.
Windows "critical sections" are mutexes with a fast path.
A valid response here might be, "don't context switch from the
interrupt handler; use a DPC instead". That may be valid, but
it puts a constraint on the software designer that may be onerous
in practice: suppose the interrupt happens in the scheduler,
while examining a run queue or something. A DPC object must be
available, etc.
That is exactly what DPC/SoftIrq are designed to do - allow the bulk of
the kernel, like the scheduler, non-paged heap management, IO pre and
post processing, to be interrupted without causing reentrancy.
The higher level device interrupt enqueues a request to the lower
software interrupt level, which is processed when it will not cause
reentrancy.
That constraint is by design and any good OS will crash if violated.
Further, software must now consider the complexity of
potentially interruptible critical sections. From the
standpoint of reasoning about already-complex concurrency issues
it's simpler to be able to assert that (almost) all interrupt
delivery can be cheaply disabled entirely, save for very
special, specific, scenarios like NMIs. Potentially switching
away from a thread holding a spinlock sort of defeats the
purpose of a spinlock in the first place, which is a mutex
primitive designed to avoid the overhead of switching.
This is why I mentioned the terminology thing: threads do not hold
spinlocks, they hold mutexes.
Threads and mutexes are a fiction created
by the OS scheduler, and switching threads while waiting for a mutex
is exactly what they are supposed to do.
Spinlocks are used by cores to coordinate access to shared memory by
things like a OS scheduler looking at list of threads waiting for a mutex.
I like to think of it as all having to do with hats.
The cpu is wearing one of three hats: a thread hat when it pretends to be
a time shared execution machine; a core hat when it acts as a single
execution machine running non-reentrant things like a scheduler which
creates the fiction of threads and processes (multiple virtual spaces);
and an interrupt hat when executing nestable reentrant interrupts.
The interrupt mechanism coordinates exactly when and how we switch hats.
On 3/24/2025 11:49 AM, Dan Cross wrote:
[snip]
I'm sure one could do a lockless pop using a cmp exchange, but I
wanted a scenario where one would want to hold onto both
locks throughout the critical section for some reason.
I just don't see how this works in the proposed scenario.
You want to hold lock A while lock B is being used? That can be tricky.
Well, there is a locking order. Check out this older work of mine, multex:
https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
Not sure if this is relevant or not.
On Mon, 24 Mar 2025 18:02:25 +0000, Stefan Monnier wrote:
MitchAlsup1 [2025-03-24 17:36:28] wrote:
On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:[...]
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
I have almost that in the documentation for my ATOMIC stuff::
The following code moves a doubly linked element from one place
in a concurrent data structure to another without any interested
3rd party being able to see that the element is now within the
CDS at any time::
Would your "ATOMIC stuff" really handle Dan's example, tho?
A key element in Dan's example if the "ordered singly-linked list",
i.e. you need to walk the list an unbounded number of steps until you
can find the insertion point, so I don't think you can do the whole
thing inside your ATOMIC stuff.
Technically, yes, you can put a search loop in an ATOMIC event.
Whether you want to or not is a different story. My example
illustrated the ATOMIC manipulation of the event without a search.
In addition, if the search is long, the code developer should either
do the search before the event, or grab a Mutex to prevent
interference during the search and the rest of the critical region.
IOW, you'd have to somehow move the list-walk outside of the atomic
section.
Yes,
E.g. if you know elements are only ever removed from the tail of the
list, you can presumably walk the list outside of the ATOMIC stuff and
then do the "remove from the queue and insert into the list" atomically
since it's a small&finite number of steps (with a test to loop back to
the list-walk if someone else inserted or removed anything at that point
in the mean time).
Or maintain a tail pointer.
In article <96987d473656f296ec63c30626a46730@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Mon, 24 Mar 2025 18:02:25 +0000, Stefan Monnier wrote:
MitchAlsup1 [2025-03-24 17:36:28] wrote:
On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:[...]
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
I have almost that in the documentation for my ATOMIC stuff::
The following code moves a doubly linked element from one place
in a concurrent data structure to another without any interested
3rd party being able to see that the element is now within the
CDS at any time::
Would your "ATOMIC stuff" really handle Dan's example, tho?
A key element in Dan's example if the "ordered singly-linked list",
i.e. you need to walk the list an unbounded number of steps until you
can find the insertion point, so I don't think you can do the whole
thing inside your ATOMIC stuff.
Technically, yes, you can put a search loop in an ATOOMIC event.
Whether you want to or not is a different story. My example
illustrated the ATOMIC manipulation of the event without a search.
In addition, if the search is long, the code developer should either
do the search before the event, or grab a Mutex to prevent
interference during the search and rest of the critical region.
The problem, as specified, excluded this idea that the list was
long.
The point was simply to show a scenario where your atomic
operation proposal breaks down, and where you might want to
disable interrupt delivery.
The point wasn't to ask whether
the specific problem could be recast in a different way; it
probably can. The question was to ask whether your proposal
could support that specific problem. I don't see how it can.
IOW, you'd have to somehow move the list-walk outside of the atomic
section.
Yes,
E.g. if you know elements are only ever removed from the tail of the
list, you can presumably walk the list outside of the ATOMIC stuff and
then do the "remove from the queue and insert into the list" atomically
since it's a small&finite number of steps (with a test to loop back to
the list-walk if someone else inserted or removed anything at that point in the mean time).
Or maintain a tail pointer.
There's no guarantee that the tail is the correct place to
insert the element, vis the order of the elements.
- Dan C.
In article <kVgEP.1277108$_N6e.605199@fx17.iad>,-------------------
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
This is why I mentioned the terminology thing: threads do not hold spinlocks, they hold mutexes.
See above. Threads can certainly "hold" a spin lock, as they
can hold any kind of lock. To quote from sec 7.6.1 of [Val96],
page 202:
|On a uniprocessor, if a thread tries to acquire a spin lock
|that is already held, it will loop forever. Multiprocessor
|algorithms, however, must operate correctly regardless of the
|number of processors, which means that they should handle the
|uniprocessor case as well. This requires strict adherence to
|the rule that threads not relinquish control of the CPU while
|holding a spin lock.
On Mon, 24 Mar 2025 20:11:46 +0000, Dan Cross wrote:
In article <kVgEP.1277108$_N6e.605199@fx17.iad>,-------------------
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
This is why I mentioned the terminology thing: threads do not hold spinlocks, they hold mutexes.
See above. Threads can certainly "hold" a spin lock, as they
can hold any kind of lock. To quote from sec 7.6.1 of [Val96],
page 202:
|On a uniprocessor, if a thread tries to acquire a spin lock
|that is already held, it will loop forever. Multiprocessor
|algorithms, however, must operate correctly regardless of the
|number of processors, which means that they should handle the
|uniprocessor case as well. This requires strict adherence to
|the rule that threads not relinquish control of the CPU while
|holding a spin lock.
My 66000 ATOMIC stuff is designed such that if control has to leave
the ATOMIC event, all HW knowledge of being in that event disappears
and IP is modified (prior to control transfer) such that upon return
SW knows the event failed (or never started), and that no update of participating data in the event ever becomes visible.
HW can do this with address-based events, but not with data-based
events. LDL-STC are address-based events used to manipulate data-
based events. Mine just allow more LDs and STs in the event than 1.
So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
normal ATOMIC primitives everyone and his brother provide.
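As a rough illustration only (this is not My 66000 code, and
atomic_event_begin/atomic_event_failed are hypothetical stand-ins for
whatever the real encoding provides), the doubly-linked move described
above might be written so that a torn-down event simply retries from a
state no third party can distinguish from "never started":

    struct elem { struct elem *prev, *next; };

    void move_elem(struct elem *e, struct elem *after)
    {
        do {
            atomic_event_begin();           /* hypothetical: open the event */
            e->prev->next = e->next;        /* unlink e */
            e->next->prev = e->prev;
            e->next = after->next;          /* relink e after 'after' */
            e->prev = after;
            after->next->prev = e;
            after->next = e;
        } while (atomic_event_failed());    /* hypothetical: event was backed
                                               out (interrupt etc.); retry   */
    }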
On Mon, 24 Mar 2025 20:28:40 +0000, Dan Cross wrote:
In article <96987d473656f296ec63c30626a46730@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Mon, 24 Mar 2025 18:02:25 +0000, Stefan Monnier wrote:
MitchAlsup1 [2025-03-24 17:36:28] wrote:
On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:[...]
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
I have almost that in the documentation for my ATOMIC stuff::
The following code moves a doubly linked element from one place
in a concurrent data structure to another without any interested
3rd party being able to see that the element is now within the
CDS at any time::
Would your "ATOMIC stuff" really handle Dan's example, tho?
A key element in Dan's example is the "ordered singly-linked list",
i.e. you need to walk the list an unbounded number of steps until you
can find the insertion point, so I don't think you can do the whole
thing inside your ATOMIC stuff.
Technically, yes, you can put a search loop in an ATOMIC event.
Whether you want to or not is a different story. My example
illustrated the ATOMIC manipulation of the event without a search.
In addition, if the search is long, the code developer should either
do the search before the event, or grab a Mutex to prevent
interference during the search and rest of the critical region.
The problem, as specified, excluded this idea that the list was
long.
The point was simply to show a scenario where your atomic
operation proposal breaks down, and where you might want to
disable interrupt delivery.
My ATOMIC stuff allows for the construction, in SW, of pretty
much any ATOMIC primitive anyone would like. It is, in general,
not suited to guard critical regions, but to grab the lock which
does.
Most uses of My ATOMIC stuff are to grab/change/update data
that then guards various critical regions. More like a multi-
valued version of LDL-STC.
The point wasn't to ask whether
the specific problem could be recast in a different way; it
probably can. The question was to ask whether your proposal
could support that specific problem. I don't see how it can.
To the question posed, you should just grab the lock/mutex.
IOW, you'd have to somehow move the list-walk outside of the atomic
section.
Yes,
E.g. if you know elements are only ever removed from the tail of the
list, you can presumably walk the list outside of the ATOMIC stuff and
then do the "remove from the queue and insert into the list" atomically
since it's a small&finite number of steps (with a test to loop back to
the list-walk if someone else inserted or removed anything at that point
in the mean time).
Or maintain a tail pointer.
There's no guarantee that the tail is the correct place to
insert the element, vis the order of the elements.
If the element pointer updates and the tail pointer update are inside
one of my ATOMIC events, then you are sure that the tail pointer is
always pointing at the last element in the list (if there is one).
In article <a86295425dfb043a1b44e550a1c05659@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Mon, 24 Mar 2025 20:11:46 +0000, Dan Cross wrote:
In article <kVgEP.1277108$_N6e.605199@fx17.iad>,-------------------
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
This is why I mentioned the terminology thing: threads do not hold spinlocks, they hold mutexes.
See above. Threads can certainly "hold" a spin lock, as they
can hold any kind of lock. To quote from sec 7.6.1 of [Val96],
page 202:
|On a uniprocessor, if a thread tries to acquire a spin lock
|that is already held, it will loop forever. Multiprocessor
|algorithms, however, must operate correctly regardless of the
|number of processors, which means that they should handle the
|uniprocessor case as well. This requires strict adherence to
|the rule that threads not relinquish control of the CPU while
|holding a spin lock.
My 66000 ATOMIC stuff is designed such that if control has to leave
the ATOMIC event, all HW knowledge of being in that event disappears
and IP is modified (prior to control transfer) such that upon return
SW knows the event failed (or never started), and that no update of participating data in the event ever becomes visible.
HW can do this with address-based events, but not with data-based
events. LDL-STC are address-based events used to manipulate data-
based events. Mine just allow more LDs and STs in the event than 1.
So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
normal ATOMIC primitives everyone and his brother provide.
I definitely do Not want that property. Having a lock
"disappear" on delivery of an interrupt seems like it would be a
great way to introduce silent corruption.
If your atomic primitives are not sufficient to cover an entire
critical section, as I believe we have established, and are
mostly intended to be used for building concurrency primitives
like locks, then I expect that software would take the lock, see
that it succeeded, and enter the critical section.
If at that
point the lock were silently discarded on receipt of an
interrupt,
then potentially some other thread could subsequently
take the lock,
enter the critical section, and observe a data
structure in an inconsistent state (leading to a crash in the
best case).
This is precisely the sort of thing we build mutual
exclusion primitives to wrap around critical sections to avoid.
If, on the other hand, the atomic primitives are just meant for
the spin loop, then I don't really see how that's all that
useful compared to LL/SC.
And you still need some way to ensure
that the critical section is not interrupted.
- Dan C.
It means that if a thread owning a lock is interrupted, that
thread gives the lock back--so there is no priority inversion
and no lock-starvation. Associated with the give back, and the
IP of the interrupted thread no longer points inside the
critical section.
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:
Interrupt latency is dependent on executing instructions,
Lowest waste of power
But note: In most cases, it already took the interrupt ~150 nanoseconds
to arrive at the interrupt service port. 1 trip from device to DRAM
(possibly serviced by L3), 1 trip from DRAM back to device, 1 trip from
device to interrupt service port; and 4 DRAM (or L3) accesses to log
interrupt into table.
How do you handle timer interrupts? Those are generally not
message based, but rather internal signals within the core that
raise an interrupt (perhaps coordinating with the interrupt controller
for priority, security state, and enable bit). These are generally
wired (and the enables/state/priority are flops) for low latency.
There will generally be at least one general purpose timer per
core plus any watchdog timers.
On 2025-03-17 2:33 p.m., EricP wrote:
Another problem is what does the core do with the in flight instructions.
Method 2 pipelines the switch by injecting the interrupt request at
Fetch.
Decode converts the request to a special uOp that travels down the IQ
to Retire and allows all the older work to complete.
This is more complex as it requires a two phase hand-off from the
Interrupt Control Unit (ICU) to the core as a branch mispredict in the
in flight instructions might cause a tentative interrupt acceptance to
later
be withdrawn.
Oh my, I forgot.
Method 2 will be used as it is desired to be able to insert either an instruction at the fetch stage or a vector address.
Would it not be rare for this to occur? If so, I think the pipeline
The ICU believes the core is in a state to accept a higher priority
interrupt. It sends a request to core, which checks its current state and
sends back an immediate INT_ACK if it _might_ accept and stalls Fetch, or
a NAK.
When the special uOp reaches Retire, it sends a signal to Fetch which
then sends an INT_ACCEPT signal to ICU to complete the handoff.
If a branch mispredict occurs that causes interrupts to be disabled,
then Fetch sends an INT_REJECT to ICU, and unstalls its fetching.
(Yes that is not optimal - make it work first, make it work well second.)
This also raises a question about what the ICU is doing during this
long latency handoff. One wouldn't want ICU to sit idle so it might
have to manage the handoff of multiple interrupts to multiple cores
at the same time, each as its own little state machine.
One should see that this decision on how the core handles the
handoff has a large impact on the design complexity of the ICU.
Personally, I would use method 1 first to get something working
then if this is OoO think about getting fancy with method 2.
could just be flushed from the int disabling instruction on. Then the interrupted address recorded as the int disabling instruction. <- this requires identification of an int disabling instruction though. It might
just be a store to a specific I/O address, therefore a special store
opcode to make it easy to detect?
On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:
Highest chance of mucking it all up
Michael S wrote:------------------
On Mon, 17 Mar 2025 13:38:12 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
The problem Robert is talking about arises when there are many
interrupt source and many target CPUs.
The required routing/prioritization/acknowledgment logic (at least a
naive logic I am having in mind) would be either non-scalable or
relatively complicated. Process of selection for the second case will
take multiple cycles (I am thinking about ring).
Another problem is what does the core do with the in flight
instructions.
Method 1 is simplest, it injects the interrupt request at Retire
as that's where the state of everything is synchronized.
The consequence is that, like exceptions, the in flight instructions all
get purged, and we save the committed RIP, RSP and interrupt control
word.
While that might be acceptable for a 5 stage in-order pipeline,
it could be pretty expensive for an OoO 200+ instruction queue
potentially tossing hundreds of cycles of near finished work.
Lowest interrupt Latency
Highest waste of power (i.e., work)
Method 2 pipelines the switch by injecting the interrupt request at
Fetch.
Decode converts the request to a special uOp that travels down the IQ
to Retire and allows all the older work to complete.
This is more complex as it requires a two phase hand-off from the
Interrupt Control Unit (ICU) to the core as a branch mispredict in the
in flight instructions might cause a tentative interrupt acceptance to
later be withdrawn.
Interrupt latency is dependent on executing instructions,
Lowest waste of power
But note: In most cases, it already took the interrupt ~150 nanoseconds
to arrive at the interrupt service port. 1 trip from device to DRAM
(possibly serviced by L3), 1 trip from DRAM back to device, 1 trip from
device to interrupt service port; and 4 DRAM (or L3) accesses to log interrupt into table.
Also, in most cases, the 200-odd instructions in the window will finish
in 100-cycles or as little as 20ns--but if the FDIV unit is saturated, interrupt latency could be as high as 640 cycles and as long as 640ns.
The ICU believes the core is in a state to accept a higher priority
interrupt. It sends a request to core, which checks its current state
and
sends back an immediate INT_ACK if it _might_ accept and stalls Fetch, or a
NAK.
In My 66000, ICU knows nothing about the priority level (or state)
of any core in the system. Instead, when a new higher priority
interrupt is raised, the ISP broadcasts a 64-bit mask indicating
which priority levels in the interrupt table have pending inter-
rupts with an MMI/O message to the address of the interrupt table.
All cores monitoring that interrupt table capture the broadcast,
and each core decides whether to negotiate for an interrupt (not that
specific one) by requesting the highest priority interrupt from the table.
When the request returns, and it is still at a higher priority
than the core is running, core performs interrupt control transfer.
If the interrupt is below the core's priority it is returned to
ISP as if NAKed.
Prior to interrupt control transfer, core remains running what-
ever it was running--and all the interrupt stuff is done by state
machines at the edge of the core and the L3/DRAM controller.
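A behavioural sketch in C of that decision (a software model only, not the
hardware state machines, and the request/NAK helpers are hypothetical): the
core reacts to the broadcast mask, asks for the highest pending priority,
and only takes the control transfer if what comes back still outranks what
it is currently running.

    #include <stdint.h>

    /* hypothetical stand-ins for the MMI/O request/return path */
    extern int  request_highest_pending(uint64_t table);    /* returns prio */
    extern void nak_back_to_isp(uint64_t table, int prio);
    extern void interrupt_control_transfer(uint64_t table, int prio);

    void on_pending_broadcast(uint64_t table, uint64_t pending_mask, int my_prio)
    {
        if (pending_mask == 0)
            return;
        int highest = 63 - __builtin_clzll(pending_mask);  /* highest pending level */
        if (highest <= my_prio)
            return;                          /* nothing outranks current work */

        int granted = request_highest_pending(table);
        if (granted > my_prio)
            interrupt_control_transfer(table, granted);
        else
            nak_back_to_isp(table, granted); /* priority changed meanwhile */
    }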
When the special uOp reaches Retire, it sends a signal to Fetch which
then sends an INT_ACCEPT signal to ICU to complete the handoff.
If a branch mispredict occurs that causes interrupts to be disabled,
then Fetch sends an INT_REJECT to ICU, and unstalls its fetching.
(Yes that is not optimal - make it work first, make it work well
second.)
This also raises a question about what the ICU is doing during this
long latency handoff. One wouldn't want ICU to sit idle so it might
have to manage the handoff of multiple interrupts to multiple cores
at the same time, each as its own little state machine.
One must assume that ISP is capable of taking a new interrupt
from a device every 5-ish cycles and interrupt handoff is in the
range of 50 cycles, and that each interrupt could be to a different
interrupt table.
My 66000 ISP treats successive requests to any one table as strongly
ordered, and requests to different tables as completely unordered.
One should see that this decision on how the core handles the
handoff has a large impact on the design complexity of the ICU.
I did not "see" that in My 66000's interrupt architecture. The ISP
complexity is fixed, and the core's interrupt negotiator is a small
state machine (~10-states).
ISP essentially performs 4-5 64-bit memory accesses, and possibly
1 MMI/O 64-bit broadcast on arrival of MSI-X interrupt. Then if
a core negotiates, it performs 3 more memory accesses per negotiator.
Robert Finch wrote:----------------
On 2025-03-17 2:33 p.m., EricP wrote:
This also raises a question about what the ICU is doing during this
long latency handoff. One wouldn't want ICU to sit idle so it might
have to manage the handoff of multiple interrupts to multiple cores
at the same time, each as its own little state machine.
One should see that this decision on how the core handles the
handoff has a large impact on the design complexity of the ICU.
Personally, I would use method 1 first to get something working
then if this is OoO think about getting fancy with method 2.
Would it not be rare for this to occur? If so, I think the pipeline
could just be flushed from the int disabling instruction on. Then the
interrupted address recorded as the int disabling instruction. <- this
requires identification of an int disabling instruction though. It might
just be a store to a specific I/O address, therefore a special store
opcode to make it easy to detect?
It likely is rare but the possibility that the existing instructions,
a branch mispredict or an exception throw, could retroactively change
the state of the interrupt enable flag to disabled creates this
Window of Uncertainty (WoU) that is the length of the pipeline.
Unless one can find a way to make WoU go away the ICU needs to deal
with the possibility that there could be hundreds of clocks between
when ICU asks a core if it can accept an interrupt and when the core
does.
I had a few ideas walking in the park yesterday. I had been assuming
that
when an exception is delivered it would also automatically disable interrupts. That may not be a good idea as it means that any instruction
that can throw an exception could cause this retroactive disabling.
If I remove that exception side effect then the only things that can manipulate the core's master Interrupt Enable flag are the enable INTEN
and disable INTDI instructions, and the actual delivery of an interrupt.
And those three things can be tracked with a counter in Decode as they
move down the pipeline to Retire. If the counter >0 then there is a
flag change in flight and we don't accept a new interrupt. When count == 0
and Interrupt Enable flag == 1 then core accepts interrupts, which are injected at Decode and cause a light weight purge of the Fetch buffer.
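A small behavioural sketch of that rule (a C model, not RTL; the field
names are invented): Decode bumps a counter for every INTEN/INTDI or
interrupt delivery it sends down the pipe, Retire decrements it, and a new
interrupt is accepted only when the counter is zero and the architectural
enable flag reads 1.

    #include <stdbool.h>

    struct core_state {
        int  ie_changes_in_flight;  /* ++ at Decode, -- at Retire            */
        bool interrupt_enable;      /* architectural flag, updated at Retire */
    };

    bool can_accept_interrupt(const struct core_state *c)
    {
        /* The Window of Uncertainty is closed only when nothing that could
           change the enable flag sits between Decode and Retire. */
        return c->ie_changes_in_flight == 0 && c->interrupt_enable;
    }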
Another pipeline issue is Privileged Control Register (PCR) reads and
writes.
PCR's don't behave like general registers in that they are not renamed,
and in general operations cannot be replayed or performed out of order.
They behave more like atomic stores that must be performed at Retire
and must not be bypassed by any memory and other PCR ops. (This is why
many of the x86/x64 MSR ops are serialized with a pipeline drain.)
This affects what it can do with PCR inside the WoU if it causes a state change that cannot be undone, for example, reading the interrupt data
FIFO.
On 2025-03-18 2:51 p.m., EricP wrote:----------------
Would it not be rare for this to occur? If so, I think the pipeline
could just be flushed from the int disabling instruction on. Then the
interrupted address recorded as the int disabling instruction. <- this
requires identification of an int disabling instruction though. It
might just be a store to a specific I/O address, therefore a special
store opcode to make it easy to detect?
It likely is rare but the possibility that the existing instructions,
a branch mispredict or an exception throw, could retroactively change
the state of the interrupt enable flag to disabled creates this
Window of Uncertainty (WoU) that is the length of the pipeline.
Unless one can find a way to make WoU go away the ICU needs to deal
with the possibility that there could be hundreds of clocks between
when ICU asks a core if it can accept an interrupt and when the core
does.
I had a few ideas walking in the park yesterday. I had been assuming
that
when an exception is delivered it would also automatically disable
interrupts. That may not be a good idea as it means that any instruction
that can throw an exception could cause this retroactive disabling.
If I remove that exception side effect then the only things that can
manipulate the core's master Interrupt Enable flag are the enable INTEN
and disable INTDI instructions, and the actual delivery of an interrupt.
And those three things can be tracked with a counter in Decode as they
move down the pipeline to Retire. If the counter >0 then there is a
flag change in flight and we don't accept a new interrupt. When count == 0
and Interrupt Enable flag == 1 then core accepts interrupts, which are
injected at Decode and cause a light weight purge of the Fetch buffer.
Seems like it could work. There are difficulties using counters with the
Q+ ISA as multiple interrupt enable bits could be manipulated at the
same time. One interrupt may be enabled and another one disabled by the
same instruction.
On 2025-03-18 6:49 p.m., MitchAlsup1 wrote:
<snip>
I am not following this. I think there needs to be some way to disable interrupts that should not be occurring. I think a diagram would help.
+-------+--------+-----+--------+
| fetch | decode | ... | retire |
+-------+--------+-----+--------+
| IRQ^ | .... | ... | INTDI |
+-------+--------+-----+--------+
These two are happening at the same time.
The INTDI (interrupt disable) should effectively
cause the IRQ to be ignored.
Instructions that are fetched after the IRQ
are relying on interrupts disabled.
Suppose there is an interrupt disable instruction in the pipeline that
has not been performed yet
(they are not done until the retire stage in
Q+). So, interrupts are enabled when detected at an earlier pipeline
stage (fetch). It turns out that the interrupt should not have been
fetched from the interrupt queue as interrupts are about to be disabled
by an earlier instruction.
I think this matters because of the purpose
of disabling the interrupts. There could be other instructions after the
int disable including yet to be fetched instructions relying on the fact
that interrupts are disabled.
If the interrupt did not affect other aspects of the instruction being interrupted (maybe it just sets an interrupt bit in the pipeline
register), I think it might be possible to handle this by having an
interrupt enable flag in the pipeline registers.
That way when the
interrupt is disabled, all the interrupt enable flags in the pipeline registers could be cleared. When the disabled interrupt reaches retire,
it would have to be logged (queued) so that it can be executed when interrupts are re-enabled.
It gets worse with branches involved.
On 3/24/2025 1:14 PM, Dan Cross wrote:
In article <vrsaro$1faj3$9@dont-email.me>,
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
On 3/24/2025 11:49 AM, Dan Cross wrote:
[snip]
I'm sure one could do a lockless pop using a cmp exchange, but I
wanted a scenario where one would want to hold onto both
locks throughout the critical section for some reason.
I just don't see how this works in the proposed scenario.
You want to hold lock A while lock B is being used? That can be tricky.
Really? This happens all the time. Of course it can be tricky,
as concurrent programming always is, but it's incredibly normal.
Locking order comes into play here. Uggg... I have seen nightmare code
that was using recursion as well... It deadlocked at a certain depth,
only under the right conditions. The locking order was not properly respected...
On 3/24/2025 7:02 PM, Chris M. Thomasson wrote:
On 3/24/2025 1:14 PM, Dan Cross wrote:
In article <vrsaro$1faj3$9@dont-email.me>,
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
On 3/24/2025 11:49 AM, Dan Cross wrote:
[snip]
I'm sure one could do a lockless pop using a cmp exchange, but I
wanted a scenario where one would want to hold onto both
locks throughout the critical section for some reason.
I just don't see how this works in the proposed scenario.
You want to hold lock A while lock B is being used? That can be tricky.
Really? This happens all the time. Of course it can be tricky,
as concurrent programming always is, but it's incredibly normal.
Locking order comes into play here. Uggg... I have seen nightmare code
that was using recursion as well... It deadlocked at a certain depth,
only under the right conditions. The locking order was not properly
respected...
A thread would take mtxA, then mtxB, while another would take mtxB, then mtxA. Deadlock city is going to be visited rather soon...
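A minimal illustration of the ordering rule that avoids that trap (plain
pthreads; mtxA/mtxB are just placeholders): every thread that needs both
locks takes them in the same A-then-B order, never the reverse.

    #include <pthread.h>

    pthread_mutex_t mtxA = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t mtxB = PTHREAD_MUTEX_INITIALIZER;

    void both_locked_work(void)
    {
        pthread_mutex_lock(&mtxA);      /* always A before B */
        pthread_mutex_lock(&mtxB);
        /* ... critical section touching data guarded by both ... */
        pthread_mutex_unlock(&mtxB);
        pthread_mutex_unlock(&mtxA);
    }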
On 3/24/2025 1:28 PM, Dan Cross wrote:
In article <96987d473656f296ec63c30626a46730@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Mon, 24 Mar 2025 18:02:25 +0000, Stefan Monnier wrote:
MitchAlsup1 [2025-03-24 17:36:28] wrote:
On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:[...]
Consider something as simple as popping an item off of the front
of a queue and inserting it into an ordered singly-linked list:
I have almost that in the documentation for my ATOMIC stuff::
The following code moves a doubly linked element from one place
in a concurrent data structure to another without any interested
3rd party being able to see that the element is now within the
CDS at any time::
Would your "ATOMIC stuff" really handle Dan's example, tho?
A key element in Dan's example is the "ordered singly-linked list",
i.e. you need to walk the list an unbounded number of steps until you
can find the insertion point, so I don't think you can do the whole
thing inside your ATOMIC stuff.
Technically, yes, you can put a search loop in an ATOMIC event.
Whether you want to or not is a different story. My example
illustrated the ATOMIC manipulation of the event without a search.
In addition, if the search is long, the code developer should either
do the search before the event, or grab a Mutex to prevent
interference during the search and rest of the critical region.
The problem, as specified, excluded this idea that the list was
long.
The point was simply to show a scenario where your atomic
operation proposal breaks down, and where you might want to
disable interrupt delivery. The point wasn't to ask whether
the specific problem could be recast in a different way; it
probably can. The question was to ask whether your proposal
could support that specific problem. I don't see how it can.
IOW, you'd have to somehow move the list-walk outside of the atomic
section.
Yes,
E.g. if you know elements are only ever removed from the tail of the
list, you can presumably walk the list outside of the ATOMIC stuff and
then do the "remove from the queue and insert into the list" atomically
since it's a small&finite number of steps (with a test to loop back to
the list-walk if someone else inserted or removed anything at that point
in the mean time).
Or maintain a tail pointer.
There's no guarantee that the tail is the correct place to
insert the element, vis the order of the elements.
https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html
On Mon, 24 Mar 2025 23:45:49 +0000, Dan Cross wrote:
In article <a86295425dfb043a1b44e550a1c05659@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
[snip]
My 66000 ATOMIC stuff is designed such that if control has to leave
the ATOMIC event, all HW knowledge of being in that event disappears
and IP is modified (prior to control transfer) such that upon return
SW knows the event failed (or never started), and that no update of participating data in the event ever becomes visible.
HW can do this with address-based events, but not with data-based
events. LDL-STC are address-based events used to manipulate data-
based events. Mine just allow more LDs and STs in the event than 1.
So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
normal ATOMIC primitives everyone and his brother provide.
I definitely do Not want that property. Having a lock
"disappear" on delivery of an interrupt seems like it would be a
great way to introduce silent corruption.
It means that if a thread owning a lock is interrupted, that
thread gives the lock back--so there is no priority inversion
and no lock-starvation. Associated with the give back, and the
IP of the interrupted thread no longer points inside the
critical section.
Nothing gets lost.
{{Perhaps disappear was a poorly chosen word.}}
If your atomic primitives are not sufficient to cover an entire
critical section, as I believe we have established, and are
mostly intended to be used for building concurrency primitives
like locks, then I expect that software would take the lock, see
that it succeeded, and enter the critical section.
If at that
point the lock were silently discarded on receipt of an
interrupt,
At this point your thread is not IN the critical section !!
The closest you can be is at the first instruction beginning
an ATOMIC event (the never happened case) or you can also
be at the point where you KNOW that some other thread interfered
with your actions, and that they saw a completely consistent
state--now it is your turn (again) observing a completely
consistent but new state.
then potentially some other thread could subsequently
take the lock,
Since you are not in the critical section, you don't care.
No critical data has been modified, there is no inconsistent
state to be observed; and YOUR Instruction Pointer is not
pointing into instructions within the ATOMIC event !!
enter the critical section, and observe a data
structure in an inconsistent state (leading to a crash in the
best case).
There is no inconsistent state--in rather the same way a core backs
up a branch mispredict, the core backs up the ATOMIC event to appear
as if it never even started !!
This is precisely the sort of thing we build mutual
exclusion primitives to wrap around critical sections to avoid.
And also why those primitives give rise to priority inversion
and lock-starvation.
If, on the other hand, the atomic primitives are just meant for
the spin loop, then I don't really see how that's all that
useful compared to LL/SC.
Any number of respected people have asked for a pipelined LL-SC;
where one can LL multiple datums and later SC multiple datums.
This is what ESM can provide.
And you still need some way to ensure
that the critical section is not interrupted.
Just a way to detect if it happened; and if it happens;
means to back up to a point where it does not matter if
interference happened "right there"--that is: outside of
a critical section.
In article <684d32898ec85b91bcb9dcdb97d8065a@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
My ATOMIC stuff allows for the construction, in SW, of pretty
much any ATOMIC primitive anyone would like. It is, in general,
not suited to guard critical regions, but to grab the lock which
does.
Ok, that's good to understand.
Most uses of My ATOMIC stuff is to grab/change/update data
that then guards various critical regions. More like a multi-
valued version of LDL-STC.
I don't see how constructing the primitive that way is terribly
useful compared to just issuing multiple load-locked/store-cond
instructions; in particular, it doesn't seem like the sort of
thing that composes well: every sequence that uses it must be
coded to use it, explicitly, instead of just nesting calls to
"lock".
On Mon, 24 Mar 2025 23:45:49 +0000, Dan Cross wrote:
So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
normal ATOMIC primitives everyone and his brother provide.
I definitely do Not want that property. Having a lock
"disappear" on delivery of an interrupt seems like it would be a
great way to introduce silent corruption.
It means that if a thread owning a lock is interrupted, that
thread gives the lock back--so there is no priority inversion
and no lock-starvation. Associated with the give back, and the
IP of the interrupted thread no longer points inside the
critical section.
Nothing gets lost.
mitchalsup@aol.com (MitchAlsup1) writes:
Nothing gets lost.
This isn't a hardware transaction. If the thread has updated data
and you take away the lock, the system is now inconsistent
and nasal daemons will ensue.
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <684d32898ec85b91bcb9dcdb97d8065a@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
My ATOMIC stuff allows for the construction, in SW, of pretty
much any ATOMIC primitive anyone would like. It is, in general,
not suited to guard critical regions, but to grab the lock which
does.
Ok, that's good to understand.
Most uses of My ATOMIC stuff is to grab/change/update data
that then guards various critical regions. More like a multi-
valued version of LDL-STC.
I don't see how constructing the primitive that way is terribly
useful compared to just issuing multiple load-locked/store-cond instructions; in particular, it doesn't seem like the sort of
thing that composes well: every sequence that uses it must be
coded to use it, explicitly, instead of just nesting calls to
"lock".
LL/SC has been shown to be less efficient for contended
locks than atomic operations when the core count exceeds
single digits.
It's difficult to argue against the standard set of atomic
primitives (the PCIe set: compare and swap, load and add,
swap) for scalability; ARM has extended the list to
include exclusive or, set-bit, clear-bit, min, max, et al.
Analysis comparing with LL/SC:
https://cpufun.substack.com/p/atomics-in-aarch64
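For concreteness, a hedged C11 sketch of the two shapes being compared
(which instruction the compiler actually emits depends on the target and
flags, e.g. LDADD needs ARMv8.1-A atomics): a CAS retry loop, which is
roughly what LL/SC-style code generation degenerates to under contention,
versus a single fetch-add that the interconnect or LLC can execute near
the data.

    #include <stdatomic.h>

    void inc_casloop(_Atomic unsigned long *ctr)
    {
        unsigned long old = atomic_load_explicit(ctr, memory_order_relaxed);
        /* every failure re-reads and retries; under contention the cache
           line bounces between cores and the retries pile up */
        while (!atomic_compare_exchange_weak_explicit(
                   ctr, &old, old + 1,
                   memory_order_relaxed, memory_order_relaxed))
            ;
    }

    void inc_fetch_add(_Atomic unsigned long *ctr)
    {
        /* single atomic add; on ARMv8.1-A this can become one LDADD */
        atomic_fetch_add_explicit(ctr, 1, memory_order_relaxed);
    }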
On Tue, 25 Mar 2025 13:45:39 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Nothing gets lost.
This isn't a hardware transaction. If the thread has updated data
and you take away the lock, the system is now inconsistent
and nasal daemons will ensue.
In HW there is a little thing known as branch misprediction repair,
where all the work done is made to appear as if those instructions
were never executed to begin with. The size of the branch mispredict
window is on the order of 300 instructions now.
With Spectre and Meltdown even cache updates must be delayed until
after the causing instruction retires.
In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
Sometimes you really don't want to be interrupted.
And sometimes you don't want to be interrupted unless the
"house is on fire"; I cannot see a time when "the house is
on fire" that you would not want to take the interrupt.
Is there one ?!?
Consider a thread that takes a spinlock; suppose some
high-priority interrupt comes in while the thread is holding
that lock. In response to the interrupt, software decides to
suspend the thread and switch some other thread; that thread
wants to lock the spin lock that the now-descheduled thread is
holding: a classic deadlock scenario.
Terminology: mutexes coordinate mutual exclusion between threads,
spinlocks coordinate mutual exclusion between cpu cores.
Windows "critical sections" are mutexes with a fast path.
A spin lock is simply any lock where you spin trying to acquire
the lock, as opposed to a blocking synchronization protocol.
Here I'm using the terminology of Herlihy and Shavit [Her08].
Traditional Unix "sleep" and "wakeup" are an example of a
blocking protocol, where a thread may "sleep" on some "channel",
yielding the locking thread while the lock cannot be acquired,
presumably scheduling something else until the thread is later
marked runnable by virtue of something else calling "wakeup" on
the sleep channel.
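A hedged sketch of that blocking style using POSIX primitives rather than
the old kernel sleep()/wakeup() pair (the names here are illustrative):
the waiter gives up the CPU until something signals the "channel".

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t m    = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  chan = PTHREAD_COND_INITIALIZER;  /* the "channel" */
    static bool busy = false;

    void acquire_blocking(void)
    {
        pthread_mutex_lock(&m);
        while (busy)
            pthread_cond_wait(&chan, &m);   /* "sleep": yields the CPU */
        busy = true;
        pthread_mutex_unlock(&m);
    }

    void release_blocking(void)
    {
        pthread_mutex_lock(&m);
        busy = false;
        pthread_mutex_unlock(&m);
        pthread_cond_signal(&chan);         /* "wakeup" on the channel */
    }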
But note that I am deliberately simplifying in order to
construct a particular scenario in which a light-weight
synchronization primitive is being used for mutual exclusion
between concurrent entities, hence mentioning spinlocks
specifically.
Suggesting that spin locks are only applicable to
multiprocessing scenarios is flatly incorrect.
Software may use
the technique to spin on a "busy" bit on some IP in a device for
example, even on a uniprocessor system.
A valid response here might be, "don't context switch from the
interrupt handler; use a DPC instead". That may be valid, but
it puts a constraint on the software designer that may be onerous
in practice: suppose the interrupt happens in the scheduler,
while examining a run queue or something. A DPC object must be
available, etc.
That is exactly what DPC/SoftIrq are designed to do - allow the bulk of
the kernel, like the scheduler, non-paged heap management, IO pre and
post processing, to be interrupted without causing reentrancy.
The higher level device interrupt enqueues a request to the lower software interrupt level, which is processed when it will not cause reentrancy.
That constraint is by design and any good OS will crash if violated.
Yes, I'm aware of the technique. That wasn't the point.
Further, software must now consider the complexity of
potentially interruptable critical sections. From the
standpoint of reasoning about already-complex concurrency issues
it's simpler to be able to assert that (almost) all interrupt
delivery can be cheaply disabled entirely, save for very
special, specific, scenarios like NMIs. Potentially switching
away from a thread holding a spinlock sort of defeats the
purpose of a spinlock in the first place, which is a mutex
primitive designed to avoid the overhead of switching.
This is why I mentioned the terminology thing: threads do not hold
spinlocks, they hold mutexes.
See above. Threads can certainly "hold" a spin lock, as they
can hold any kind of lock. To quote from sec 7.6.1 of [Val96],
page 202:
|On a uniprocessor, if a thread tries to acquire a spin lock
|that is already held, it will loop forever. Multiprocessor
|algorithms, however, must operate correctly regardless of the
|number of processors, which means that they should handle the
|uniprocessor case as well. This requires strict adherence to
|the rule that threads not relinquish control of the CPU while
|holding a spin lock.
Threads and mutexes are a fiction created
by the OS scheduler, and switching threads while waiting for a mutex
is exactly what they are supposed to do.
Spinlocks are used by cores to coordinate access to shared memory by
things like a OS scheduler looking at list of threads waiting for a mutex.
Spinlocks, of the form I was describing, are an optimization for
cases where the critical section guarded by the lock is short,
and the overhead of invoking a scheduler and context switching
to some other thread is higher than just spinning until the lock
becomes available again. The whole point is to avoid blocking
and switching.
E.g., as [Vaj11] puts it (page 98):
|Spin locks have the advantage of preventing pre-emption by the
|OS (due to blocking while waiting for the lock)
That is, the advantage is because the thread does not block and
yield.
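By contrast, a minimal test-and-set spinlock in C11 atomics (a sketch, not
any particular kernel's implementation) makes the point concrete:
acquisition is just a busy-wait, so the holder must never be descheduled
while inside.

    #include <stdatomic.h>

    typedef struct { atomic_flag f; } spinlock_t;  /* init with ATOMIC_FLAG_INIT */

    static inline void spin_lock(spinlock_t *l)
    {
        while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
            ;   /* spin; no block, no yield to the scheduler */
    }

    static inline void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->f, memory_order_release);
    }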
I like to think of it as all having to do with hats.
The cpu is wearing one of three hats: a thread hat when it pretends to be
a time shared execution machine; a core hat when it acts as a single
execution machine running non-reentrant things like a scheduler which
creates the fiction of threads and processes (multiple virtual spaces);
and an interrupt hat when executing nestable reentrant interrupts.
The interrupt mechanism coordinates exactly when and how we switch hats.
I suppose you are lumping system calls into the overall bucket
of "interrupts" in this model, regardless of whether one might
use a dedicated supervisor call instruction (e.g., something
like x86 SYSENTER or SYSCALL or ARM SVC or RISC-V's ECALL)?
The hat analogy feels strained to me, and too colloquial to be
useful.
- Dan C.
References:
[Her08] Maurice Herlihy and Nir Shavit. 2008. _The Art of
Multiprocessor Programming_. Morgan Kaufmann, Burlington, MA.
[Vah96] Uresh Vahalia. 1996. _Unix Internals: The New
Frontiers_. Prentice Hall, Upper Saddle River, NJ.
[Vaj11] Andras Vajda. 2011. _Programming Many-Core Chips_.
Springer, New York, NY.
a going to make a big difference here. But I also wonder at the
internal implementation of these instructions in e.g. that
Fujitsu chip that allows throughput to stay more or less
constant as the number of cores increases; I'd think that cache
coherency traffic as threads contend for a single cache line
would create a bottleneck leading to (at least) a linear
slow down.
In article <b201b9212700f9359a831e03b1d60cef@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Tue, 25 Mar 2025 13:45:39 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Nothing gets lost.
This isn't a hardware transaction. If the thread has updated data
and you take away the lock, the system is now inconsistent
and nasal daemons will ensue.
In HW there is a little thing known as branch misprediction repair,
where all the work done is made to appear as if those instructions
were never executed to begin with. The size of the branch mispredict
window is on the order of 300 instructions now.
With Spectre and Meltdown even cache updates must be delayed until
after the causing instruction retires.
But again, that's irrelevant. The "critical sections" in
question here are software abstractions created by system
software.
a going to make a big difference here. But I also wonder at the
internal implementation of these instructions in e.g. that
Fujitsu chip that allows throughput to stay more or less
constant as the number of cores increases; I'd think that cache
coherency traffic as threads contend for a single cache line
would create a bottleneck leading to (at least) a linear
slow down.
IIUC the idea is that the atomic-add is sent to the memory and performed there, instead of moving the data to the cache.  So there's no
contention for the cache line because the data is not in a cache at all.
Dan Cross wrote:
I didn't.
I am saying the OS by design guards *ITS* different data structures
with different mechanisms, and those mechanisms must only be used in
a very controlled manner. And if you want to do something that requires access to one of *ITS* data structures then you must obey *ITS* rules.
IIUC the idea is that the atomic-add is sent to the memory and performed there, instead of moving the data to the cache.  So there's no
contention for the cache line because the data is not in a cache at all.
Exactly. The LLC may also support the atomics directly.
Additionally, with PCIe, the atomic can be executed at
the endpoint function.
Dan Cross wrote:
In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
Sometimes you really don't want to be interrupted.
And sometimes you don't want to be interrupted unless the
"house is on fire"; I cannot see a time when "the house is
on fire" that you would not want to take the interrupt.
Is there one ?!?
Consider a thread that takes a spinlock; suppose some
high-priority interrupt comes in while the thread is holding
that lock. In response to the interrupt, software decides to
suspend the thread and switch some other thread; that thread
wants to lock the spin lock that the now-descheduled thread is
holding: a classic deadlock scenario.
Terminology: mutexes coordinate mutual exclusion between threads,
spinlocks coordinate mutual exclusion between cpu cores.
Windows "critical sections" are mutexes with a fast path.
A spin lock is simply any lock where you spin trying to acquire
the lock, as opposed to a blocking synchronization protocol.
Here I'm using the terminology of Herlihy and Shavit [Her08].
I'm using terminology in common use on multiprocessor systems like
VMS since 1980's and later WinNT.
Each OS data structure has different reentrancy requirements.
- Ones accessed by threads are guarded by mutexes.
Example on Windows would be the page tables as paging is done by threads.
- Ones accessed by OS software interrupt level are guarded by spinlocks.
Example on Windows is the Scheduler tables.
- Ones accessed by HW interrupts are guarded by interrupt spinlocks.
Example on Windows is device interrupt service routines.
Traditional Unix "sleep" and "wakeup" are an example of a
blocking protocol, where a thread may "sleep" on some "channel",
yielding the locking thread while the lock cannot be acquired,
presumably scheduling something else until the thread is later
marked runnable by virtue of something else calling "wakeup" on
the sleep channel.
But note that I am deliberately simplifying in order to
construct a particular scenario in which a light-weight
synchronization primitive is being used for mutual exclusion
between concurrent entities, hence mentioning spinlocks
specifically.
Suggesting that spin locks are only applicable to
multiprocessing scenarios is flatly incorrect.
I didn't.
I am saying the OS by design guards *ITS* different data structures
with different mechanisms, and those mechanisms must only be used in
a very controlled manner. And if you want to do something that requires access to one of *ITS* data structures then you must obey *ITS* rules.
In Windows and VMS and from what I understand of *nix spinlocks guard most
of the core OS data structures with spinlocks acquired at the software interrupt level. If you are not at that level then the spinlock will not provide the necessary synchronization for access to that data.
Software may use
the technique to spin on a "busy" bit on some IP in a device for
example, even on a uniprocessor system.
A valid response here might be, "don't context switch from the
interrupt handler; use a DPC instead". That may be valid, but
it puts a constraint on the software designer that may be onerous
in practice: suppose the interrupt happens in the scheduler,
while examining a run queue or something. A DPC object must be
available, etc.
That is exactly what DPC/SoftIrq are designed to do - allow the bulk of
the kernel, like the scheduler, non-paged heap management, IO pre and
post processing, to be interrupted without causing reentrancy.
The higher level device interrupt enqueues a request to the lower software interrupt level, which is processed when it will not cause reentrancy.
That constraint is by design and any good OS will crash if violated.
Yes, I'm aware of the technique. That wasn't the point.
Further, software must now consider the complexity of
potentially interruptable critical sections. From the
standpoint of reasoning about already-complex concurrency issues
it's simpler to be able to assert that (almost) all interrupt
delivery can be cheaply disabled entirely, save for very
special, specific, scenarios like NMIs. Potentially switching
away from a thread holding a spinlock sort of defeats the
purpose of a spinlock in the first place, which is a mutex
primitive designed to avoid the overhead of switching.
This is why I mentioned the terminology thing: threads do not hold
spinlocks, they hold mutexes.
See above. Threads can certainly "hold" a spin lock, as they
can hold any kind of lock. To quote from sec 7.6.1 of [Val96],
page 202:
When a particular OS defined spinlock is used to guard a particular OS defined data structure, exactly WHEN the OS allows this decides how
the OS avoids all the nasty interrupt pitfalls you identified.
|On a uniprocessor, if a thread tries to acquire a spin lock
|that is already held, it will loop forever. Multiprocessor
|algorithms, however, must operate correctly regardless of the
|number of processors, which means that they should handle the
|uniprocessor case as well. This requires strict adherence to
|the rule that threads not relinquish control of the CPU while
|holding a spin lock.
Yes, which is what I was saying.
I am describing HOW that is accomplished inside an OS.
The way you prevent a thread from relinquishing control while holding a spinlock is to first switch hats from thread level to software interrupt level, which blocks the scheduler because it also runs at SWIL,
then acquire the spinlock.
Raising the interrupt priority level from passive (aka thread) level
to SWI level (called dispatch level on Windows) blocks the thread switch. This ensures non-reentrant algorithms are serialized.
The core may then acquire spinlocks guarding data structures.
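A sketch of that sequence using Windows-kernel-style names purely as
illustration (KeRaiseIrql and the DPC-level spinlock calls are standard DDK
names, but this is not code from any of the posts): raise to dispatch/SWI
level first so this core cannot be rescheduled, then take the spinlock that
excludes the other cores.

    #include <wdm.h>    /* Windows kernel-mode DDK header, for illustration */

    void touch_scheduler_data(KSPIN_LOCK *lock)
    {
        KIRQL old_irql;

        KeRaiseIrql(DISPATCH_LEVEL, &old_irql);  /* hat: thread -> core      */
        KeAcquireSpinLockAtDpcLevel(lock);       /* now exclude other cores  */

        /* ... touch the scheduler-level (SWI-level) data structure ... */

        KeReleaseSpinLockFromDpcLevel(lock);
        KeLowerIrql(old_irql);                   /* back to thread level     */
    }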
Threads and mutexes are a fiction created
by the OS scheduler, and switching threads while waiting for a mutex
is exactly what they are supposed to do.
Spinlocks are used by cores to coordinate access to shared memory by
things like a OS scheduler looking at list of threads waiting for a mutex.
Spinlocks, of the form I was describing, are an optimization for
cases where the critical section guarded by the lock is short,
and the overhead of invoking a scheduler and context switching
to some other thread is higher than just spinning until the lock
becomes available again. The whole point is to avoid blocking
and switching.
E.g., as [Vaj11] puts it (page 98):
|Spin locks have the advantage of preventing pre-emption by the
|OS (due to blocking while waiting for the lock)
That is, the advantage is because the thread does not block and
yield.
And I am describing HOW the OS accomplishes this.
Disabling the scheduler is the same as raising the priority to SWI level. Once the scheduler is disabled THEN you may acquire a spinlock
that guards a shared data structure accessed at SWI level.
But you may NOT access a shared data structure accessed at interrupt level. For that you'd need an interrupt spinlock acquired at the device ISR
priority interrupt level. And note that the ISR interrupt priority level
is assigned by the OS itself, usually when the device is mounted.
I like to think of it as all having to do with hats.
The cpu is wearing one of three hats: a thread hat when it pretends to be a time shared execution machine; a core hat when it acts as a single
execution machine running non-reentrant things like a scheduler which
creates the fiction of threads and processes (multiple virtual spaces);
and an interrupt hat when executing nestable reentrant interrupts.
The interrupt mechanism coordinates exactly when and how we switch hats.
I suppose you are lumping system calls into the overall bucket
of "interrupts" in this model, regardless of whether one might
use a dedicated supervisor call instruction (e.g., something
like x86 SYSENTER or SYSCALL or ARM SVC or RISC-V's ECALL)?
Syscall switches a thread from user mode to supervisor mode.
The hat analogy feels strained to me, and too colloquial to be
useful.
I offered it because it answers your prior questions about how
OS avoid the kinds of deadlocks and race conditions you described.
a going to make a big difference here. But I also wonder at the
internal implementation of these instructions in e.g. that
Fujitsu chip that allows throughput to stay more or less
constant as the number of cores increases; I'd think that cache
coherency traffic as threads contend for a single cache line
would create a bottleneck leading to (at least) a linear
slow down.
IIUC the idea is that the atomic-add is sent to the memory and performed there, instead of moving the data to the cache.  So there's no
contention for the cache line because the data is not in a cache at all.
Stefan Monnier <monnier@iro.umontreal.ca> writes:
a going to make a big difference here. But I also wonder at the
internal implementation of these instructions in e.g. that
Fujitsu chip that allows throughput to stay more or less
constant as the number of cores increases; I'd think that cache
coherency traffic as threads contend for a single cache line
would create a bottleneck leading to (at least) a linear
slow down.
IIUC the idea is that the atomic-add is sent to the memory and performed
there, instead of moving the data to the cache.  So there's no
contention for the cache line because the data is not in a cache at all.
Exactly. The LLC may also support the atomics directly.
Additionally, with PCIe, the atomic can be executed at
the endpoint function.
On Tue, 25 Mar 2025 15:10:19 +0000, Dan Cross wrote:
In article <b201b9212700f9359a831e03b1d60cef@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Tue, 25 Mar 2025 13:45:39 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Nothing gets lost.
This isn't a hardware transaction. If the thread has updated data
and you take away the lock, the system is now inconsistent
and nasal daemons will ensue.
In HW there is a little thing known as branch misprediction repair,
where all the work done is made to appear as if those instructions
were never executed to begin with. The size of the branch mispredict
window is on the order of 300 instructions now.
With Spectre and Meltdown even cache updates must be delayed until
after the causing instruction retires.
But again, that's irrelevant. The "critical sections" in
question here are software abstractions created by system
software.
Say you have a critical section (of your description) that was
preceded by an if-statement that was predicted to enter CS, but
1,000 cycles later the branch is resolved to not enter the CS.
{{Branch was dependent on an MMI/O location down the PCIe tree
which is why it took so long to resolve.}}
HW Backs the core up so that the RF and all memory locations
contain their original values at the point of the mispredicted
branch.
Can thread running on core even tell if you entered CS or not !
No, your thread cannot.
Scott Lurndal wrote:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
a going to make a big difference here. But I also wonder at the
internal implementation of these instructions in e.g. that
Fujitsu chip that allows throughput to stay more or less
constant as the number of cores increases; I'd think that cache
coherency traffic as threads contend for a single cache line
would create a bottleneck leading to (at least) a linear
slow down.
IIUC the idea is that the atomic-add is sent to the memory and performed
there, instead of moving the data to the cache.  So there's no
contention for the cache line because the data is not in a cache at all.
Exactly. The LLC may also support the atomics directly.
Additionally, with PCIe, the atomic can be executed at
the endpoint function.
Those features are possible but there is no indication that is
what that web page is measuring.
I agree with Dan that it looks more like a test of the cache efficiency or
the cache line pinning mechanism than a test of LL/SC vs AtomicXxxx.
First, it doesn't say what systems these are on.
It says the cores is Cavium Thunder X2 and Fujitsu A64FX
but does not describe the systems.
For each platform it needed to establish a base line performance for
moving lines between caches, the same ++ loops but without atomics.
Also Arm is weak ordered. At least one version of those test loops should
have fences otherwise this doesn't tell you anything about how it might
behave in real use.
In article <0343529d63f68c76b5e0d227ef6e39dd@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
Say you have a critical section (of your description) that was
preceded by an if-statement that was predicted to enter CS, but
1,000 cycles later the branch is resolved to not enter the CS.
{{Branch was dependent on an MMI/O location down the PCIe tree
which is why it took so long to resolve.}}
HW Backs the core up so that the RF and all memory locations
contain their original values at the point of the mispredicted
branch.
Can thread running on core even tell if you entered CS or not !
No, your thread cannot.
Again, how is that relevant? The branch predictor can only
handle speculative execution of so-many instructions.
Attempts
to perform more instructions than that will stall the pipeline.
A critical section as defined by software may contain more than
that number of instructions.
A critical section, as defined by
software, may touch more cache lines than your atomic event
proposal can handle.
Thus, for your architecture to be useful,
you must provide some mechanism for handling those
eventualities.
Your stuff sounds fine for getting into the critsec; but it
doesn't help you once you're in it, and it seems like it may
hurt you if some higher-priority thing comes along and yanks
your lock out from under you.
- Dan C.
On 3/25/2025 4:20 AM, Dan Cross wrote:
In article <vrt37r$27c13$4@dont-email.me>,
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
On 3/24/2025 1:28 PM, Dan Cross wrote:
There's no guarantee that the tail is the correct place to
insert the element, vis the order of the elements.
https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html
If you go back and review the thread, you'd see that the problem
is to remove an element from a queue, and insert it into an
ordered list. This part of the discussion is concerning the
latter operation.
There are lock-free ways to insert into an ordered list, indeed.
From what I gather, you've said to just eschew the atomic event
stuff and build a traditional lock using those primitives. But
you have also said that a traditional lock can be silently
unlocked if the thread holding it (which, apparently, is an
abstraction known to hardware) is preempted. It will be like
the lock had never been acquired.
On Tue, 25 Mar 2025 17:54:42 +0000, Dan Cross wrote:
In article <0343529d63f68c76b5e0d227ef6e39dd@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
Say you have a critical section (of your description) that was
preceded by an if-statement that was predicted to enter CS, but
1,000 cycles later the branch is resolved to not enter the CS.
{{Branch was dependent on an MMI/O location down the PCIe tree
which is why it took so long to resolve.}}
HW Backs the core up so that the RF and all memory locations
contain their original values at the point of the mispredicted
branch.
Can thread running on core even tell if you entered CS or not !
No, your thread cannot.
Again, how is that relevant? The branch predictor can only
handle speculative execution of so-many instructions.
Around 300 at present.
Know of any critical sections that long?
Attempts
to perform more instructions than that will stall the pipeline.
Obviously.
A critical section as defined by software may contain more than
that number of instructions.
Sure, it's possible.
But if DECODE cannot push the entire critical section into the
execution window, then BY DEFINITION, it MUST be able to make
the entire critical section appear as if you never started on
it !!!
IP gets backed up as if you never entered CS
RF gets backed up as if you never entered CS
LOCK variable never got written to SNOOPable cache where others can see
it
Cache memory gets put back in the "you never got here" states too.
[...a lot of the same stuff, repeated over and over snipped]
A critical section, as defined by
software, may touch more cache lines than your atomic event
proposal can handle.
Sure, but then you would only be using my ATOMICs for locking/
unlocking not for the work going on.
Thus, for your architecture to be useful,
you must provide some mechanism for handling those
eventualities.
As stated about 15 responses ago, just use them as ATOMIC primitive
generators.
Your stuff sounds fine for getting into the critsec; but it
doesn't help you once you're in it, and it seems like it may
hurt you if some higher-priority thing comes along and yanks
your lock out from under you.
I conceded that point at least 2 days ago.
On 3/25/2025 3:21 PM, Dan Cross wrote:
In article <vrv483$3ekk$2@dont-email.me>,
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
On 3/25/2025 4:20 AM, Dan Cross wrote:
In article <vrt37r$27c13$4@dont-email.me>,
Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
On 3/24/2025 1:28 PM, Dan Cross wrote:
There's no guarantee that the tail is the correct place to
insert the element, vis the order of the elements.
https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html
If you go back and review the thread, you'd see that the problem
is to remove an element from a queue, and insert it into an
ordered list. This part of the discussion is concerning the
latter operation.
There are lock-free ways to insert into an ordered list, indeed.
How is that relevant?
inserting into an ordered list is not relevant?
Dan Cross [2025-03-25 22:19:07] wrote:
From what I gather, you've said to just eschew the atomic event
stuff and build a traditional lock using those primitives. But
you have also said that a traditional lock can be silently
unlocked if the thread holding it (which, apparently, is an
abstraction known to hardware) is preempted. It will be like
the lock had never been acquired.
Here's the misunderstanding: the lock that can be silently taken away
is the lock implemented in the CPU to run an ATOMIC sequence, not the
lock implemented as a data-structure on top of it: once you've run
(atomically) your `lock_acquire`, of course that lock is yours. The CPU
doesn't even know that this data-structure is a lock and has no
way to steal it.
Stefan
In article <e4ecfb53ef4f33f10b6d2548d5930159@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
Your stuff sounds fine for getting into the critsec; but it
doesn't help you once you're in it, and it seems like it may
hurt you if some higher-priority thing comes along and yanks
your lock out from under you.
I conceded that point at least 2 days ago.
Ok, now we're back to the question then: what purpose does it
serve?
It doesn't seem like it can be usefully applied to
anything other than building traditional mutexes based on spin
locks, and even then I'm not sure if that's correct (see above).
- Dan C.
Even so, I don't see the point of the proposed atomic framework.
It seems to have no significant advantage over a more
traditional scheme; CAS, LL-SC, whatever seem more general and
less intertwined with primitives that the hardware has no real
business injecting itself into (like threads).
On Tue, 25 Mar 2025 22:19:07 +0000, Dan Cross wrote:
In article <e4ecfb53ef4f33f10b6d2548d5930159@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
Your stuff sounds fine for getting into the critsec; but it
doesn't help you once you're in it, and it seems like it may
hurt you if some higher-priority thing comes along and yanks
your lock out from under you.
I conceded that point at least 2 days ago.
Ok, now we're back to the question then: what purpose does it
serve?
You (well Chris) can write critical sections where a low priority
holder can give way to the higher priority requestor such that
the low priority holder reverts back to before it entered CS,
and does not see the lock being stolen.
So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
normal ATOMIC primitives everyone and his brother provide.
For the sizes of CSs this covers--it prevents priority inversion.
It doesn't seem like it can be usefully applied to
anything other than building traditional mutexes based on spin
locks, and even then I'm not sure if that's correct (see above).
And here I thought you had an imagination .....
Even so, I don't see the point of the proposed atomic framework.
It seems to have no significant advantage over a more
traditional scheme; CAS, LL-SC, whatever seem more general and
less intertwined with primitives that the hardware has no real
business injecting itself into (like threads).
AFAICT it's somewhat similar to ARM's LL/SC (which can also cover some
number of cache lines) except that it takes care of looping for you: if
the final store fails because of interference My 66000 will "simply" abort
the whole operation (reusing the speculation hardware) and start over,
instead of having to write an explicit loop.
Without actual hardware and software to try it's hard to assess how well
it's going to work, but the design sounds good: it should lead to more
compact code than LL/SC, be more general/friendly to lock-free coding
than T&S/C&S, and also compared to LL/SC it should behave better under
high contention because it is careful to guarantee overall forward
progress (in case of conflict between two cores, at least one of them
can finish successfully).
IIRC there's also some kind of mechanism to retry more cleverly, i.e. so
that when N cores are in conflicts, not only one of them succeeds, but
the N-1 remaining cores have a fighting chance of avoiding bumping into
each other next time around. I don't know the details, tho, and IIUC
this is not automatic: some relevant info is provided but it's up to the
code to do something useful with it.
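A minimal sketch of what "doing something useful with it" could look like,
assuming the esm* intrinsic names used elsewhere in this thread (pointer-
argument form, as in the xcas example later); cpu_relax() and the whole
exponential-backoff policy are pure assumptions, not anything My 66000
specifies.

#include <stdint.h>

static inline void cpu_relax(void) { /* pause/yield hint, platform specific */ }

int
bounded_increment(volatile uint32_t *p, uint32_t limit)
{
    unsigned failures = 0;

    for (;;) {
        uint32_t v = esmLOCKload(p);
        if (v >= limit)
            return 0;                    /* nothing to do */
        if (!esmINTERFERENCE()) {
            esmLOCKstore(p, v + 1);      /* commit the event */
            return 1;
        }
        /* Interference detected: take the alternate exit and back off so
           the losing cores are less likely to collide with each other on
           the next attempt. */
        unsigned spins = 1u << (failures < 10 ? failures : 10);
        for (unsigned i = 0; i < spins; i++)
            cpu_relax();
        failures++;
    }
}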
On Tue, 18 Mar 2025 18:51:57 +0000, EricP wrote:
This affects what it can do with PCR inside the WoU if it causes a state
change that cannot be undone, for example, reading the interrupt data
FIFO.
Since the device is accessed through MMI/O space, and since My 66000
demands sequential consistency for MMI/O accesses, the LD of the
interrupt data FIFO cannot leave the confines of the core until
all pending memory references from the interrupted thread have
"been performed".
On Tue, 18 Mar 2025 21:59:51 +0000, Robert Finch wrote:
On 2025-03-18 2:51 p.m., EricP wrote:
Seems like it could work. There are difficulties using counters with the
Q+ ISA as multiple interrupt enable bits could be manipulated at the
same time. One interrupt may be enabled and another one disabled by the
same instruction.
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
When running at any privilege, do you not want to accept an interrupt
at higher priority with as little delay as possible ??
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
When running at any privilege, do you not want to accept an interrupt
at higher priority with as little delay as possible ??
Does Interrupt Disable help in this goal ??
Is there a SANE way to avoid disabling of interrupts ??
On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:[...]
Method 1 is simplest, it injects the interrupt request at Retire
as that's where the state of everything is synchronized.
Lowest interrupt Latency[...]
Highest waste of power (i.e., work)
Method 2 pipelines the switch by injecting the interrupt request at
Fetch.
Interrupt latency is dependent on executing instructions,
Lowest waste of power
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 18 Mar 2025 21:59:51 +0000, Robert Finch wrote:
On 2025-03-18 2:51 p.m., EricP wrote:
Seems like it could work. There are difficulties using counters with the
Q+ ISA as multiple interrupt enable bits could be manipulated at the
same time. One interrupt may be enabled and another one disabled by the
same instruction.
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
Because sometimes you need to. For example, when making a configuration change to the device, you'll want to disable interrupts that may be
generated as a result of the configuration change (e.g. when flushing
a FIFO on the device, you may not want the 'FIFO Empty' interrupt
to be asserted).
Granted, there might be an enable gate on the device itself
that can be used for that, but generally speaking there are other cases
where it will make sense to disable delivery at the interrupt controller
or core (the later, particularly, during certain critical sections).
When running at any privilege, do you not want to accept an interrupt
at higher priority with as little delay as possible ??
Not while the core is executing an OS/hypervisor/device driver critical section sensitive to interrupts.
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 18 Mar 2025 18:51:57 +0000, EricP wrote:
This affects what it can do with PCR inside the WoU if it causes a state
change that cannot be undone, for example, reading the interrupt data
FIFO.
Since the device is accessed through MMI/O space, and since My 66000
demands sequential consistency for MMI/O accesses, the LD of the
interrupt data FIFO cannot leave the confines of the core until
all pending memory references from the interrupted thread have
"been performed".
Be careful when relying on the ordering of outbound MMI/O accesses.
They're not ordered with respect to DMA accesses initiated from the device.
With PCI Express, software can also enable RO (Relaxed Ordering), which
changes the normal fairly strict ordering rules, particularly with respect
to non-posted transactions vis-a-vis posted transactions.
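A hedged sketch of the usual driver-side answer to that caution (register
names and offsets here are invented): a read from the device is non-posted,
so it cannot complete until the posted MMI/O writes ahead of it have been
accepted, which is why drivers "flush" posted writes by reading something
back; ordering against the device's own DMA is a separate matter and still
needs the device's completion protocol.

#include <stdint.h>

#define REG_DOORBELL 0x00u   /* hypothetical device registers */
#define REG_STATUS   0x04u

static inline void
mmio_write32(volatile void *base, uint32_t off, uint32_t v)
{
    *(volatile uint32_t *)((volatile char *)base + off) = v;
}

static inline uint32_t
mmio_read32(volatile void *base, uint32_t off)
{
    return *(volatile uint32_t *)((volatile char *)base + off);
}

void
ring_doorbell_and_sync(volatile void *base)
{
    mmio_write32(base, REG_DOORBELL, 1);

    /* Non-posted read: forces the posted doorbell write to reach the
       device before this read completes. */
    (void)mmio_read32(base, REG_STATUS);

    /* Note: this says nothing about ordering versus DMA the device
       initiates into memory; that needs the device's completion protocol
       (status word, interrupt) plus an appropriate memory barrier. */
}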
In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
So that you can have some mechanism for building critical
sections that span multiple instructions, where just having
access to atomics and spin locks isn't enough.
When running at any privilege, do you not want to accept an interrupt
at higher priority with as little delay as possible ??
Usually the assumption is that the critical section is short; a
handful of instructions, perhaps, so that the cost of blocking a
high priority interrupt is low.
Does Interrupt Disable help in this goal ??
This is predicated on an incomplete view of the situation and
the intersection between hardware and software. Serving high
priority interrupts quickly is not the only consideration.
Is there a SANE way to avoid disabling of interrupts ??
It depends on your definition of "sane" in this context. One
may imagine redesigning all of a system's locking primitives,
perhaps, to avoid the need to block interrupts in most cases.
But that pushes a lot of complexity onto software for dubious
benefit, and even then, it's not clear to me that they can be
blocked in _all_ cases.
- Dan C.
On Wed, 19 Mar 2025 13:49:54 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 18 Mar 2025 21:59:51 +0000, Robert Finch wrote:
On 2025-03-18 2:51 p.m., EricP wrote:
Seems like it could work. There are difficulties using counters with the
Q+ ISA as multiple interrupt enable bits could be manipulated at the
same time. One interrupt may be enabled and another one disabled by the
same instruction.
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
Because sometimes you need to. For example, when making a configuration
change to the device, you'll want to disable interrupts that may be
generated as a result of the configuration change (e.g. when flushing
a FIFO on the device, you may not want the 'FIFO Empty' interrupt
to be asserted).
The above is more quickly performed by raising the priority level above
those of interrupts a device may send.
MitchAlsup1 [2025-03-17 21:49:18] wrote:
On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:[...]
Method 1 is simplest, it injects the interrupt request at Retire
as that's where the state of everything is synchronized.
Lowest interrupt Latency[...]
Highest waste of power (i.e., work)
Method 2 pipelines the switch by injecting the interrupt request at
Fetch.
Interrupt latency is dependent on executing instructions,
Lowest waste of power
What do existing CPUs do?
Method 2 looks insanely messy, so I'd assume method 1 is the standard approach, right?
Stefan
I am going for a simple approach. Since the CSR instruction is used to manipulate the interrupt enable bits in the status register, if there is
any CSR instruction in the pipeline, then interrupts will be masked off
until the CSR clears. One issue with this type of approach is that evil
software could issue a lot of CSRs, preventing interrupt servicing from
happening. I could fine-tune the decode to sort out CSRs just affecting
the interrupt flags. It always seems to be more logic.
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 19 Mar 2025 13:49:54 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 18 Mar 2025 21:59:51 +0000, Robert Finch wrote:
On 2025-03-18 2:51 p.m., EricP wrote:
Seems like it could work. There are difficulties using counters with the
Q+ ISA as multiple interrupt enable bits could be manipulated at the
same time. One interrupt may be enabled and another one disabled by the
same instruction.
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
Because sometimes you need to. For example, when making a configuration
change to the device, you'll want to disable interrupts that may be
generated as a result of the configuration change (e.g. when flushing
a FIFO on the device, you may not want the 'FIFO Empty' interrupt
to be asserted).
The above is more quickly performed by raising the priority level above
those of interrupts a device may send.
Most device interrupts have the same priority. So your solution blocks
all interrupts effectively, even though only the interrupt for the
particular device owned by the device driver needs to be blocked.*
There aren't enough priorities to have a separate priority for each
device instance (e.g. 1000 virtual functions owned by a single PF,
each with 10 individual interrupts).
If every device was required to have different priorities, you have
a nightmare scenario just trying to figure out which device interrupt
should be where in the priority hierarchy to avoid potential deadlocks.
* Enabling really needs to be done on a per-device-instance basis.
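MSI-X already provides the per-device-instance, per-vector knob this
footnote asks for: bit 0 of the Vector Control word in each 16-byte MSI-X
table entry masks just that vector. A minimal sketch follows (the table
layout follows the PCIe MSI-X format; how the table BAR gets mapped is not
shown, and the function name is an assumption).

#include <stdint.h>

struct msix_entry {              /* one 16-byte MSI-X table entry */
    uint32_t msg_addr_lo;
    uint32_t msg_addr_hi;
    uint32_t msg_data;
    uint32_t vector_ctrl;        /* bit 0 = per-vector mask */
};

#define MSIX_VECTOR_CTRL_MASKBIT 0x1u

/* Mask or unmask a single vector of one device instance, leaving every
   other vector (and every other device) alone. */
static void
msix_set_masked(volatile struct msix_entry *table, unsigned vec, int masked)
{
    uint32_t ctrl = table[vec].vector_ctrl;

    if (masked)
        ctrl |= MSIX_VECTOR_CTRL_MASKBIT;
    else
        ctrl &= ~MSIX_VECTOR_CTRL_MASKBIT;

    table[vec].vector_ctrl = ctrl;
}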
On 2025-03-19 12:00 p.m., MitchAlsup1 wrote:
On Wed, 19 Mar 2025 15:20:20 +0000, Stefan Monnier wrote:
MitchAlsup1 [2025-03-17 21:49:18] wrote:
On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:[...]
Method 1 is simplest, it injects the interrupt request at Retire
as that's where the state of everything is synchronized.
Lowest interrupt Latency[...]
Highest waste of power (i.e., work)
Method 2 pipelines the switch by injecting the interrupt request at
Fetch.
Interrupt latency is dependent on executing instructions,
Lowest waste of power
What do existing CPUs do?
Lean in the direction of method 2 with a fallback position of method 1
if anything goes wrong.
Method 2 looks insanely messy, so I'd assume method 1 is the standard
approach, right?
It is "not that bad" on the large scale of things.
Undaunted he tries to implement the method 2 approach using an IRQ
victim queue for IRQs that should not be processed at retire. It is not
too bad to code, but I dunno if it works yet. Just a switch at fetch
time to select from the victim queue first, otherwise pop the normal IRQ
queue. The queue is small, assuming that IRQs are not disabled for great
lengths of time.
Stefan
In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
So that you can have some mechanism for building critical
sections that span multiple instructions, where just having
access to atomics and spin locks isn't enough.
When running at any privilege, do you not want to accept an interrupt
at higher priority with as little delay as possible ??
Usually the assumption is that the critical section is short; a
handful of instructions, perhaps, so that the cost of blocking a
high priority interrupt is low.
Does Interrupt Disable help in this goal ??
This is predicated on an incomplete view of the situation and
the intersection between hardware and software. Serving high
priority interrupts quickly is not the only consideration.
Is there a SANE way to avoid disabling of interrupts ??
It depends on your definition of "sane" in this context. One
may imagine redesigning all of a system's locking primitives,
perhaps, to avoid the need to block interrupts in most cases.
But that pushes a lot of complexity onto software for dubious
benefit, and even then, it's not clear to me that they can be
blocked in _all_ cases.
- Dan C.
Leaving aside the unchecked boundary conditions I highlighted
above, there are other questions that this raises. You
mentioned that the ESM stuff can use up to 8 cache lines in a
single atomic "event"; let's count how many this code may
encounter:
1. `from` may point to a cache line
2. `to` may point to a different cache line.
3. Does your architecture use a stack? What sort of alignment
criteria do you impose? At first blush, it seems to me that
the pointers `from_next`, `from_prev`, and `to_next` could be
on the stack and if so, will be on at least one cache line,
and possibly 2, if the call frame for `MoveElement` spans
more than one. Of course, it's impossible to say when this
is compiled, but maybe you require that the stack is always
aligned to a cache line boundary or something. Or maybe
these are all in registers and it's fine.
4. Depending on the size and alignment criteria of the
`Element` structure and the placement of the `next` and
`prev` elements in the struct (unknowable from this example),
`to->next` may be on another cache line.
So potentially this code _could_ touch 9 or 10 cache lines, more
than you said are supported for a single atomic event. What
happens in that case? Does the code just always return false?
If I understand this correctly, the `esmLOCKload`s are simply
loads, but they make sure that the compiler emits instructions
that tie into your atomic support framework. Earlier you wrote, "esmINTERFERENCE is a query on the state of the event and whether
a 2nd party has performed a memory action that kills the event."
Ok, I take this to mean _at the time of the store_, which
appears later in the program.
If it were simply a query at the time it was written in the code, then
it seems like it opens a TOCTOU bug (what if something interferes after
the check, but before the locked store at the end?).
Assuming nothing has poisoned the event, the various pointer
stores inside the conditional are conditionally performed; the
`esmLOCKstore` to `from->next` commits the event, at which
point all side effects are exposed and durable.
So now I, as a systems software person, have to ask design
question around how I want to use these architectural features,
if at all, weighing various pros and cons. This is where I'm
struggling with your proposal: given that I'm going to have to
implement more general support for the things that the
architecture gives me anyway, what do I gain by also supporting
the specialized mechanisms?
Right now, I'm not seeing that it is. "It solves priority
inversion for the size of problems that it's applicable to" is
not enough to justify the additional cost of adoption. I have
to solve that problem anyway, whether I use this or not.
Leaving aside the unchecked boundary conditions I highlighted
above, there are other questions that this raises. You
mentioned that the ESM stuff can use up to 8 cache lines in a
single atomic "event"; let's count how many this code may
encounter:
1. `from` may point to a cache line
2. `to` may point to a different cache line.
3. Does your architecture use a stack? What sort of alignment
criteria do you impose? At first blush, it seems to me that
the pointers `from_next`, `from_prev`, and `to_next` could be
on the stack and if so, will be on at least one cache line,
and possibly 2, if the call frame for `MoveElement` spans
more than one. Of course, it's impossible to say when this
is compiled, but maybe you require that the stack is always
aligned to a cache line boundary or something. Or maybe
these are all in registers and it's fine.
AFAIK the stack accesses will use normal loads that won't be tracked for
conflicts, so they don't count towards the 8 cache-lines limit.
4. Depending on the size and alignment criteria of the
`Element` structure and the placement of the `next` and
`prev` elements in the struct (unknowable from this example),
`to->next` may be on another cache line.
The coder may need to be careful with such slot placement and
alignment, indeed.
So potentially this code _could_ touch 9 or 10 cache lines, more
than you said are supported for a single atomic event. What
happens in that case? Does the code just always return false?
I believe the code will indeed always return false in that case.
If I understand this correctly, the `esmLOCKload`s are simply
loads, but they make sure that the compiler emits instructions
that tie into your atomic support framework. Earlier you wrote,
"esmINTERFERENCE is a query on the state of the event and whether
a 2nd party has performed a memory action that kills the event."
Ok, I take this to mean _at the time of the store_, which
appears later in the program.
No, I believe it is at the time when `esmINTERFERENCE` is called.
If it were simply a query at the time it was written in the code, then
it seems like it opens a TOCTOU bug (what if something interferes after
the check, but before the locked store at the end?).
IIUC the way this works is that in case there's a conflict later, the
other core will get a NACK (i.e. we'll win and they'll lose, even if
they have a higher priority) so we're guaranteed that we'll finish our
transaction successfully. Of course, that can be guaranteed only if we
already have "locked" the cache lines we'll need, which is what the
`esmLOCKprefetch` calls are about.
IOW the `esmINTERFERENCE` call is the time where the CPU decides whether
to "commit", and then the code between `esmINTERFERENCE` and
`esmLOCKstore` is the code which actually stores the changes we
decided to commit.
IIUC, the `esmINTERFERENCE` call is optional; if it is not used, the
operation should retry as many times as needed to succeed.
Assuming nothing has poisoned the event, the various pointer
stores inside the conditional are conditionally performed; the
`esmLOCKstore` to `from->next` commits the event, at which
point all side effects are exposed and durable.
The `esmLOCKstore` marks the end of the transaction. IIUC it can also
mark the "commit" but only if `esmINTERFERENCE` isn't used.
So now I, as a systems software person, have to ask design
question around how I want to use these architectural features,
if at all, weighing various pros and cons. This is where I'm
struggling with your proposal: given that I'm going to have to
implement more general support for the things that the
architecture gives me anyway, what do I gain by also supporting
the specialized mechanisms?
I think the expectation is that you'd *use* those specialized mechanisms
but you wouldn't *support* them.
Right now, I'm not seeing that it is. "It solves priority
inversion for the size of problems that it's applicable to" is
not enough to justify the additional cost of adoption. I have
to solve that problem anyway, whether I use this or not.
The ESM stuff is what you'll use at the assembly-level instead of T&S,
C&S, LL/SC because that's what the CPU supports/provides. But if you
prefer to code against T&S, C&S, LL/SC, AFAIK it'll work as well,
because you can "simulate" them easily with ESM.
uint32_t
xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
{
uint32_t v = esmLOCKload(&lk->lock);
if (v == expected) esmLOCKstore(&lk->lock, val);
return v;
}
Yes, unless you want the alternate exit strategy, this works fine.
Even so, I don't see the point of the proposed atomic framework.
It seems to have no significant advantage over a more
traditional scheme; CAS, LL-SC, whatever seem more general and
less intertwined with primitives that the hardware has no real
business injecting itself into (like threads).
AFAICT it's somewhat similar to ARM's LL/SC (which can also cover some
number of cache lines) except that it takes care of looping for you: if
the final store fails because of interference My 66000 will "simply" abort
the whole operation (reusing the speculation hardware) and start over,
instead of having to write an explicit loop.
Without actual hardware and software to try it's hard to assess how well
it's going to work, but the design sounds good: it should lead to more compact code than LL/SC, be more general/friendly to lock-free coding
than T&S/C&S, and also compared to LL/SC it should behave better under
high contention because it is careful to guarantee overall forward
progress (in case of conflict between two cores, at least one of them
can finish successfully).
IIRC there's also some kind of mechanism to retry more cleverly, i.e. so
that when N cores are in conflicts, not only one of them succeeds, but
the N-1 remaining cores have a fighting chance of avoiding bumping into
each other next time around. I don't know the details, tho, and IIUC
this is not automatic: some relevant info is provided but it's up to the
code to do something useful with it.
Stefan
Remember only the lines participating in the event count. For
example, say one had sprintf() inside the event. None of the
lines touched by sprintf() are participating in the event,
and are not counted against the 8 cache lines available.
In article <b1c74762d01f71cc1b8ac838dcf6d4fa@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
You (well Chris) can write critical sections where a low priority
holder can give way to the higher priority requestor such that
the low priority holder reverts back to before it entered CS,
and does not see the lock being stolen.
You still have not defined what you mean by, "the lock being
stolen."
Tell you what. Earlier, you posted some code for a critical
section that inserts a node into a linked list. That code is
written in terms of some primitives (presumably intrinsics,
though you call them macros); `esmLOCKload`, `esmLOCKprefetch`, `esmINTERFERENCE` and `esmLOCKstore`. Here's that code, slightly reformatted:
// Removes the element `from` from a doubly linked list, and
// inserts it into a (possibly different) doubly linked list,
// after the element pointed to by `to`.
// Assumes <stdbool.h>
bool
MoveElement(Element *from, Element *to)
{
assert(from != NULL);
assert(to != NULL);
Element *from_next = esmLOCKload(from->next);
Element *from_prev = esmLOCKload(from->prev);
Element *to_next = esmLOCKload(to->next);
esmLOCKprefetch(from_next);
esmLOCKprefetch(from_prev);
esmLOCKprefetch(to_next);
if (!esmINTERFERENCE()) {
// Unlink the "from" node from its list.
from_prev->next = from_next; // XXX: `from_prev` may be nil?
from_next->prev = from_prev; // XXX: `from_next` may be nil?
// Insert the node after "to".
to->next = from;
to_next->prev = from; // XXX: `to_next` may be nil?
from->prev = to;
esmLOCKstore(from->next, to_next);
return true;
}
return false;
}
Leaving aside the unchecked boundary conditions I highlighted
above, there are other questions that this raises. You
mentioned that the ESM stuff can use up to 8 cache lines in a
single atomic "event"; let's count how many this code may
encounter:
1. `from` may point to a cache line
2. `to` may point to a different cache line.
3. Does your architecture use a stack?
What sort of alignment criteria do you impose?
At first blush, it seems to me that
the pointers `from_next`, `from_prev`, and `to_next` could be
on the stack and if so, will be on at least one cache line,
and possibly 2, if the call frame for `MoveElement` spans
more than one.
Of course, it's impossible to say when this
is compiled, but maybe you require that the stack is always
aligned to a cache line boundary or something.
Or maybe
these are all in registers and it's fine.
4. Depending on the size and alignment criteria of the
`Element` structure and the placement of the `next` and
`prev` elements in the struct (unknowable from this example),
`to->next` may be on another cache line.
5. Similarly, `from->next`
6. And `from->prev`
7. And `from_prev->next`
8. And `from_next->prev`
9. And `to_next->prev`.
So potentially this code _could_ touch 9 or 10 cache lines, more
than you said are supported for a single atomic event. What
happens in that case? Does the code just always return false?
The programmer best make sure the optimizer is keeping those
temporary pointers off the stack; good luck with a debug build.
Anyway, let's assume that's not the case, and we don't hit the
pathological case with cache lines, and we're not dealing with
edge cases at the front or end of the lists (so various next or
prev pointers are not NULL). Why don't I try to explain what I
think this code is doing, and you tell me whether I'm right or
wrong.
If I understand this correctly, the `esmLOCKload`s are simply
loads, but they make sure that the compiler emits instructions
that tie into your atomic support framework.
Earlier you wrote, "esmINTERFERENCE is a query on the state of th event and whether
a 2nd party has performed a memory action that kills the event."
Ok, I take this to mean _at the time of the store_, which
appears later in the program. If it were simply a query at the
time it was written in the code, then it seems like it opens a
TOCTOU bug (what if something interferes after the check, but
before the locked store at the end?). Or perhaps it signals
to the architecture that this is the point to which the
processor should rollback, if anything subsequently fails; sort
of a `setjmp` kind of thing that conceptually returns twice.
Assuming nothing has poisoned the event, the various pointer
stores inside the conditional are conditionally performed; the
`esmLOCKstore` to `from->next` commits the event, at which
point all side effects are exposed and durable.
Is that correct? Let's suppose that it is.
But later you wrote,
So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
normal ATOMIC primitives everyone and his brother provide.
Well, what precisely do you mean here when you say, "if you want
the property whereby the lock disappears on any control transfer
out of the event"?
What I _hope_ you mean is that if you transfer "out of the
event" before the `esmLOCKstore` then anything that's been done
since the "event" started is rolled back, but NOT if control
transfer happens _after_ the `esmLOCKstore`. If THAT were the
case, then this entire mechanism is useless.
Perhaps you would clarify?
But for the time being, let's assume that `esmLOCKstore` is
conceptually a commit that ends the atomic event, and the store
is durable and observable after that. Let's see if we can
construct a simple spinlock from these primitives.
typedef struct Spinlock Spinlock;
struct Spinlock {
uint32_t lock;
};
uint32_t
xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
{
uint32_t v = esmLOCKload(&lk->lock);
if (v == expected && !esmINTERFERENCE()) {
esmLOCKstore(&lk->lock, val);
}
return v;
}
void
spin_acquire(volatile Spinlock *lk)
{
splhi();
while (xcas(lk, 0, 1) != 0)
relax(); // Spin loop intrinsic.
}
void
spin_release(volatile Spinlock *lk)
{
xcas(lk, 1, 0);
// Maybe we should `spllo` here, maybe not. Either choice
// reflects lots of assumptions about whether or not we might
// take and hold multiple locks concurrently, etc.
spllo();
}
Does this seem correct? It is not immediately clear to me
whether `esmINTERFERENCE` is actually needed here, but
avoiding the store seemed fine.
One could imagine shortening
`xcas` to simply:
uint32_t
xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
{
uint32_t v = esmLOCKload(&lk->lock);
if (v == expected) esmLOCKstore(&lk->lock, val);
return v;
}
or perhaps:
uint32_t
xswap(volatile Spinlock *lk, uint32_t val)
{
uint32_t v = esmLOCKload(&lk->lock);
esmLOCKstore(&lk->lock, val);
return v;
}
But that all depends on the specifics of the atomics.
For the sizes of CSs this covers--it prevents priority inversion.
Except you've admitted that this is not a general solution, as
it cannot "cover" arbitrary software-defined critical sections:
the window isn't big enough and we've seen how even a fairly
small problem may exceed the number of cache lines supported by
one of these events.
So now I, as a systems software person, have to ask design
question around how I want to use these architectural features,
if at all, weighing various pros and cons. This is where I'm
struggling with your proposal: given that I'm going to have to
implement more general support for the things that the
architecture gives me anyway, what do I gain by also supporting
the specialized mechanisms?
And here I thought you had an imagination .....
You admitted you're not an OS person and asked other people to
poke holes in your proposal. I'm sorry if you find the holes
I'm poking upsetting.
On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
In article <b1c74762d01f71cc1b8ac838dcf6d4fa@www.novabbs.org>,
Earlier you wrote,
"esmINTERFERENCE is a query on the state of th event and whether
a 2nd party has performed a memory action that kills the event."
Ok, I take this to mean _at the time of the store_, which
appears later in the program. If it were simply a query at the
time it was written in the code, then it seems like it opens a
TOCTOU bug (what if something interferes after the check, but
before the locked store at the end?). Or perhaps it signals
to the architecture that this is the point to which the
processor should rollback, if anything subsequently fails; sort
of a `setjmp` kind of thing that conceptually returns twice.
The "if( esmINTERFERENCE() )" not only queries the state of the
participating cache lines, it sets up a guarding-shadow over
the STs to follow such that if killing-interference is detected
none of the STs are performed AND control is transferred to the
branch label (outside of the event). So not only does it query
the state at its execution, it monitors that over the rest of
the event.
BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
is that if we have reached the manifestation phase of the event
(that is ready to do all the stores) then we should be able to
drive the event forward by not sending the data in response to
the SNOOP. We send a Nak (based on priority or the data)
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
In article <b1c74762d01f71cc1b8ac838dcf6d4fa@www.novabbs.org>,
Earlier you wrote,
"esmINTERFERENCE is a query on the state of th event and whether
a 2nd party has performed a memory action that kills the event."
Ok, I take this to mean _at the time of the store_, which
appears later in the program. If it were simply a query at the
time it was written in the code, then it seems like it opens a
TOCTOU bug (what if something interferes after the check, but
before the locked store at the end?). Or perhaps it signals
to the architecture that this is the point to which the
processor should rollback, if anything subsequently fails; sort
of a `setjmp` kind of thing that conceptually returns twice.
The "if( esmINTERFERENCE() )" not only queries the state of the >>participating cache lines, it sets up a guarding-shadow over
the STs to follow such that if killing-interference is detected
none of the STs are performed AND control is transferred to the
branch label (outside of the event). So not only does it query
the state at its execution, it monitors that over the rest of
the event.
BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
is that if we have reached the manifestation phase of the event
(that is ready to do all the stores) then we should be able to
drive the event forward by not sending the data in response to
the SNOOP. We send a Nak (based on priority or the data)
If you can NAK a snoop, are you then assuming that all DMA is
non-coherent? That will really piss off the kernel folks.
Remember only the lines participating in the event count. For
example, say one had sprintf() inside the event. None of the
lines touched by sprintf() are participating in the event,
and are not counted against the 8 cache lines available.
Hmm... so if I need to add a "participating line" to which I need to
store but that I don't need to read, I still need to do a "useless" `esmLOCKload` before the (normal) store, just to mark the line as participating in the event?
I guess it's a rather rare need.
Stefan
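If that rare case did come up, the pattern would presumably look like the
sketch below (esm* names as used in this thread, pointer-argument form as
in the xcas example; whether an otherwise-unused load really is required to
make a store-only line participate is exactly the open question here).

#include <stdint.h>

/* Publish `value` into *out as part of the same atomic event that bumps
   *flag, even though *out is never read inside the event. The otherwise
   "useless" esmLOCKload of *out only marks that line as participating. */
void
publish(volatile uint32_t *flag, volatile uint32_t *out, uint32_t value)
{
    uint32_t f = esmLOCKload(flag);
    (void)esmLOCKload(out);          /* mark the store-only line */

    *out = value;                    /* store inside the event */
    esmLOCKstore(flag, f + 1);       /* commit */
}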
On Wed, 26 Mar 2025 21:35:01 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
In article <b1c74762d01f71cc1b8ac838dcf6d4fa@www.novabbs.org>,
Earlier you wrote,
"esmINTERFERENCE is a query on the state of th event and whether
a 2nd party has performed a memory action that kills the event."
Ok, I take this to mean _at the time of the store_, which
appears later in the program. If it were simply a query at the
time it was written in the code, then it seems like it opens a
TOCTOU bug (what if something interferes after the check, but
before the locked store at the end?). Or perhaps it signals
to the architecture that this is the point to which the
processor should rollback, if anything subsequently fails; sort
of a `setjmp` kind of thing that conceptually returns twice.
The "if( esmINTERFERENCE() )" not only queries the state of the >>>participating cache lines, it sets up a guarding-shadow over
the STs to follow such that if killing-interference is detected
none of the STs are performed AND control is transferred to the
branch label (outside of the event). So not only does it query
the state at its execution, it monitors that over the rest of
the event.
BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
is that if we have reached the manifestation phase of the event
(that is ready to do all the stores) then we should be able to
drive the event forward by not sending the data in response to
the SNOOP. We send a Nak (based on priority or the data)
If you can NAK a snoop, are you then assuming that all DMA is
non-coherent? That will really piss off the kernel folks.
Not at all. DMA is as coherent as the I/O MMU's translating
PTE prescribes.
But; why would any sane coder be performing an ATOMIC event on
memory that can be blindly overwritten by DMA ??? Or for that
matter, critical section data being overwritten (at some unknown
and unpredictable time) by DMA ??
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 26 Mar 2025 21:35:01 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
In article <b1c74762d01f71cc1b8ac838dcf6d4fa@www.novabbs.org>,
Earlier you wrote,
"esmINTERFERENCE is a query on the state of th event and whether
a 2nd party has performed a memory action that kills the event."
Ok, I take this to mean _at the time of the store_, which
appears later in the program. If it were simply a query at the
time it was written in the code, then it seems like it opens a
TOCTOU bug (what if something interferes after the check, but
before the locked store at the end?). Or perhaps it signals
to the architecture that this is the point to which the
processor should rollback, if anything subsequently fails; sort
of a `setjmp` kind of thing that conceptually returns twice.
The "if( esmINTERFERENCE() )" not only queries the state of the >>>>participating cache lines, it sets up a guarding-shadow over
the STs to follow such that if killing-interference is detected
none of the STs are performed AND control is transferred to the
branch label (outside of the event). So not only does it query
the state at its execution, it monitors that over the rest of
the event.
BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
is that if we have reached the manifestation phase of the event
(that is ready to do all the stores) then we should be able to
drive the event forward by not sending the data in response to
the SNOOP. We send a Nak (based on priority or the data)
If you can NAK a snoop, are you then assuming that all DMA is
non-coherent? That will really piss off the kernel folks.
Not at all. DMA is as coherent as the I/O MMU's translating
PTE prescribes.
But; why would any sane coder be performing an ATOMIC event on
memory that can be blindly overwritten by DMA ??? Or for that
matter, critical section data being overwritten (at some unknown
and unpredictable time) by DMA ??
Polling on a completion status which the device DMAs
(often with an allocate hint) is not uncommon. A 32-bit
word in memory updated by the device - the allocate hint
might update the LLC directly and invalidate L1/L2 or
may update the target cache line that the driver is
polling directly.
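A hedged sketch of that polling pattern (names and protocol invented): the
device DMAs a status word into ordinary memory and the driver spins on it;
the load wants acquire semantics so the rest of the DMA'd buffer is not
read before the status word says it is valid.

#include <stdatomic.h>
#include <stdint.h>

#define STATUS_DONE 1u

/* `status` is a word in DMA-able memory that the device writes when the
   associated buffer is complete. */
int
wait_for_completion(_Atomic uint32_t *status, unsigned max_spins)
{
    for (unsigned i = 0; i < max_spins; i++) {
        /* Acquire: everything the device wrote before the status word
           must be visible before the caller touches the buffer. */
        if (atomic_load_explicit(status, memory_order_acquire) == STATUS_DONE)
            return 1;
    }
    return 0;   /* timed out; caller may fall back to interrupts */
}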
a going to make a big difference here. But I also wonder at the
internal implementation of these instructions in e.g. that
Fujitsu chip that allows throughput to stay more or less
constant as the number of cores increases; I'd think that cache
coherency traffic as threads contend for a single cache line
would create a bottleneck leading to (at least) a linear
slow down.
IIUC the idea is that the atomic-add is sent to the memory and performed there, instead of moving the data to the cache. So there's no
contention for the cache line because the data is not in a cache at all.
Planning on having the Q+ CPU respond directly to MSI interrupts. My
thought is to integrate an interrupt controller directly into the CPU
which will have several queues to prioritize interrupts. The controller
will take care of detecting the highest priority interrupt to process, performing priority inversion occasionally, and detect stuck interrupts.
Setting up an MSI interrupt protocol for Q+.
The controller will sit on the CPU's main response collect bus.
Interrupt messages contain the interrupter's bus/device/function + a
cause code. They may also contain a target core number.
Planning on distributing the interrupt control to peripheral devices
instead of having a system level interrupt controller.
On 2025-03-13 12:34 p.m., MitchAlsup1 wrote:
On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:
Planning on having the Q+ CPU respond directly to MSI interrupts. My
thought is to integrate an interrupt controller directly into the CPU
which will have several queues to prioritize interrupts. The controller
will take care of detecting the highest priority interrupt to process,
performing priority inversion occasionally, and detect stuck interrupts.
My 66000 went all the way to MSI-X interrupts as its base.
There are mapping methods to map INTx and INTk into MSI-X messages.
MSI->MSI-X mapping is "just zero extend the message".
MSI interrupts are the way to go when there are transistors available.
Just having discovered them not too long ago, I still think they are great.
I have the luxury of not needing to support legacy interrupt systems
(yet), all the peripheral cores needing interrupts will have to support
MSI responses.
I have read that MSI devices target particular I/O addresses to send the
messages to. Not sure what having a dedicated I/O address buys. In my
architecture, rather than target an I/O address a specific processing
The Interrupt port lives in the memory controller and takes an
aperture of addresses and maps them to various interrupt tables.
core number is specified.
My architecture goes with a 3-bit priority field, and an 8-bit field to
help resolve the cause of the interrupt. The bus/device/function is also
part of the interrupt response.
On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:
Planning on having the Q+ CPU respond directly to MSI interrupts. My
thought is to integrate an interrupt controller directly into the CPU
which will have several queues to prioritize interrupts. The controller
will take care of detecting the highest priority interrupt to process,
performing priority inversion occasionally, and detect stuck interrupts.
My 66000 went all the way to MSI-X interrupts as its base.
There are mapping methods to map INTx and INTk into MSI-X messages.
MSI->MSI-X mapping is "just zero extend the message".
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:
Planning on having the Q+ CPU respond directly to MSI interrupts. My
thought is to integrate an interrupt controller directly into the CPU
which will have several queues to prioritize interrupts. The controller
will take care of detecting the highest priority interrupt to process,
performing priority inversion occasionally, and detect stuck interrupts.
My 66000 went all the way to MSI-X interrupts as its base.
There are mapping methods to map INTx and INTk into MSI-X messages.
MSI->MSI-X mapping is "just zero extend the message".
In both cases, a 32-bit data payload (the MSI-X vector #) is written
to the MSI-X target address (typically the interrupt controller).
The difference between them is:

  MSI                               MSI-X
  Supports 1 to 32 _contiguous_     Supports an unlimited number of
  data payloads (vector index).     vectors, each with a unique
                                    non-contiguous vector index (data
                                    payload) and target address.
With MSI, only the base interrupt number (data payload) is
defined and only one interrupt controller target address is
used for all vectors. So vector 0 will send the base number, vector
1 will send the base number plus one, etc.
With MSI-X each vector has an assigned data payload (vector
number) which do not need to be contiguous or monotonically
increasing. With MSI-X, each interrupt can have a different
target address (i.e. perhaps a secure interrupt controller
for some, and a non-secure interrupt controller for others)
In the ARM case, there are different interrupt types (SGI,
PPI, SPI, LPI) and each one has a different interrupt controller
address that the MSI-X vector address should point to.
This
allows the message-based emulation of level sensitive interrupts
by using two vectors, one to assert the interrupt and a second
to deassert the interrupt at the interrupt controller.
So MSI is useless for modern devices and basically deprecated
and cannot be used to support level sensitive (e.g. legacy PCI)
interrupts.
Most modern devices advertise the MSI-X capability instead.
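The contiguous-versus-per-vector distinction above can be summarized in a
few lines of C (a sketch only; real devices form these writes in hardware,
and memory_write32 here is just a stand-in for the posted memory write).

#include <stdint.h>

struct msi_capability {     /* simplified MSI view */
    uint64_t addr;          /* one target address for all vectors */
    uint16_t base_data;     /* base payload; vectors are contiguous */
};

struct msix_vector {        /* simplified MSI-X table entry */
    uint64_t addr;          /* per-vector target address */
    uint32_t data;          /* per-vector payload, need not be contiguous */
};

static void
memory_write32(uint64_t addr, uint32_t data)
{
    (void)addr; (void)data; /* stand-in for the posted write to the target */
}

void
msi_fire(const struct msi_capability *c, unsigned k)
{
    memory_write32(c->addr, (uint32_t)(c->base_data + k));   /* base + k */
}

void
msix_fire(const struct msix_vector *table, unsigned k)
{
    memory_write32(table[k].addr, table[k].data);             /* per vector */
}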
On 2025-03-13 12:34 p.m., MitchAlsup1 wrote:
On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:
Planning on having the Q+ CPU respond directly to MSI interrupts. My
thought is to integrate an interrupt controller directly into the CPU
which will have several queues to prioritize interrupts. The controller
will take care of detecting the highest priority interrupt to process,
performing priority inversion occasionally, and detect stuck interrupts.
My 66000 went all the way to MSI-X interrupts as its base.
There are mapping methods to map INTx and INTk into MSI-X messages.
MSI->MSI-X mapping is "just zero extend the message".
MSI interrupts are the way to go when there are transistors available.
Just having discovered them not too long ago, I still think they are
great.
I have the luxury of not needing to support legacy interrupt systems
(yet), all the peripheral cores needing interrupts will have to support
MSI responses.
I have read that MSI devices target particular I/O addresses to send the
messages to. Not sure what having a dedicated I/O address buys. In my
architecture, rather than target an I/O address a specific processing
The Interrupt port lives in the memory controller and takes an
aperture of addresses and maps them to various interrupt tables.
core number is specified.
There is a broadcast mechanism that notifies cores of new higher
priority level interrupts pending, and a protocol for getting
an interrupt and notifying the IC that it will be handled by a
core.
Along with the 32-bit message, the architecture provides for
6-bits of priority, 2-bits of SW-stack indexing, and an 8-bit
ISR handler index number.
My architecture goes with a 3-bit priority field, and an 8-bit field to
help resolve the cause of the interrupt.
The bus/device/function is also
PCIe
part of the interrupt response. A 6-bit target core number field is
included in the response too. A target core number of all ones is
treated as a broadcast to all cores. The entire interrupt response
message is available to the CPU so software can decide how to handle the
interrupt. There is some potential for data to be passed back. The
contents of the data field passed back on the bus is unspecified. The
other info is passed back in the address field of the bus.
Shucks.
Interrupt tables remain active even when the virtual machine
is inactive. When next scheduled, priority ISRs will be serviced
prior to low priority thread instructions.
{And some things I can't talk about}
Setting up an MSI interrupt protocol for Q+.
The controller will sit on the CPU's main response collect bus.
Interrupt messages contain the interrupter's bus/device/function + a
cause code. They may also contain a target core number.
Planning on distributing the interrupt control to peripheral devices
instead of having a system level interrupt controller.
I sketched up an interrupt controller in SystemVerilog and it was only
about 700 LUTs although it does use seven BRAMs for priority queues.
The priority queues can hold hundreds of interrupt messages which is
likely far more than would be needed. So if a queue overflows it would indicate an issue with the interrupt system. The controller also keeps a message history with timestamps. If the same message is present at the
start and end of the history along with a short difference in the
timestamp, it is considered a stuck interrupt.
The controller inverts interrupt priorities for one cycle every 64 clock cycles. The priority inversion is with respect to which queue is read
from. The idea is to try and prevent a low-priority task that is waiting
on an interrupt from being permanently disabled.
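A sketch of that selection policy in C (purely illustrative; the queue
depth and helpers are assumptions, only the one-inverted-cycle-in-64 idea
is taken from the description above): most selections scan from the
highest-priority non-empty queue down, but every 64th selection scans
upward so a starved low-priority message eventually gets through.

#include <stdbool.h>
#include <stdint.h>

#define NQUEUES 8
#define QDEPTH  16

struct irq_queue {
    uint32_t msg[QDEPTH];
    unsigned head, tail;               /* simple ring buffer */
};

static bool queue_empty(const struct irq_queue *q) { return q->head == q->tail; }
static uint32_t queue_pop(struct irq_queue *q) { return q->msg[q->head++ % QDEPTH]; }

bool
select_irq(struct irq_queue q[NQUEUES], uint32_t *msg)
{
    static uint32_t tick;
    bool invert = (++tick % 64) == 0;  /* one inverted selection in 64 */

    for (int i = 0; i < NQUEUES; i++) {
        int idx = invert ? i : (NQUEUES - 1 - i);
        if (!queue_empty(&q[idx])) {
            *msg = queue_pop(&q[idx]);
            return true;
        }
    }
    return false;
}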
On 2025-03-13 2:42 p.m., Scott Lurndal wrote:
Robert Finch <robfi680@gmail.com> writes:
On 2025-03-13 12:34 p.m., MitchAlsup1 wrote:
On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:
Planning on having the Q+ CPU respond directly to MSI interrupts. My
thought is to integrate an interrupt controller directly into the CPU
which will have several queues to prioritize interrupts. The controller
will take care of detecting the highest priority interrupt to process,
performing priority inversion occasionally, and detect stuck interrupts.
My 66000 went all the way to MSI-X interrupts as its base.
There are mapping methods to map INTx and INTk into MSI-X messages.
MSI->MSI-X mapping is "just zero extend the message".
MSI interrupts are the way to go when there are transistors available.
Just having discovered them not too long ago, I still think they are
great.
I have the luxury of not needing to support legacy interrupt systems
(yet), all the peripheral cores needing interrupts will have to support
MSI responses.
MSI has been deprecated for years. MSI-X is preferred and used by
most modern devices (and supported by major operating software).
I have read that MSI devices target particular I/O addresses to send the
messages to. Not sure what having a dedicated I/O address buys. In my
The Interrupt port lives in the memory controller and takes an
aperture of addresses and maps them to various interrupt tables.
architecture, rather than target an I/O address a specific processing
core number is specified.
There are many reasons that MSI-X supports vector specific target
addresses.
1) It is possible to target DRAM (so software can poll for the
interrupt)
This can be done using a dummy core in the system that echoes interrupt
messages back to the DRAM. A current issue with the way interrupts in my
system work is that they always spit out the interrupt on the
response bus, which means they cannot act as the bus master.
To drive messages onto the bus as requests instead of responses, every
peripheral device would need to be able to be a bus master. It would add
a lot of logic to the system. Extra cross-bar ports etc.
2) It is possible to support level-sensitive semantics by using two
vectors;
one to assert the interrupt and the second to deassert it. Each
needs
to target a different interrupt controller register (address).
I think this can be handled with the cause code being returned as part
of the response.
3) A system may have multiple interrupt controllers for different
security states (e.g. monitor/secure vs. hypervisor vs. direct
virtual
to a guest).
Is this just to get more target vectors for each interrupt? Could there
be one interrupt controller with multiple target vectors depending on
the security state?
Q+ CPU always switches to machine mode to process interrupts. It can
then redirect to other operating modes. A security state code could be
added to the response so that different vector tables may be used.
My architecture goes with a 3-bit priority field, and an 8-bit field to
help resolve the cause of the interrupt. The bus/device/function is also
part of the interrupt response.
That's what ARM64 GIC does, each interrupt is accompanied by a 20+ bit
'device id', which is composed of the segment, bus, dev/func values
for the sending device. Since each PCIe controller can consume the
entire bus space potentially, the segment widens the device id space
to include a large number of devices. The 'device id' is used to
select the in-memory table used to map the interrupt to the target
guest and the target guest vector number.
Seems like going out to memory tables to fetch the targets of an
interrupt would be costly, perhaps slow. In the SoC there are already a
half dozen devices fighting for access to the DRAM.
I think you really want an interrupt controller as an intermediary
between the device and the CPU interrupt signal, allows a great deal
of implementation flexibility (e.g. changing the target CPU in the
Absolutely. Be crazy not to. There is an irq controller component
separate from the CPU. (I decided to move the controller outside of the
CPU).
interrupt controller tables rather than requiring the device driver
to change the msix-vector address field).
Updating I/O is faster than updating DRAM tables.
On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:
Most modern devices advertise the MSI-X capability instead.
And why not:: it's just a few more flip-flops and almost no more
sequencing logic.
Updating I/O is faster than updating DRAM tables.
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:
Most modern devices advertise the MSI-X capability instead.
And why not:: it's just a few more flip-flops and almost no more
sequencing logic.
I've seen several devices with more than 200 MSI-X vectors;
thats 96 bits per vector to store a full 64-bit address and
32-bit data payload.
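For reference, one such per-vector entry might be modelled as below; the
96 bits are the address and data mentioned above, and the PCIe-defined
table entry also carries a vector-control word holding the per-vector
mask bit:

#include <stdint.h>

/* One MSI-X table entry as held on the device. */
typedef struct {
    uint32_t msg_addr_lo;   /* low half of the 64-bit message address  */
    uint32_t msg_addr_hi;   /* high half of the message address        */
    uint32_t msg_data;      /* 32-bit data payload written to the addr */
    uint32_t vector_ctrl;   /* bit 0 = per-vector mask                 */
} MsixTableEntry;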
Keep in mind that a guest operating system may be writing
the 64-bit address field in a virtual function assigned to
it with guest virtual addresses, so the inbound path for
MSI-X needs to traverse the IOMMU before hitting the
interrupt controller logic.
The guest OS also specifies
the data payload, which may be an IRQ number on an intel
system, or a virtual IRQ number translated by the IOMMU
or Interrupt controller into a physical IRQ number (allowing
multiple guest OS to use the same IRQs they would use on real
hardware).
I need to study more on the use of interrupt tables.
On Thu, 13 Mar 2025 19:59:02 +0000, Robert Finch wrote:
Is this just to get more target vectors for each interrupt? Could there
be one interrupt controller with multiple target vectors depending on
the security state?
Take an AHCI SATA drive that has multiple GuestOSs and several
HyperVisors using various functions therein.
I think you really want an interrupt controller as an intermediary
between the device and the CPU interrupt signal, allows a great deal
of implementation flexibility (e.g. changing the target CPU in the
Wrong abstraction::
You want an interrupt service device that services interrupts
(logs them into tables) even when no core (Virtual Machine or
Virtual Machine Monitor) knows anything about said interrupt
table. That is, an interrupt to a table is serviced even when
there is no recipient at this instant in time. This prevents
loss of interrupts the table is supposed to accumulate.
Only when at least 1 core is processing VM[k] or VMM[j]
associated with that interrupt table are cores running
VM[k] or VMM[j] concerned with any random interrupt from
that table.
----------------
Connecting the missing dots can be done several ways; My
66000 does::
My 66000 has a service port (in the memory controller) which
services the MMI/O requests to the interrupt aperture. I (currently)
call this the interrupt service port (rather than interrupt
controller) because it just services messages to the interrupt
aperture with almost no regard to any consumer. Said consumer
regard is via a broadcast MMI/O message using the DRAM address
of the interrupt table with a 64-bit message of raised interrupts.
interrupt controller tables rather than requiring the device driver
to change the msix-vector address field).
The MSI-X address is (nominally) set once just after GuestOS device
discovery, and becomes a virtual constant pointer afterwards.
Updating I/O is faster than updating DRAM tables.
Expand:: is this the distinction between flip-flops and a RAM
buried inside a pre-defined pin-boundary ??
On Thu, 13 Mar 2025 21:14:08 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:
Most modern devices advertise the MSI-X capability instead.
And why not:: it's just a few more flip-flops and almost no more
sequencing logic.
I've seen several devices with more than 200 MSI-X vectors;
thats 96 bits per vector to store a full 64-bit address and
32-bit data payload.
At this point, with 200+ entries, flip-flops are not recommended;
instead these would be placed in a RAM of some sort. Since RAMs
come in 1KB and 2KB quanta, we have 1 of 2K and 1 of 1K, and we
have 256 such message containers, with 1-cycle access (after
you get to that corner of some chip).
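(As a sanity check on those numbers: 256 vectors x 96 bits = 24,576 bits
= 3,072 bytes, which is exactly the 2 KB + 1 KB of RAM mentioned.)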
Keep in mind that a guest operating system may be writing
the 64-bit address field in a virtual function assigned to
it with guest virtual addresses, so the inbound path for
MSI-X needs to traverse the IOMMU before hitting the
interrupt controller logic.
An important concept that is hard to find in the PCIe specifications.
The guest OS also specifies
the data payload, which may be an IRQ number on an intel
system, or a virtual IRQ number translated by the IOMMU
or Interrupt controller into a physical IRQ number (allowing
multiple guest OS to use the same IRQs they would use on real
hardware).
Just note that GuestOS[k].IRQ[j] uses the same bit patterns
at the device, they get delivered to their own unique
GuestOS through its interrupt table.
But even when GuestOS[k] and GuestOS[j] use the same 96-bit MSI-X
pattern, each GuestOS can interpret its message the way it wants,
sort its drivers in the way it wants, and operate with blissful
ignorance of GuestOS[other].
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 13 Mar 2025 21:14:08 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:
Most modern devices advertise the MSI-X capability instead.
And why not:: it's just a few more flip-flops and almost no more
sequencing logic.
I've seen several devices with more than 200 MSI-X vectors;
thats 96 bits per vector to store a full 64-bit address and
32-bit data payload.
At this point, with 200+ entries, flip-flops are not recommended,
instead these would be placed in a RAM of some sort. Since RAMs
come in 1KB and 2KB quanta; we have 1 of 2K and 1 of 1K and we
have 256 said message containers, with 1 cycle access (after
you get to that corner of some chip).
That is 200 _per function_. Consider a physical function that
supports the SRIOV capability and configures 2048 virtual
functions. So that's 2049 * 200 MSI-X vectors just for one
device. It's not unusual. These vectors are, of course,
stored on the device itself. Mostly in RAMs.
Keep in mind that a guest operating system may be writing
the 64-bit address field in a virtual function assigned to
it with guest virtual addresses, so the inbound path for
MSI-X needs to traverse the IOMMU before hitting the
interrupt controller logic.
An important concept that is hard to find in the PCIe specifications.
It's more of a system-level thing than a PCI thing, so I
wouldn't expect to find it in the PCI specification. System
level standards (like ARM's SBSA, the server base system architecture)
cover these types of topics.
The guest OS also specifies
the data payload, which may be an IRQ number on an intel
system, or a virtual IRQ number translated by the IOMMU
or Interrupt controller into a physical IRQ number (allowing
multiple guest OS to use the same IRQs they would use on real
hardware).
Just note that GuestOS[k].IRQ[j] uses the same bit patterns
at the device, they get delivered to their own unique
GuestOS through its interrupt table.
But even when GuestOS[k] and GuestOS[j] use the same 96-bit MSI-X
pattern, each GuestOS can interpret its message the way it wants,
sort its drivers in the way it wants, and operate with blissful
ignorance of GuestOS[other].
Linux, for example, knows that the format of the data payload
on an MSI-X message is an unformatted 32-bit integer. Limiting
that or defining a hard format is a non-starter, as the kernel
software folks will just point and laugh. They really don't
like accommodating new architectures that play fast and loose
with the standard behavior. DAMHIKT.
Updating I/O is faster than updating DRAM tables.
What do you mean by DRAM tables?
On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:
In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
So that you can have some mechanism for building critical
sections that span multiple instructions, where just having
access to atomics and spin locks isn't enough.
For what kinds of activities are these critical sections ??
For example, I can schedule a DPC/softIRQ without using a critical
section (and so can even user threads without privilege, given
the MMU allows). So, we need to dive through the "rest of the
available mechanics" -- keyword available !
There are mechanisms available that readers may not be aware of
or quite grasp how a feature can be used.
So it comes back to:: For what kinds of activities are these
critical sections ?? We know about DPC/softIRQs, we know
about atomic events around core scheduling, and inter-{thread,
process, core} communication. Others ?
When running at any privilege, do you not want to accept an interrupt
at higher priority with as little delay as possible ??
Usually the assumption is that the critical section is short; a
handful of instructions, perhaps, so that the cost of blocking a
high priority interrupt is low.
I think you are talking about critical sections that are not allowed
to fail or be interfered with:: Yes ?
On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:
In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
So that you can have some mechanism for building critical
sections that span multiple instructions, where just having
access to atomics and spin locks isn't enough.
You can do this with priority, too.
In article <af2d54a7c6c694bf18bcca6e6876a72b@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:
In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
So that you can have some mechanism for building critical
sections that span multiple instructions, where just having
access to atomics and spin locks isn't enough.
You can do this with priority, too.
In the limit, disabling interrupts is just setting the priority
mask to infinity, so in some sense this is de facto true. But I
don't see how having a prioritization scheme is all that useful.
Consider that when you're starting the system up, you need to
do things like set up an interrupt handler vector, and so on;
surely you want to block all interrupts until you have done so.
If your system model is that receipt of an interrupt may result
in a context switch, then similarly you don't want that to happen
while holding a mutex.
Sometimes you really don't want to be interrupted.
- Dan C.
On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
In article <af2d54a7c6c694bf18bcca6e6876a72b@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:
In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
So that you can have some mechanism for building critical
sections that span multiple instructions, where just having
access to atomics and spin locks isn't enough.
You can do this with priority, too.
In the limit, disabling interrupts is just setting the priority
mask to infinity, so in some sense this is de facto true. But I
don't see how having a prioritization scheme is all that useful.
Consider that when you're starting the system up, you need to
do things like set up an interrupt handler vector, and so on;
The Interrupt vector is completely software. HW delivers the MSI-X
message and the service for which this message is for, and SW decides
what to do with this message to that service provider. No HW vectoring
but HW does deliver all the bits needed to guide SW to its goal.
surely you want to block all interrupts until you have done so.
I did not say I don't HAVE an interrupt disable, I state that one
receives control in an already reentrant state, so the typical
uses of ID are not in play--and I want to talk about the other
uses of ID. You are doing a fine job.
If your system model is that receipt of an interrupt may result
in a context switch, and you don't want to do that while holding
a mutex similarly.
Given the privilege to set priority, setting of priority is that
same 1 cycle event that setting or clearing ID would be.
Sometimes you really don't want to be interrupted.
And sometimes you don't want to be interrupted unless the
"house is on fire"; I cannot see a time when "the house is
on fire" that you would not want to take the interrupt.
Is there one ?!?
In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
In article <af2d54a7c6c694bf18bcca6e6876a72b@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:
In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
I want to address the elephant in the room::
Why disable interrupts AT ALL !!
So that you can have some mechanism for building critical
sections that span multiple instructions, where just having
access to atomics and spin locks isn't enough.
You can do this with priority, too.
In the limit, disabling interrupts is just setting the priority
mask to infinity, so in some sense this is de facto true. But I
don't see how having a prioritization scheme is all that useful.
Consider that when you're starting the system up, you need to
do things like set up an interrupt handler vector, and so on;
The Interrupt vector is completely software. HW delivers the MSI-X
message and the service for which this message is for, and SW decides
what to do with this message to that service provider. No HW vectoring
but HW does deliver all the bits needed to guide SW to its goal.
Maybe I'm missing the context here, and apologies if I am. But
which interrupt disable are we talking about here, and where? I
am coming at this from the perspective of systems software
running on some CPU core somewhere; if I want that software to
enter a critical section that cannot be interrupted, then it
seems to me that MSI-X delivery is just a detail at that point.
I might want to prevent IPIs, or delivery or local interrupts
attached directly to the CPU (or something like an IOAPIC or
GIC distributor).
In any event, if software isn't prepared to receive an interrupt
then there must necessarily be some mechanism to disable
interrupt delivery until it is.
surely you want to block all interrupts until you have done so.
I did not say I don't HAVE an interrupt disable, I state that one
receives control in an already reentrant state, so the typical
uses of ID are not in play--and I want to talk about the other
uses of ID. You are doing a fine job.
I suppose what I'm saying in part is that the "typical users"
all depend on the system's software architecture. The ability
to handle nested interrupts at the hardware level is all well
and good, but doesn't matter if software can't handle it.
If your system model is that receipt of an interrupt may result
in a context switch, and you don't want to do that while holding
a mutex similarly.
Given the privilege to set priority, setting of priority is that
same 1 cycle event that setting or clearing ID would be.
I guess I don't see how that is relevant. I mean, again, I
suppose you could say, "set the priority mask to infinity to de
facto disable interrupts", but that seems like a distinction
without a difference.
Sometimes you really don't want to be interrupted.
And sometimes you don't want to be interrupted unless the
"house is on fire"; I cannot see a time when "the house is
on fire" that you would not want to take the interrupt.
Is there one ?!?
Consider a thread that takes a spinlock; suppose some
high-priority interrupt comes in while the thread is holding
that lock. In response to the interrupt, software decides to
suspend the thread and switch some other thread; that thread
wants to lock the spin lock that the now-descheduled thread is
holding: a classic deadlock scenario.
A valid response here might be, "don't context switch from the
interrupt handler; use a DPC instead". That may be valid, but
it puts a constraint on the software designer that may be onerous
in practice: suppose the interrupt happens in the scheduler,
while examining a run queue or something. A DPC object must be
available, etc.
Further, software must now consider the complexity of
potentially interruptable critical sections. From the
standpoint of reasoning about already-complex concurrency issues
it's simpler to be able to assert that (almost) all interrupt
delivery can be cheaply disabled entirely, save for very
special, specific, scenarios like NMIs. Potentially switching
away from a thread holding a spinlock sort of defeats the
purpose of a spinlock in the first place, which is a mutex
primitive designed to avoid the overhead of switching.
- Dan C.
In article <19ad12aadf9a22b760487418c871b3c6@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
But let me turn it around, what real-world problem does this
interrupt prioritization scheme, coupled with the inability
to disable interrupts, solve?
Is anyone complaining that the
ability to disable interrupts is somehow a serious burden?
- Dan C.
On Wed, 26 Mar 2025 23:48:17 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 26 Mar 2025 21:35:01 +0000, Scott Lurndal wrote:
[snip]
If you can NAK a snoop, are you then assuming that all DMA is
non-coherent? That will really piss off the kernel folks.
Not at all. DMA is as coherent as the I/O MMU's translating
PTE prescribes.
But, why would any sane coder be performing an ATOMIC event on
memory that can be blindly overwritten by DMA ??? Or for that
matter, critical section data being overwritten (at some unknown
and unpredictable time) by DMA ??
Polling on a completion status which the device DMAs
(often with an allocate hint) is not uncommon. A 32-bit
word in memory updated by the device - the allocate hint
might update the LLC directly and invalidate L1/L2 or
may update the target cache line that the driver is
polling directly.
Why do this ATOMICally ?? why not just read it ?? Or is this a
grab (read and modify so someone else doesn't get it.) ??
ESM operates in 3 modes {optimistic, Greedy, Methodological}[...]
Sounds fancy, but it also makes me worry about the complexity of
verifying & testing correctness (of the implementation).
ESM operates in 3 modes {optimistic, Greedy, Methodological}[...]
On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
[snip]
1. `from` may point to a cache line
2. `to` may point to a different cache line.
3. Does your architecture use a stack?
No
What sort of alignment criteria do you impose?
On the participating data none. On the cache line--cache line.
At first blush, it seems to me that
the pointers `from_next`, `from_prev`, and `to_next` could be
on the stack and if so, will be on at least one cache line,
and possibly 2, if the call frame for `MoveElement` spans
more than one.
The illustrated code is using 6 participating cache lines.
Where local variables are kept (stack, registers) does not
count against the tally of participating cache lines.
Of course, it's impossible to say when this
is compiled, but maybe you require that the stack is always
aligned to a cache line boundary or something.
Data stack is DoubleWord aligned at all times.
these are all in registers and it's fine.
Local variables are not participating in the event--they are just
around to guide the event towards conclusion.
4. Depending on the size and alignment criteria of the
`Element` structure and the placement of the `next` and
`prev` elements in the struct (unknowable from this example),
`to->next` may be on another cache line.
5. Similarly, `from->next`
6. And `from->prev`
7. And `from_prev->next`
8. And `from_next->prev`
9. And `to_next->prev`.
So potentially this code _could_ touch 9 or 10 cache lines, more
than you said are supported for a single atomic event. What
happens in that case? Does the code just always return false?
The programmer best make sure the optimizer is keeping those
temporary pointers off the stack; good luck with a debug build.
Anyway, let's assume that's not the case, and we don't hit the
pathological case with cache lines, and we're not dealing with
edge cases at the front or end of the lists (so various next or
prev pointers are not NULL). Why don't I try to explain what I
think this code is doing, and you tell me whether I'm right or
wrong.
If I understand this correctly, the `esmLOCKload`s are simply
loads, but they make sure that the compiler emits instructions
that tie into your atomic support framework.
So far so good.
Earlier you wrote,
"esmINTERFERENCE is a query on the state of th event and whether
a 2nd party has performed a memory action that kills the event."
Ok, I take this to mean _at the time of the store_, which
appears later in the program. If it were simply a query at the
time it was written in the code, then it seems like it opens a
TOCTOU bug (what if something interferes after the check, but
before the locked store at the end?). Or perhaps it signals
to the architecture that this is the point to which the
processor should rollback, if anything subsequently fails; sort
of a `setjmp` kind of thing that conceptually returns twice.
The "if( esmINTERFERENCE() )" not only queries the state of the
participating cache lines, it sets up a guarding-shadow over
the STs to follow such that if killing-interference is detected
none of the STs are performed AND control is transferred to the
branch label (outside of the event). So not only does it query
the state at its execution, it monitors that over the rest of
the event.
BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
is that if we have reached the manifestation phase of the event
(that is ready to do all the stores) then we should be able to
drive the event forward by not sending the data in response to
the SNOOP. We send a Nak (based on priority or the data)
If the SNOOPing core receives a NaK and it is not attempting an
event, the request is simply reissued. If it is attempting, then
its event fails. So a core benignly accessing data that another
core is attempting an event upon is delayed by an interconnect
round trip--but if it is attempting, it knows its event will
fail and it will either retry from before the event started,
or it will go to the interference-label where other facilities
are at its disposal.
Assuming nothing has poisoned the event, the various pointer
stores inside the conditional are conditionally performed; the
`esmLOCKstore` to `from->next` commits the event, at which
point all side effects are exposed and durable.
From a SW perspective, that is accurate enough.
From a HW perspective, the participating STs are performed and
held in a buffer, and when it is known the event has succeeded,
a bit is flipped and these writes to memory all appear to have
been performed in the same cycle. One cycle earlier and no
interested 2nd party has seen any of them, the subsequent cycle
any interested 2nd party will see them all.
I should also note: ESM works even when there are no caches in
the system. ...
Is that correct? Let's suppose that it is.
But later you wrote,
So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
normal ATOMIC primitives everyone and his brother provide.
Well, what precisely do you mean here when you say, "if you want
the property whereby the lock disappears on any control transfer
out of the event"?
If you want that property--you use the tools at hand.
If you don't just use them as primitive generators.
What I _hope_ you mean is that if you transfer "out of the
event" before the `esmLOCKstore` then anything that's been done
since the "event" started is rolled back, but NOT if control
transfer happens _after_ the `esmLOCKstore`. If THAT were the
case, then this entire mechanism is useless.
One can use an esmLOCKstore( *SP, 0 ) to abandon an event. HW
detects that *SP is not participating in the event, and uses
the lock bit in the instruction to know it is time to leave
the event in the failed manner. No ST is performed to the
non-participating cache line. {{I guess I could wrap the
abandonment into a macro ...}}
Perhaps you would clarify?
But for the time being, let's assume that `esmLOCKstore` is
conceptually a commit that ends the atomic event, and the store
is durable and observable after that. Let's see if we can
construct a simple spinlock from these primitives.
Notice that only the last ST has the esmLOCKstore() applied.
HW knows that STs to participating lines have the ATOMIC
property. The final ST uses the LOCK bit to denote the end
of the event.
typedef struct Spinlock Spinlock;
struct Spinlock {
uint32_t lock;
};
uint32_t
xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
{
uint32_t v = esmLOCKload(&lk->lock);
if (v == expected && !esmINTERFERENCE()) {
esmLOCKstore(&lk->lock, val);
}
return v;
}
If the interference check is desired at all:: (see below)
I think you should move interference check in front of the
expected check, but that is a first impression--have to think
about it more.
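One reading of that reordering, with the interference query hoisted
(same primitives as above, otherwise unchanged):

uint32_t
xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
{
    uint32_t v = esmLOCKload(&lk->lock);
    /* Query (and arm) the interference monitor before comparing. */
    if (!esmINTERFERENCE() && v == expected) {
        esmLOCKstore(&lk->lock, val);
    }
    return v;
}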
void
spin_acquire(volatile Spinlock *lk)
{
splhi();
while (xcas(lk, 0, 1) != 0)
relax(); // Spin loop intrinsic.
}
void
spin_release(volatile Spinlock *lk)
{
xcas(lk, 1, 0);
// Maybe we should `spllo` here, maybe not. Either choice
// reflects lots of assumptions about whether or not we might
// take and hold multiple locks concurrently, etc.
spllo();
}
Does this seem correct? It is not immediately clear to me
whether `esmINTERFERENCE` is actually needed here, but
avoiding the store seemed fine.
There are MANY cases where the interference query is not needed.
Without the interference query, deadly interference backs the
core up to the first instruction of the event. With the interference
check, if deadly interference happens before the query, control
reverts to before the event, and afterwards, control is transferred
to the interference-label.
One could imagine shorting
`xcas` to simply:
uint32_t
xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
{
uint32_t v = esmLOCKload(&lk->lock);
if (v == expected) esmLOCKstore(&lk->lock, val);
return v;
}
Yes, unless you want the alternate exit strategy, this works fine.
In fact test and set is written::
boolean TestAndSet( char *p )
{
    boolean t = esmLOCKload( *p );
    esmLOCKstore( *p, 1 );
    return t;
}
Should interference transpire between load and store, control
reverts back to the LD again, so the return value is the LD
which did not encounter interference.
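So a spin acquire built on it is just the following sketch (relax() is
the spin-loop hint from Dan's earlier example):

void spin_acquire( char *lock )
{
    while( TestAndSet( lock ) )   /* prior value of 1 => someone holds it */
        relax();
}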
For the sizes of CSs this covers--it prevents priority inversion.
Except you've admitted that this is not a general solution, as
it cannot "cover" arbitrary software-defined critical sections:
the window isn't big enough and we've seen how even a fairly
small problem may exceed the number of cache lines supported by
one of these events.
Remember only the lines participating in the event count. For
example, say one had sprintf() inside the event. None of the
lines touched by sprintf() are participating in the event,
and are not counted against the 8 cache lines available.
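A sketch of what that means in practice (Element, boolean and the
esmLOCK* primitives as used elsewhere in this thread; the function itself
is hypothetical): the sprintf() below touches buf through ordinary,
unmarked accesses, so its lines are not monitored and do not count
toward the 8.

#include <stdio.h>

char buf[64];

boolean InsertAndTrace( Element *el, Element *to )
{
    Element *tn = esmLOCKload( to->next );      /* participating        */
    esmLOCKprefetch( el );                      /* participating        */
    esmLOCKprefetch( tn );                      /* participating        */
    sprintf( buf, "insert %p after %p",
             (void *)el, (void *)to );          /* NOT participating    */
    el->next = tn;
    el->prev = to;
    to->next = el;
    esmLOCKstore( tn->prev, el );               /* commits the event    */
    return TRUE;
}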
---------------------
So now I, as a systems software person, have to ask design
question around how I want to use these architectural features,
if at all, weighing various pros and cons. This is where I'm
struggling with your proposal: given that I'm going to have to
implement more general support for the things that the
architecture gives me anyway, what do I gain by also supporting
the specialized mechanisms?
All I can say about this, was that when I presented ASF (ESM's
predecessor) to Microsoft and spent a couple hours explaining
to them how it works, MS wanted it implemented as rapidly as
AMD could pull it off. Dave Cutler was particularly impressed.
And here I thought you had an imagination .....
You admitted you're not an OS person and asked other people to
poke holes in your proposal. I'm sorry if you find the holes
I'm poking upsetting.
I enjoy your poking of holes--I really do
But you seem like the calculus student that keeps asking what
problem can I solve now that I can differentiate.
The illustrated code is using 6 participating cache lines.
Where local variables are kept (stack, registers) does not
count against the tally of participating cache lines.
Huh. How would this handle something like an MCS lock, where
the lock node may be on a stack, though accessible globally in
some virtual address space?
ESM doesn't distinguish the stack from other places. It just doesn't
pay attention to memory accesses that aren't marked specially. So, if
your lock is on the stack it will still work, as long as you use
`esmLOCKload` to read it.
Again, not what I was asking. I'm trying to understand what you
meant when you talked about disappearing locks.
He was talking about the conceptual lock used internally by the CPU
to implement an ESM event, which gives its name to `esmLOCK*`.
I.e. the same kind of lock used in other CPUs to implement LL/SC and
that gives its name to "load locked" (tho it's also sometimes named
"load linked").
In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
[snip]
1. `from` may point to a cache line
2. `to` may point to a different cache line.
3. Does your architecture use a stack?
No
You make multiple references to a stack just below. I
understand stack references are excluded from the atomic event,
but I asked whether the architecture uses a stack. Does it, or
doesn't it?
What sort of alignment criteria do you impose?
On the participating data none. On the cache line--cache line.
I meant for data on the stack. The point is moot, though, since
as you said stack data does not participate in the atomic event.
At first blush, it seems to me that
the pointers `from_next`, `from_prev`, and `to_next` could be
on the stack and if so, will be on at least one cache line,
and possibly 2, if the call frame for `MoveElement` spans
more than one.
The illustrated code is using 6 participating cache lines.
Where local variables are kept (stack, registers) does not
count against the tally of participating cache lines.
Huh.
How would this handle something like an MCS lock, where
the lock node may be on a stack, though accessible globally in
some virtual address space?
But later you wrote,
So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
normal ATOMIC primitives everyone and his brother provide.
Well, what precisely do you mean here when you say, "if you want
the property whereby the lock disappears on any control transfer
out of the event"?
If you want that property--you use the tools at hand.
If you don't just use them as primitive generators.
I wasn't clear enough here. I'm asking what, exactly, you mean
by this _property_,
not what you mean when you write that one
can use these atomic events if one wants the property. That is,
I'm asking you to precisely define what it means for a lock to
"disappear".
It seems clear enough _now_ that once the event concludes
successfully the lock value is set to whatever it was set to in
during the event, but that wasn't clear to me earlier and I
wanted confirmation.
In particular, it seems odd to me that one would bother with a
lock of some kind during an event if it didn't remain set after
the event.
If you do all of your critical section stuff inside
of the event, and setting the lock in the event is not visible
until the event concludes, why bother? If on the other hand you
use the event to set the lock, why bother doing additional work
inside the event itself, but in this case I definitely don't
want something else to come along and just unset the lock on its
own, higher priority or not.
What I _hope_ you mean is that if you transfer "out of the
event" before the `esmLOCKstore` then anything that's been done
since the "event" started is rolled back, but NOT if control
transfer happens _after_ the `esmLOCKstore`. If THAT were the
case, then this entire mechanism is useless.
One can use a esmLOCKstore( *SP, 0 ) to abandon an event. HW
detects that *SP is not participating in the event, and uses
the lock bit in the instruction to know it is time to leave
the event in the failed manner. No ST is performed to the
non-participating cache line. {{I guess I could wrap the
abandonment into a macro ...}}
Again, not what I was asking. I'm trying to understand what you
meant when you talked about disappearing locks.
Again, I think you meant that, if an atomic event that sets a
lock variable fails,
whether it got far enough along to set the
variable inside the event or not is not observable if the event
fails.
But if the event succeeds, then the variable is set and
that's that. IF that's the case, then great! But if not, then
disappearing locks are bad.
This is especially murky to me
because there seem to be bits of OS primitives, like threads and
priority, that are known directly to the hardware.
Ok, so perhaps this is a reasonable implementation of spinlock
acquire, as I wrote to Stefan yesterday:
void
spinlock_acquire(volatile Spinlock *lk)
{
while (esmLOCKload(&lk->lock) != 0)
cpu_spin_hint();
esmLOCKstore(&lk->lock, 1);
}
Yes? If this is preempted, or another CPU wins the race on the
lock, the hardware will back this up and start over in the while
loop, right?
Remember only the lines participating in the event count. For
example, say one had sprintf() inside the event. None of the
lines touched by sprintf() are participating in the event,
and are not counted against the 8 cache lines available.
This doesn't change the point, though. Anything that touches
more than 8 cache lines is not suitable for an atomic event.
Insertion into a doubly linked list is already up to 6 lines,
recall that my original hypothetical also involved popping off a
queue, which adds another (the queue head). If I threw in
incrementing a couple of counters I can't do it. And this seems
like a perfectly reasonable real-world scenario; consider a work
stealing thread scheduler that takes a task from one core to
another and keeps some stats on how things move.
Or maybe even just swapping places between two elements in a
linked list. For example:
But there isn't a lot of publicly available documentation for
any of this, at least none that I could find, and from the way
it's been described thus far it seems like there are a number of
edge cases that it is not clear how one is supposed to address.
So I want to understand the boundaries of where these hardware
facilities are applicable, given architectural limitations. I
think this thread has shown that those boundaries exist. The
universe of interesting and useful software is big, and someone,
somewhere, is going to trip over those limitations sooner or
later. Once we accept that, we must ask, "Then what?" It seems
obvious that programmers are going to have to find techniques to
work around those limitations, and that the mechanisms chosen
are unknowable to hardware.
But now we're back in the same boat of software having to deal
with all the thorny problems that these mechanisms were designed
to address in the first place, like priority inversion: take the
spinlock from earlier, the hardware has no semantic knowledge
that the lock is used as a mutex, and can't know; it's just some
bit in memory somewhere that was set during an atomic event.
Maybe ESM helps in a race when trying to acquire the spinlock,
but once it's acquired, it's acquired, and because the spinlock
is not a primitive understood by the hardware, the hardware has
no way of incorporating the spinlock's semantics into its view
of priority, threading, etc.
So while an atomic event might help me in the implementation of
a generic mutex of some kind, so that I can protect critical
sections that can't be handled with the atomic event framework
directly because of their size, that's the extent of the help it
can give me, because once the mutex is acquired the hardware has
no insight into the fact that it's a mutex.
So if some higher
priority thing subsequently comes along and preempts the thread
holding that lock and tries to take that lock itself, then the
hardware can't help me out anymore, and I'm back to a
priority-inversion deadlock that has to be dealt with by
software.
The point is, even if the hardware gives me a cool
mechanism for helping prevent priority inversion problems,
programmers are inevitably going to have to handle those
situations themselves anyway.
This is the sort of thing I'm trying to communicate, it's really
not me sitting here tossing out barbs like, "oh yeah?! Well,
how about THIS one, smart guy?"
- Dan C.
Stefan Monnier wrote:
a going to make a big difference here. But I also wonder at the
internal implementation of these instructions in e.g. that
Fujitsu chip that allows throughput to stay more or less
constant as the number of cores increases; I'd think that cache
coherency traffic as threads contend for a single cache line
would create a bottleneck leading to (at least) a linear
slow down.
IIUC the idea is that the atomic-add is sent to the memory and performed
there, instead of moving the data to the cache. So there's no
contention for the cache line because the data is not in a cache at all.
This one I have never understood the use/need for, except to provide
global shared counters of some sort:
In that case it would make perfect sense to allow any core to send (and immediately forget) a message "increment counter X by 1", but at least
in my own coding, any LOCK XADD has been performed because I now need to
do something with either the previous or the updated value of the
counter.
Having the atomic-add performed by a memory controller, returning the
result, would make it almost perfectly scalable, but would also limit
the speed of this operation to the bus transaction speed, even when
there is zero contention, right?
I guess if the atomic-add has been implemented at all/most cache levels
as well, then it would indeed perform better for all workloads, the
problem is that under contention you still have to flush the variable
out to the RAM array, right?
Terje
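Terje's point in C11 terms: the fetch-add is useful precisely because the
old value comes back; a fire-and-forget "increment X" message could not
hand out, say, unique ticket numbers. A minimal sketch:

#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_fast64_t next_ticket;

uint64_t take_ticket(void)
{
    /* The old value is returned atomically, so every caller gets a
       distinct ticket even under heavy contention. */
    return atomic_fetch_add_explicit(&next_ticket, 1, memory_order_relaxed);
}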
In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
Or maybe even just swapping places between two elements in a
linked list. For example:
void
swap_places(Node **head, Node *a, Node *b)
{
Node *hp, *an, *ap, *bn, *bp;
assert(head != NULL);
assert(a != NULL);
assert(b != NULL);
if (a == b)
return;
esmLOCKprefetch(*head);
if (*head == a) // see !
*head = b;
else if (*head == b)
*head = a;
an = esmLOCKload(a->next);
ap = esmLOCKload(a->prev);
bn = esmLOCKload(b->next);
bp = esmLOCKload(b->prev);
b->next = an;
if (an != NULL) {
esmLOCKprefetch(an->prev);
an->prev = b;
}
b->prev = ap;
if (ap != NULL) {
esmLOCKprefetch(ap->next);
ap->next = b;
}
a->next = bn;
if (bn != NULL) {
esmLOCKprefetch(bn->prev);
bn->prev = a;
}
if (bp != NULL) {
esmLOCKprefetch(bp->next);
bp->next = a;
}
esmLOCKstore(a->prev, bp);
}
- Dan C.
void
swap_places(Node **head, Node *a, Node *b)
{
Node *hp, *an, *ap, *bn, *bp;
assert(head != NULL);
assert(a != NULL);
assert(b != NULL);
if (a == b)
return;
esmLOCKprefetch( an = esmLOCKload(a->next) );
esmLOCKprefetch( ap = esmLOCKload(a->prev) );
esmLOCKprefetch( bn = esmLOCKload(b->next) );
esmLOCKprefetch( bp = esmLOCKload(b->prev) );
hp = esmLOCKload(*head);
if (hp == a)
*head = b;
else if (hp == b)
*head = a;
b->next = an;
if (an != NULL) {
an->prev = b;
}
b->prev = ap;
if (ap != NULL) {
ap->next = b;
}
a->next = bn;
if (bn != NULL) {
bn->prev = a;
}
if (bp != NULL) {
bp->next = a;
}
esmLOCKstore(a->prev, bp);
}
On Thu, 27 Mar 2025 17:19:21 +0000, Dan Cross wrote:
In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
[snip]
1. `from` may point to a cache line
2. `to` may point to a different cache line.
3. Does your architecture use a stack?
No
You make multiple references to a stack just below. I
understand stack references are excluded from the atomic event,
but I asked whether the architecture uses a stack. Does it, or
doesn't it?
What sort of alignment criteria do you impose?
On the participating data none. On the cache line--cache line.
I meant for data on the stack. The point is moot, though, since
as you said stack data does not participate in the atomic event.
Stack does not NECESSARILY participate. However, if you perform
an esmLOCKload( SP[23] ), then that line on the stack is participating.
At first blush, it seems to me that
the pointers `from_next`, `from_prev`, and `to_next` could be
on the stack and if so, will be on at least one cache line,
and possibly 2, if the call frame for `MoveElement` spans
more than one.
The illustrated code is using 6 participating cache lines.
Where local variables are kept (stack, registers) does not
count against the tally of participating cache lines.
Huh.
There are 6 esmLOCKxxxxx() so there are 6 participating lines--
and these are absolutely independent from where the variables
are located. The variables carry information but do not
participate; they carry data between operations on the participating
lines.
How would this handle something like an MCS lock, where
the lock node may be on a stack, though accessible globally in
some virtual address space?
As illustrated above esmLOCKload( SP[23]) will cause a line on
the stack to participate in the event.
----------------
But later you wrote,
So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
normal ATOMIC primitives everyone and his brother provide.
Well, what precisely do you mean here when you say, "if you want
the property whereby the lock disappears on any control transfer
out of the event"?
If you want that property--you use the tools at hand.
If you don't just use them as primitive generators.
I wasn't clear enough here. I'm asking what, exactly, you mean
by this _property_,
The property that if an interrupt (or exception or SVC or SVR)
prevents executing the event to its conclusion, HW makes the
event look like it never started. So that when/if control returns
we have another chance to execute the event as-if ATOMICally.
not what you mean when you write that one
can use these atomic events if one wants the property. That is,
I'm asking you to precisely define what it means for a lock to
"disappear".
It seems clear enough _now_ that once the event concludes
successfully the lock value is set to whatever it was set to in
during the event, but that wasn't clear to me earlier and I
wanted confirmation.
Not "the lock value" but "all participating values" are set
and become visible system-wide in a single instant.
No 3rd
party sees some set and others remain unset. Everybody sees
all of them change value between this-clock and its successor.
In particular, it seems odd to me that one would bother with a
lock of some kind during an event if it didn't remain set after
the event.
Consider CompareTripleSwapFour() as an atomic primitive. How
would you program this such that nobody nowhere could ever
tell that the four updated locations changed in any order
other than simultaneously ??? Without using a LOCK to guard the
event ??? And without demanding that all four updates are
to the same cache line ???
{{Nobody nowhere included ATOMIC-blind DMA requests}}
If you do all of your critical section stuff inside
of the event, and setting the lock in the event is not visible
until the event concludes, why bother? If on the other hand you
use the event to set the lock, why bother doing additional work
inside the event itself, but in this case I definitely don't
want something else to come along and just unset the lock on its
own, higher priority or not.
It is clear you do not understand the trouble HW takes to implement
even DCADS where between any 2 clocks, one (or more) SNOOPs can
take the lines you are DCADSing. It is this property of cache
coherence that gets in the way of multi-operation ATOMIC events.
In 2004 Fred Weber came to me and asked: "Why can't we (AMD) give
Microsoft the DCADS they want?" I dug into it, and ASF was my
solution in x86 ISA, ESM is my solution in RISC-based ISA. It
removes the need to add ATOMIC primitives to ISA over generations.
It also provides the ability to do other things--which apparently
you will not use--sort of like complaining that your new car has
auto-park, a feature you will never use.
What I _hope_ you mean is that if you transfer "out of the
event" before the `esmLOCKstore` then anything that's been done
since the "event" started is rolled back, but NOT if control
transfer happens _after_ the `esmLOCKstore`. If THAT were the
case, then this entire mechanism is useless.
One can use an esmLOCKstore( *SP, 0 ) to abandon an event. HW
detects that *SP is not participating in the event, and uses
the lock bit in the instruction to know it is time to leave
the event in the failed manner. No ST is performed to the
non-participating cache line. {{I guess I could wrap the
abandonment into a macro ...}}
Again, not what I was asking. I'm trying to understand what you
meant when you talked about disappearing locks.
If the event cannot be completed due to interrupt or exception or
SVC or SVR, the HW backs up the event such that it looks like
it never started. So when control returns, the ATOMIC event runs
in its entirety ATOMICally !! So, the illusion of ATOMICITY is
preserved.
Again, I think you meant that, if an atomic event that sets a
lock variable fails,
The event has NO lock variable--it has up to 8 cache line address
monitors.
whether it got far enough along to set the
variable inside the event or not is not observable if the event
fails.
Non-participating cache lines can be used to leak information
out of the event (like for debug purposes). Participating data
does not leak.
But if the event succeeds, then the variable is set and
that's that. IF that's the case, then great! But if not, then
disappearing locks are bad.
It is NOT a FRIGGEN LOCK--
it is a set of address monitors which
are used to determine the success of the event -- the monitors
remain invisible to SW just like miss buffers remain invisible
to SW. The monitors are not preservable across privilege changes.
This is especially murky to me
because there seem to be bits of OS primitives, like threads and
priority, that are known directly to the hardware.
Just its instantaneous privilege level--which is congruent to
its register file, Root Pointer, ASID, and operating modes.
------------------
Ok, so perhaps this is a reasonable implementation of spinlock
acquire, as I wrote to Stefan yesterday:
void
spinlock_acquire(volatile Spinlock *lk)
{
while (esmLOCKload(&lk->lock) != 0)
cpu_spin_hint();
esmLOCKstore(&lk->lock, 1);
}
Yes? If this is preempted, or another CPU wins the race on the
lock, the hardware will back this up and start over in the while
loop, right?
Yes.
------------
Remember only the lines participating in the event count. For
example, say one had sprintf() inside the event. None of the
lines touched by springt() are participating in the event,
and are not counted against the 8 cache lines available.
This doesn't change the point, though. Anything that touches
more than 8 cache lines is not suitable for an atomic event.
I see what you are saying, but the event is allowed to have an
unbounded number of touches to non-participating lines.
Insertion into a doubly linked list is already up to 6 lines,
I moved a doubly linked element from a CDS and inserted it
somewhere else doubly-linked in the same CDS in 6 cache lines.
You need to refine how you count.
BOOLEAN InsertElement( Element *el, Element *to )
{
    tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        el->next = tn;
        el->prev = to;
        to->next = el;
        esmLOCKstore( tn->prev, el );
        return TRUE;
    }
    return FALSE;
}
recall that my original hypothetical also involved popping off a
queue, which adds another (the queue head). If I threw in
incrementing a couple of counters I can't do it. And this seems
like a perfectly reasonable real-world scenario; consider a work
stealing thread scheduler that takes a task from one core to
another and keeps some stats on how things move.
Or maybe even just swapping places between two elements in a
linked list. For example:
The example has 3 structures, each requiring 3 participating lines
{themselves, the struct their next pointer points at, and the struct
their prev pointer points at}.
3×3 = 9 and there is no getting around it.
-------------------------
But there isn't a lot of publicly available documentation for
any of this, at least none that I could find, and from the way
it's been described thus far it seems like there are a number of
edge cases that it is not clear how one is supposed to address.
So I want to understand the boundaries of where these hardware
facilities are applicable, given architectural limitations. I
think this thread has shown that those boundaries exist. The
universe of interesting and useful software is big, and someone,
somewhere, is going to trip over those limitations sooner or
later. Once we accept that, we must ask, "Then what?" It seems
obvious that programmers are going to have to find techniques to
work around those limitations, and that the mechanisms chosen
are unknowable to hardware.
Fair enough.
But now we're back in the same boat of software having to deal
with all the thorny problems that these mechanisms were designed
to address in the first place, like priority inversion: take the
spinlock from earlier, the hardware has no semantic knowledge
that the lock is used as a mutex, and can't know; it's just some
bit in memory somewhere that was set during an atomic event.
It is not a FRIGGEN lock, it is a set of address monitors which
guard the instructions in the event.
A Lock is a SW concept
which requires a value change to announce to others that "you
got it".
Maybe ESM helps in a race when trying to acquire the spinlock,
but once it's acquired, it's acquired, and because the spinlock
is not a primitive understood by the hardware, the hardware has
no way of incorporating the spinlock's semantics into its view
of priority, threading, etc.
Consider a 1024 core system and the time-quantum goes off and
every core wants to rotate the thread it is running; putting
it at the back of the execution queue and taking off the one on
the front to run. Nick McLaren timed a big SUN server (100 core)
on this and it could take up to 6 seconds !! for something that
should take a few milliseconds. This is because of the BigO( n^3 )
nature of bus traffic wrt SNOOPing on that system.
One can use ESM to convert that particular case to BigO( 3 )
or in the general case of random insert and removal times:
BigO( ln( n ) ).
So while an atomic event might help me in the implementation of
a generic mutex of some kind, so that I can protect critical
sections that can't be handled with the atomic event framework
directly because of their size, that's the extent of the help it
can give me, because once the mutex is acquired the hardware has
no insight into the fact that it's a mutex.
Exactly:: once you convert address-based mutual exclusion to
data-based mutual exclusion, HW no longer sees it at all.
So if some higher
priority thing subsequently comes along and preempts the thread
holding that lock and tries to take that lock itself, then the
hardware can't help me out anymore, and I'm back to a
priority-inversion deadlock that has to be dealt with by
software.
HW can only "take" address based locking, not data-based locking.
The point is, even if the hardware gives me a cool
mechanism for helping prevent priority inversion problems,
programmers are inevitably going to have to handle those
situations themselves anyway.
Never promised you a rose garden.
This is the sort of thing I'm trying to communicate, it's really
not me sitting here tossing out barbs like, "oh yeah?! Well,
how about THIS one, smart guy?"
Look at it this way:: over the decade from 2000-2010 x86 added
a new ATOMIC instruction every iteration.
Using ASF or ESM; my architecture would never have to add one.
SW has the tools to build what it wants/needs.
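(To make that claim concrete, here is a minimal sketch of one such
software-built primitive, using the esmLOCKload/esmLOCKstore calls as
they are used elsewhere in this thread; the function itself is
hypothetical, not code from either poster.)
#include <stdint.h>

// Hypothetical fetch-and-add built from the ESM primitives discussed here.
// The single participating line is the one holding *p; if another core
// interferes before the closing store, HW backs the event up and it
// re-executes from the top, so no explicit retry loop is written.
uint64_t atomic_fetch_add(volatile uint64_t *p, uint64_t v)
{
    uint64_t old = esmLOCKload(*p);   // begin event; line of *p participates
    esmLOCKstore(*p, old + v);        // close event; update becomes visible
    return old;
}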
-----------------------
In article <601781e2c91d42a73526562b419fdf02@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
The property that if an interrupt (or exception or SVC or SVR)
prevents executing the event to its conclusion, HW makes the
event look like it never started. So that when/if control returns
we have another chance to execute the event as-if ATOMICally.
Ok. So let's be crystal clear here: after the event concludes
successfully, a subsequent exception, interrupt, or whatever,
will not magically roll back any of the values set and made
visible system-wide as a result of the successful conclusion of
the atomic event. Correct? That's what I've been asking.
- Dan C.
Ok. So let's be crystal clear here: after the event concludes
successfully, a subsequent exception, interrupt, or whatever,
will not magically roll back any of the values set and made
visible system-wide as a result of the successful conclusion of
the atomic event. Correct? That's what I've been asking.
If the esmLOCKstore() retires, the participating memory becomes visible
at that instant and cannot be subsequently undone.
If the esmLOCKabandon( ) retires, participating memory reverts to its pre-event state in that instant as if the event never happened.
In article <601781e2c91d42a73526562b419fdf02@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Thu, 27 Mar 2025 17:19:21 +0000, Dan Cross wrote:
In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
[snip]
1. `from` may point to a cache line
2. `to` may point to a different cache line.
3. Does your architecture use a stack?
No
You make multiple references to a stack just below. I
understand stack references are excluded from the atomic event,
but I asked whether the architecture uses a stack. Does it, or
doesn't it?
What sort of alignment criteria do you impose?
On the participating data none. On the cache line--cache line.
I meant for data on the stack. The point is moot, though, since
as you said stack data does not participate in the atomic event.
Stack does not NECESSARILY participate. However, if you perform
an esmLOCKload( SP[23] ) then that line on the stack is participating.
Very good.
At first blush, it seems to me that
the pointers `from_next`, `from_prev`, and `to_next` could be
on the stack and if so, will be on at least one cache line,
and possibly 2, if the call frame for `MoveElement` spans
more than one.
The illustrated code is using 6 participating cache lines.
Where local variables are kept (stack, registers) does not
count against the tally of participating cache lines.
Huh.
There are 6 esmLOCKxxxxx() so there are 6 participating lines--
and these are absolutely independent from where the variables
are located. The variables carry information but do not
participate; they carry data between operations on the
participating lines.
That was clearly in the code. That was not what the "Huh" was
in response to.
How would this handle something like an MCS lock, where
the lock node may be on a stack, though accessible globally in
some virtual address space?
As illustrated above esmLOCKload( SP[23]) will cause a line on
the stack to participate in the event.
Ok, good to know.
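(Since MCS locks came up: a minimal sketch of how an MCS-style acquire
might be written against these primitives, with the queue node living on
the caller's stack. The MCS structure is the textbook one and the esm
calls follow their usage in this thread; treat the details as
illustrative, not as anyone's verified code.)
typedef struct MCSNode MCSNode;
struct MCSNode {
    MCSNode      *next;
    volatile int  locked;
};

typedef struct MCSLock {
    MCSNode *tail;
} MCSLock;

// 'self' may be a stack-resident node; per the discussion above, touching
// it with an esmLOCK* access would make its line participate in the event.
void mcs_acquire(MCSLock *lk, MCSNode *self)
{
    MCSNode *pred;

    self->next = NULL;
    self->locked = 1;

    // Atomic exchange of the tail pointer: the load and store of lk->tail
    // form one small ATOMIC event.
    pred = esmLOCKload(lk->tail);
    esmLOCKstore(lk->tail, self);

    if (pred != NULL) {
        pred->next = self;            // link in behind our predecessor
        while (self->locked)          // spin locally on our own node
            cpu_spin_hint();
    }
}
Release is the usual MCS hand-off (clear the successor's locked flag, or
swing tail back to NULL when there is no successor); the point of interest
here is only that the node can sit on a stack and still be covered by the
event.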
----------------
But later you wrote,
So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
normal ATOMIC primitives everyone and his brother provide.
Well, what precisely do you mean here when you say, "if you want
the property whereby the lock disappears on any control transfer
out of the event"?
If you want that property--you use the tools at hand.
If you don't, just use them as primitive generators.
I wasn't clear enough here. I'm asking what, exactly, you mean
by this _property_,
The property that if an interrupt (or exception or SVC or SVR)
prevents executing the event to its conclusion, HW makes the
event look like it never started. So that when/if control returns
we have another chance to execute the event as-if ATOMICally.
Ok. So let's be crystal clear here: after the event concludes
successfully, a subsequent exception, interrupt, or whatever,
will not magically roll back any of the values set and made
visible system-wide as a result of the successful conclusion of
the atomic event. Correct? That's what I've been asking.
not what you mean when you write that one
can use these atomic events if one wants the property. That is,
I'm asking you to precisely define what it means for a lock to
"disappear".
It seems clear enough _now_ that once the event concludes
successfully the lock value is set to whatever it was set to in
during the event, but that wasn't clear to me earlier and I
wanted confirmation.
Not "the lock value" but "all participating values" are set
and become visible system-wide in a single instant.
Sure. But there was only one participating value in the
spinlock example. ;-}
No 3rd
party sees some set and others remain unset. Everybody sees
all of them change value between this-clock and its successor.
Sure. You've said this all along, and I've never disputed it.
In particular, it seems odd to me that one would bother with a
lock of some kind during an event if it didn't remain set after
the event.
Consider CompareTripleSwapFour() as an atomic primitive. How
would you program this such that nobody nowhere could ever
tell that the four updated locations changed in any order
rather than simultaneously ??? Without using a LOCK to guard the
event ??? And without demanding that all four updates are
to the same cache line ???
{{"Nobody nowhere" includes ATOMIC-blind DMA requests}}
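(For what it is worth, under the model described in this thread the
answer would seem to fall out directly; a sketch of such a primitive,
with a made-up signature, using the esmLOCKabandon() named elsewhere in
this exchange. Illustrative only, not anyone's actual code.)
#include <stdint.h>
#include <stdbool.h>

// Hypothetical multi-location primitive: if three locations hold their
// expected values, update four locations so that all of them appear to
// change in the same instant. All participating lines are touched before
// any of them is modified, per the rule stated in this thread.
bool CompareTripleSwapFour(uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d,
                           uint64_t ea, uint64_t eb, uint64_t ec,
                           uint64_t na, uint64_t nb, uint64_t nc, uint64_t nd)
{
    uint64_t va = esmLOCKload(*a);
    uint64_t vb = esmLOCKload(*b);
    uint64_t vc = esmLOCKload(*c);
    esmLOCKprefetch(*d);                 // stored but not compared

    if (esmINTERFERENCE())
        return false;                    // event already failed; nothing visible
    if (va == ea && vb == eb && vc == ec) {
        *a = na;
        *b = nb;
        *c = nc;
        esmLOCKstore(*d, nd);            // closes the event; all four updates
        return true;                     // become visible at one instant
    }
    esmLOCKabandon();                    // values did not match: leave the
    return false;                        // event with nothing made visible
}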
The fundamental issue here is that, when I refer to a "lock" I
have been talking about a software mutual exclusion primitive,
and you are referring to something else entirely.
I would have thought my meaning was clear from context;
evidently not.
If you do all of your critical section stuff inside
of the event, and setting the lock in the event is not visible
until the event concludes, why bother? If on the other hand you
use the event to set the lock, why bother doing additional work
inside the event itself, but in this case I definitely don't
want something else to come along and just unset the lock on its
own, higher priority or not.
It is clear you do not understand the trouble HW takes to implement
even DCADS where between any 2 clocks, one (or more) SNOOPs can
take the lines you are DCADSing. It is this property of cache
coherence that gets in the way of multi-operation ATOMIC events.
As above, when I refer to "setting the lock" I am referring to
acquiring a mutex; a software primitive. I am not referring to
what the hardware does internally. Consider what I wrote above,
the section that you quoted, in that context.
In 2004 Fred Weber came to me and asked: "Why can't we (AMD) give
Microsoft the DCADS they want?" I dug into it, and ASF was my
solution in x86 ISA, ESM is my solution in RISC-based ISA. It
removes the need to add ATOMIC primitives to ISA over generations.
It also provides the ability to do other things--which apparently
you will not use--sort of like complaining that your new car has
auto-park, a feature you will never use.
"It is clear you do not understand the trouble SW takes to
implement even simple things that are beyond the capability of
hardware."
(Back at you. I think I've been pretty polite here; I'm getting
tired of the barbs.)
What I _hope_ you mean is that if you transfer "out of the
event" before the `esmLOCKstore` then anything that's been done
since the "event" started is rolled back, but NOT if control
transfer happens _after_ the `esmLOCKstore`. If THAT were the
case, then this entire mechanism is useless.
One can use an esmLOCKstore( *SP, 0 ) to abandon an event. HW
detects that *SP is not participating in the event, and uses
the lock bit in the instruction to know it is time to leave
the event in the failed manner. No ST is performed to the
non-participating cache line. {{I guess I could wrap the
abandonment into a macro ...}}
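(A sketch of the macro being mused about, for the record; how the stack
pointer is named from C is implementation-specific, so a local dummy
stands in for *SP, and the macro name is made up. The thread later
mentions an esmLOCKabandon() primitive that does this directly.)
// Hypothetical wrapper for the abandonment idiom described above: a
// LOCK-flavored store to a line that is not participating in the event
// tells HW to leave the event in the failed manner. The dummy is assumed
// to live on a stack line the event has not touched.
#define ESM_ABANDON()                    \
    do {                                 \
        volatile long __dummy = 0;       \
        esmLOCKstore(__dummy, 0);        \
    } while (0)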
Again, not what I was asking. I'm trying to understand what you
meant when you talked about disappearing locks.
If the event cannot be completed due to interrupt or exception or
SVC or SVR, the HW backs up the event such that it looks like
it never started. So when control returns, the ATOMIC event runs
in its entirety ATOMICally !! So, the illusion of ATOMICITY is
preserved.
Sigh. You've said this many times over now. I get it. I'm not
disputing it, nor do I believe I ever have.
Again, when I said "lock" earlier, I am referring to a mutux
from the perspective of software. That is, the value of the
structure member that I called "lock" in the version of the
`Spinlock` struct that I posted however-many responses ago, and
that I keep referring to in example code. I am not asking about
what the hardware does internally. I'm not talking about locked
bus cycles or whatever. I'm not referring to cache line address
monitors or any of these other things that you keep referring
to. I'm talking about the value of the structure member called
`lock` in the `Spinlock` C structure I posted earlier.
What I have been asking, over and over again, is whether that
_value_ "disappears" _after_ an atomic event successfully
concludes. That is, the value that was written into the memory,
not whatever internal marker you use in hardware to monitor and
guard those stores until the atomic event completes or is
otherwise exited, aborted, cancelled, interrupted, or however
you prefer to refer to it not completing successfully. That is
a totally separate thing and I do not presume to speak of how it's
done, though in fairness to you, you've made it reasonably clear.
Does that clarify what I have been asking about? I honestly
thought this would be obvious from context.
Again, I think you meant that, if an atomic event that sets a
lock variable fails,
The event has NO lock variable--it has up to 8 cache line address
monitors.
The structure member is literally called `lock` and the context
is software spinlocks. Note, spinlocks because they're simple
and easy to discuss, not because they're great mutex primitives.
As frustrating as this evidently is to you, it strikes me that
it actually reinforces my point: hardware has no insight into
what the software is _doing_ at a higher semantic level.
whether it got far enough along to set the
variable inside the event or not is not observable if the event
fails.
Non-participating cache lines can be used to leak information
out of the event (like for debug purposes). Participating data
does not leak.
But if the event succeeds, then the variable is set and
that's that. IF that's the case, then great! But if not, then
disappearing locks are bad.
It is NOT a FRIGGEN LOCK--
I am talking about the spinlock. That is a software object, and
in that context, it is very much called a "lock".
I understand when you say that it is not a lock, it is not a
lock in the sense relevant to your world in hardware, but that's
not what _I'm_ talking about.
I'm trying to square what you meant when you talked about
"disappearing locks" with what happens to those software objects
_after_ an atomic event modifying them successfully completes.
Evidently you were talking about something else entirely, and
that's fine. I gather it has nothing to do with _my_ notion of
locks after they've been successfully acquired; I was just
trying to confirm that.
it is a set of address monitors which
are used to determine the success of the event -- the monitors
remain invisible to SW just like miss buffers remain invisible
to SW. The monitors are not preservable across privilege changes.
See above. Tell me what you want me to call it and I'll use
that terminology instead.
This is especially murky to me
because there seem to be bits of OS primitives, like threads and
priority, that are known directly to the hardware.
Just its instantaneous privilege level--which is congruent to
its register file, Root Pointer, ASID, and operating modes.
What is "it" here?
------------------
Ok, so perhaps this is a reasonable implementation of spinlock
acquire, as I wrote to Stefan yesterday:
void
spinlock_acquire(volatile Spinlock *lk)
{
    while (esmLOCKload(&lk->lock) != 0)
        cpu_spin_hint();
    esmLOCKstore(&lk->lock, 1);
}
Yes? If this is preempted, or another CPU wins the race on the
lock, the hardware will back this up and start over in the while
loop, right?
Yes.
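(For completeness, the matching release presumably needs no event at
all, since only one location is involved; a minimal sketch, with the
ordering caveat called out in a comment. Not anyone's confirmed code.)
void
spinlock_release(volatile Spinlock *lk)
{
    // A single aligned store is already atomic; on a real machine one
    // would also want whatever release-ordering fence (if any) the
    // architecture requires before publishing the unlock.
    lk->lock = 0;
}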
------------
Remember only the lines participating in the event count. For
example, say one had sprintf() inside the event. None of the
lines touched by sprintf() are participating in the event,
and so do not count against the 8 cache lines available.
This doesn't change the point, though. Anything that touches
more than 8 cache lines is not suitable for an atomic event.
I see what you are saying, but the event is allowed to have an
unbounded number of touches to non-participating lines.
I don't see how that's helpful if I need to touch data in more
than 8 cache lines atomically.
Insertion into a doubly linked list is already up to 6 lines,
I moved a doubly linked element from a CDS and inserted it
somewhere else doubly-linked in the same CDS in 6 cache lines.
You need to refine how you count.
<Tongue-in-cheek mode on>
Ah, yes. The program that would dump core if you tried to move
from the head of a list or the rear and that couldn't insert at
the head pointer was, in fact, called `MoveElement`. My
mistake, even if it was clear that that's what I was talking
about, given that you quoted where I referred to it in the context
of work stealing later on. ;-P
BOOLEAN InsertElement( Element *el, Element *to )
{
    tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        el->next = tn;
        el->prev = to;
        to->next = el;
        esmLOCKstore( tn->prev, el );
        return TRUE;
    }
    return FALSE;
}
If you're going to be pedantic when I slip up and call "move"
"insert", then let's be pedantic.
This code won't compile, because of the obvious error of not
defining `tn`. Even if you fix that, it's not correct if you
try to insert at the end of the list. It's not clear how you
would insert at the head of a list, or into an empty list, but
that depends on how you represent the list itself. I.e., the
head may just be a pointer to the first element, or it might be
an otherwise unused node; something like this:
typedef struct List List;
struct List {
    Element head;
};
where the first element in the list is always `head.next`.
It would help if you used standard types, like `bool`, which has
been standard for more than 25 years now, since it was
introduced in C99 as `_Bool` with accompanying typedefs and
macros in `stdbool.h` for `bool`, `true`, and `false`.
And doesn't this presume that `el->next`, `el->prev` and `el`
are all on the same cache line? Similarly with `tn` and
`tn->prev`?
You need to refine how you write programs. ;-}
Here's a better version, that assumes the list representation
above.
#include <stddef.h>
#include <stdbool.h>

typedef struct Element Element;
struct Element {
    Element *next;
    Element *prev;
};

typedef struct List List;
struct List {
    Element head;
};

Element *
ListBegin(List *list)
{
    return list->head.next;
}

bool
InsertElement(Element *el, Element *to)
{
    Element *tn = esmLOCKload(to->next);
    esmLOCKprefetch(el->next);
    esmLOCKprefetch(el->prev);
    if (tn != NULL)
        esmLOCKprefetch(tn->prev);
    if (!esmINTERFERENCE()) {
        el->next = tn;
        el->prev = to;
        to->next = el;
        esmLOCKstore(tn->prev, el);
        return true;
    }
    return false;
}
Inserting at the head, or if the list is empty, are both easy
now: `InsertElement(&list.head, el)`. But note, again, that the
first element of the list is really `list.head.next`, hence the
`ListBegin` function.
Here I count up to three, but possibly up to four, cache lines.
Note that if you did not represent the list this way, in order
to support generalized insertions anywhere in the list, you'd
either need a circular representation (which is annoying to use
in practice), or you'd most likely have to support separate
head and tail pointers and two operations, one to insert before,
and another after, the 'to' pointer. But then you'd have to be
prepared to update either the head or tail pointers (or both if
the list is empty); either way it's potentially two more cache
lines to account for, though if the list is empty you don't have
to touch `tn->prev`.
recall that my original hypothetical also involved popping off a
queue, which adds another (the queue head). If I threw in
incrementing a couple of counters I can't do it. And this seems
like a perfectly reasonable real-world scenario; consider a work
stealing thread scheduler that takes a task from one core to
another and keeps some stats on how things move.
Or maybe even just swapping places between two elements in a
linked list. For example:
The example has 3 structures, each requiring 3 participating lines
{themselves, the struct their next pointer points at, and the struct
their prev pointer points at}.
3×3 = 9 and there is no getting around it.
Yup. So what does the hardware do if I try it?
-------------------------
But there isn't a lot of publicly available documentation for
any of this, at least none that I could find, and from the way
it's been described thus far it seems like there are a number of
edge cases that it is not clear how one is supposed to address.
So I want to understand the boundaries of where these hardware
facilities are applicable, given architectural limitations. I
think this thread has shown that those boundaries exist. The
universe of interesting and useful software is big, and someone,
somewhere, is going to trip over those limitations sooner or
later. Once we accept that, we must ask, "Then what?" It seems
obvious that programmers are going to have to find techniques to
work around those limitations, and that the mechanisms chosen
are unknowable to hardware.
Fair enough.
But now we're back in the same boat of software having to deal
with all the thorny problems that these mechanisms were designed
to address in the first place, like priority inversion: take the
spinlock from earlier, the hardware has no semantic knowledge
that the lock is used as a mutex, and can't know; it's just some
bit in memory somewhere that was set during an atomic event.
It is not a FRIGGEN lock, it is a set of address monitors which
guard the instructions in the event.
Now I have to wonder if you're deliberately ignoring the context.
I called it a "lock" in the same sentence I said it was a
"spinlock". Are you seriously going to tell me that you could
not figure out that the "lock" I was referring to was the
spinlock? You know that that's a software thing; you yourself
said that.
Note the words, "it's just some bit in memory that _was set
during an atomic event._" Is it not clear that the event is
over? It's done. It was successful. All of the address
monitors guarding instructions in the event are no longer
relevant; they've been retired or whatever.
Now, some (software) thread holds a (software) (spin)lock.
I've tried over and over to make this clear, yet you insist on
going back to talking about what happens _during_ an atomic
event. I'm talking about _after_ the event completed
_successfully_.
A Lock is a SW concept
which requires a value change to announce to others that "you
got it".
That's exactly what `spinlock_acquire` from above _does_, isn't
it?
You cannot seriously expect me to accept that you did not
understand this.
Maybe ESM helps in a race when trying to acquire the spinlock,
but once it's acquired, it's acquired, and because the spinlock
is not a primitive understood by the hardware, the hardware has
no way of incorporating the spinlock's semantics into its view
of priority, threading, etc.
Consider a 1024 core system and the time-quantum goes off and
every core wants to rotate the thread it is running; putting
it at the back of the execution queue and taking off the one on
the front to run. Nick McLaren timed a big SUN server (100 core)
on this and it could take up to 6 seconds !! for something that
should take a few milliseconds. This is because of the BigO( n^3 )
nature of bus traffic wrt SNOOPing on that system.
This was solved with per-core run queues.
You're senior enough that I would think it clear that spinlocks
are just a convenient illustrative example here because they're
so simple, but their pitfalls as core counts increase are well
studied and understood. In the real world, mutex structures
like MCS locks or CLH locks scale approximately linearly with
core-count under contention. I posted a reference to this
earlier, but here it is again. You may have a hard time
accepting it because they call the things that they refer to,
"locks". ;-}
https://pdos.csail.mit.edu/papers/linux:lock.pdf
One can use ESM to convert that particular case to BigO( 3 )
or in the general case of random insert and removal times:
BigO( ln( n ) ).
Modern scheduling algorithms already run in amortized constant
time on conventional hardware.
So while an atomic event might help me in the implementation of
a generic mutex of some kind, so that I can protect critical
sections that can't be handled with the atomic event framework
directly because of their size, that's the extent of the help it
can give me, because once the mutex is acquired the hardware has
no insight into the fact that it's a mutex.
Exactly:: once you convert address-based mutual exclusion to
data-based mutual exclusion, HW no longer sees it at all.
That's what I've been saying all along.
So if some higher
priority thing subsequently comes along and preempts the thread
holding that lock and tries to take that lock itself, then the
hardware can't help me out anymore, and I'm back to a
priority-inversion deadlock that has to be dealt with by
software.
HW can only "take" address based locking, not data-based locking.
Precisely. So it can't help me with priority inversion
generally.
The point is, even if the hardware gives me a cool
mechanism for helping prevent priority inversion problems,
programmers are inevitably going to have to handle those
situations themselves anyway.
Never promised you a rose garden.
It's ok; I've already been through Parris Island. And OCS at
Quantico, AFG deployment, la la la. But that's neither here nor
there. Point is, I don't need your roses, and I probably don't
want your threads or anything more advanced than things that let
me build my own synchronization primitives, threads, priority
abstractions, and so on.
This is the sort of thing I'm trying to communicate, it's really
not me sitting here tossing out barbs like, "oh yeah?! Well,
how about THIS one, smart guy?"
Look at it this way:: over the decade from 2000-2010 x86 added
a new ATOMIC instruction every iteration.
Using ASF or ESM; my architecture would never have to add one.
SW has the tools to build what it wants/needs.
Ok, now we're getting somewhere useful. That's handy.
Look, why don't you put the documentation of this stuff
somewhere publicly accessible? It'd probably save people like
me a lot of time, frustration, and dad-level insults. Probably
save you some frustration as well.
- Dan C.
Somehow the post was pushed before being edited. There seems to be
some control sequence that finds and pushes the Post-button--this
is not the only web site this happens to me on.
On Fri, 28 Mar 2025 1:01:59 +0000, Dan Cross wrote:
-----------------------
In article <601781e2c91d42a73526562b419fdf02@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
The property that if an interrupt (or exception or SVC or SVR)
prevents executing the event to its conclusion, HW makes the
event look like it never started. So that when/if control returns
we have another chance to execute the event as-if ATOMICally.
Ok. So let's be crystal clear here: after the event concludes
successfully, a subsequent exception, interrupt, or whatever,
will not magically roll back any of the values set and made
visible system-wide as a result of the successful conclusion of
the atomic event. Correct? That's what I've been asking.
If the esmLOCKstore() retires, the participating memory becomes visible
at that instant and cannot be subsequently undone.
If the esmLOCKabandon( ) retires, participating memory reverts to its
pre-event state in that instant as if the event never happened.
Ok. So let's be crystal clear here: after the event concludes
successfully, a subsequent exception, interrupt, or whatever,
will not magically roll back any of the values set and made
visible system-wide as a result of the successful conclusion of
the atomic event. Correct? That's what I've been asking.
I don't really understand how that can be a question. It would
completely invalidate the notion of the chunk of code being executed
*atomically*.
If a value has been made visible system-wide, it's like sending an
email: you can't undo that.
On Thu, 27 Mar 2025 17:19:21 +0000, Dan Cross wrote:
-------------------
In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
Or maybe even just swapping places between two elements in a
linked list. For example:
void
swap_places(Node **head, Node *a, Node *b)
{
Node *hp, *an, *ap, *bn, *bp;
assert(head != NULL);
assert(a != NULL);
assert(b != NULL);
if (a == b)
return;
esmLOCKprefetch(*head);
This should be a load not prefetch--you want the value of *head
if (*head == a) // see !
*head = b;
else if (*head == b)
*head = a;
There is an ESM rule that states:: all participating cache lines
must be touched before any participating cache lines can be
modified.
Also note: Participating cache lines are checked for write permission
at touch time, and on cache miss, read with intent to modify.
an = esmLOCKload(a->next);
ap = esmLOCKload(a->prev);
bn = esmLOCKload(b->next);
bp = esmLOCKload(b->prev);
b->next = an;
if (an != NULL) {
esmLOCKprefetch(an->prev);
an->prev = b;
}
b->prev = ap;
if (ap != NULL) {
esmLOCKprefetch(ap->next);
ap->next = b;
}
a->next = bn;
if (bn != NULL) {
esmLOCKprefetch(bn->prev);
bn->prev = a;
}
if (bp != NULL) {
esmLOCKprefetch(bp->next);
bp->next = a;
}
esmLOCKstore(a->prev, bp);
}
What I think you want:: (ignoring the 9 participants limit)
void
swap_places(Node **head, Node *a, Node *b)
{
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);
    if (a == b)
        return;

top_of_ATOMIC_event:
    // this is the recovery point if you don't use esmINTERFERENCE()
    // the very next instruction begins the event.
    esmLOCKprefetch( an = esmLOCKload(a->next) );
    esmLOCKprefetch( ap = esmLOCKload(a->prev) );
    esmLOCKprefetch( bn = esmLOCKload(b->next) );
    esmLOCKprefetch( bp = esmLOCKload(b->prev) );
    Node *Ehead = esmLOCKload(*head);

    // by placing all the touching before any manifestation, you put
    // all the touch latency* in the code before it has tried to damage any
    // participating memory location. (*) and TLB latency and 2nd party
    // observation of your event.

    // this would be the point where you would insert if( esmINTERFERENCE() )
    // if you wanted control at a known failure point rather than at the
    // top of the event on failure.

    if (Ehead == a)
        *head = b;
    else if (Ehead == b)
        *head = a;

    b->next = an;
    if (an != NULL) {
        an->prev = b;
    }
    b->prev = ap;
    if (ap != NULL) {
        ap->next = b;
    }
    a->next = bn;
    if (bn != NULL) {
        bn->prev = a;
    }
    if (bp != NULL) {
        bp->next = a;
    }
    esmLOCKstore(a->prev, bp);
}
// now manifestation has lowest possible latency (as seen by this core alone)
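(To make the esmINTERFERENCE() comment above concrete, here is a small,
self-contained sketch of the "known failure point" pattern on a simpler
operation: atomically moving a quantity between two counters, with a
software-bounded retry instead of unbounded HW replay. The helper and
its retry policy are hypothetical; the esm calls, including the
esmLOCKabandon() named elsewhere in the thread, follow their usage here.)
#include <stdint.h>
#include <stdbool.h>

// Transfer 'amount' from one counter to another as a single ATOMIC event.
// Both participating lines are touched before either is modified; the
// explicit esmINTERFERENCE() test gives SW a known failure point where it
// can choose to back off or give up rather than replay from the top.
bool transfer(uint64_t *from, uint64_t *to, uint64_t amount)
{
    for (int tries = 0; tries < 4; tries++) {
        uint64_t f = esmLOCKload(*from);
        uint64_t t = esmLOCKload(*to);
        if (!esmINTERFERENCE()) {
            if (f < amount) {
                esmLOCKabandon();           // nothing becomes visible
                return false;
            }
            *from = f - amount;
            esmLOCKstore(*to, t + amount);  // both updates visible at once
            return true;
        }
        // interference detected at this known point: retry under SW control
    }
    return false;
}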
On Thu, 13 Mar 2025 23:34:22 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 13 Mar 2025 21:14:08 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:
Most modern devices advertise the MSI-X capability instead.
And why not:: it's just a few more flip-flops and almost no more
sequencing logic.
I've seen several devices with more than 200 MSI-X vectors;
that's 96 bits per vector to store a full 64-bit address and
32-bit data payload.
At this point, with 200+ entries, flip-flops are not recommended,
instead these would be placed in a RAM of some sort. Since RAMs
come in 1KB and 2KB quanta; we have 1 of 2K and 1 of 1K and we
have 256 said message containers, with 1 cycle access (after
you get to that corner of some chip).
That is 200 _per function_. Consider a physical function that
supports the SRIOV capability and configures 2048 virtual
functions. So that's 2049 * 200 MSI-X vectors just for one
device. It's not unusual. These vectors are, of course,
stored on the device itself. Mostly in RAMs.
In My 66000, said device can place those value-holding containers
in actual DRAM should it want to punt the storage.
This adds latency and decreases on-die storage. At that point, the device has an
unlimited number of "special things".
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 13 Mar 2025 23:34:22 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 13 Mar 2025 21:14:08 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:
Most modern devices advertise the MSI-X capability instead.
And why not:: it's just a few more flip-flops and almost no more
sequencing logic.
I've seen several devices with more than 200 MSI-X vectors;
that's 96 bits per vector to store a full 64-bit address and
32-bit data payload.
At this point, with 200+ entries, flip-flops are not recommended;
instead these would be placed in a RAM of some sort. Since RAMs
come in 1KB and 2KB quanta; we have 1 of 2K and 1 of 1K and we
have 256 said message containers, with 1 cycle access (after
you get to that corner of some chip).
That is 200 _per function_. Consider a physical function that
supports the SRIOV capability and configures 2048 virtual
functions. So that's 2049 * 200 MSI-X vectors just for one
device. It's not unusual. These vectors are, of course,
stored on the device itself. Mostly in RAMs.
In My 66000, said device can place those value-holding containers
in actual DRAM should it want to punt the storage.
Any device can do that today if it is designed to do so. But that
requires an additional DMA operation to send an MSI-X interrupt;
the device must first read the address and data fields from
host dram (as configured by the device driver) before building
the inbound memory write TLP that gets sent from the device to
the root port.
Just adds latency and requires the driver to allocate space and
tell the device the base address of the array of vectors.
This adds latency
and decreases on-die storage. At that point, the device has an
unlimited number of "special things".
Most PCIe devices are third party IP, which you must live with.
Your on-chip PCIe-like devices can do as they wish, unless they're
standard IP such as a Synopsis SATA controller, or third party
network controller PCIe endpoint IP.
I will just note that the high-end SoCs with on-chip PCIe-like
devices that I'm familiar with all use SRAMs for the MSI-X vectors
(unless there is only a couple, in which case flops work).
The latency overhead of fetching the vector from DRAM is
prohibitive for high-speed devices such as network controllers.
On Fri, 14 Mar 2025 14:52:47 +0000, Scott Lurndal wrote:
We CPU guys deal with dozens of cores, each having 2×64KB L1 caches,
a 256KB-1024KB L2 cache, and have that dozen cores share a 16MB L3
cache. This means the chip contains 26,624 1KB SRAM macros.
Was thinking about this last night::
a) device goes up and reads DRAM via L3::MC and DRC
b) DRAM data is delivered to device 15ns later
c) device uses data to send MSI-X message to interrupt 'controller'
d) interrupt controller in L3 sees interrupt
{to hand waving accuracy}
So, we have a dozen ns up the PCIe tree, a dozen ns over the interconnect,
50ns in DRAM, dozens of ns over the interconnect, dozens of ns down the
PCIe tree, 1ns at the device, a dozen ns up the PCIe tree, dozens across
the interconnect, arriving at the interrupt service port after 122 ns or
about the equivalent of 600± clocks to log the interrupt into the
table.
The Priority broadcast is going to take another dozen ns, core
request for interrupt will be another dozen to service controller,
even if the service port request is serviced instantaneously,
the MSI-X message does not arrive at core until 72ns after arriving
at service port--for a best case latency on the order of 200 ns
(or 1000 CPU cycles or ~ 2,000 instructions worth of execution.)
And that is under the assumption that no traffic interference
is encountered up or down the PCIe trees.
whereas::
if the device DRAM read request was known to contain an MSI-X
message,
The latency overhead of fetching the vector from DRAM is
prohibitive for high-speed devices such as network controllers.
Here we have the situation where one can context switch in a lower
number of clock cycles than one can deliver an interrupt from
a device to a servicing core.
On Fri, 14 Mar 2025 17:35:23 +0000, Scott Lurndal wrote:
Here we have the situation where one can context switch in a lower
number of clock cycles than one can deliver an interrupt from
a device to a servicing core.
Device needs to send an interrupt when vectors stored in host DRAM
instead of internal SRAM or flops:
- send non-posted MRD TLP to RC to fetch MSI-X address
- receiver (pcie controller (RC), for example) passes
MRD address to IOMMU for translation (assuming
the device and host don't implement ATS),
IOMMU translates (table walk latency) the
address from the TLP to a host physical
address (which could involve two levels of
translation, so up to 22 DRAM accesses (intel/amd/aarch64)
on IOMMU TLB miss). The latency is dependent
upon the IOMMU table format - Intel has EPT
while ARM and AMD use the same format as the CPU
page tables for the IOMMU tables.
(this leaves out any further latency hit when
using the PCI Page Request Interface (PRI) to make
the target page resident).
- LLC/DRAM satisfies the MRD and returns data to
PCIe controller, which sends a completion TLP
to device. LLC (minimum), DRAM (maximum) latency added.
- RC/host sends response with address to device
- Device sends non-posted MRD TLP to RC to fetch MSI-X Data
(32-bit). Again with the IOMMU, but will likely
hit TLB. Lesser latency than a miss, but nonzero.
- RC returns completion TLP to device.
- Device sends MWR TLP (data payload) with the translated
address to the root complex, which passes it to the
internal bus structure for routing to the final address
(interrupt controller).
- Then add the latency from the interrupt controller to
the target core (which may include making the target guest
resident).
That's a whole pile of latency to send an interrupt.
I bet the MSI-X messages would cache on the device rather well ...
as they change roughly at the rate of VM creation.
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 14 Mar 2025 14:52:47 +0000, Scott Lurndal wrote:
We CPU guys deal with dozens of cores, each having 2×64KB L1 caches,
a 256KB-1024KB L2 cache, and have that dozen cores share a 16MB L3
cache. This means the chip contains 26,624 1KB SRAM macros.
You've a lot more area to work with, and generally a more
recent process node.
Was thinking about this last night::
a) device goes up and reads DRAM via L3::MC and DRC
b) DRAM data is delivered to device 15ns later
15ns? That's optimistic and presumes a cache hit, right?
Don't forget to factor in PCIe latency (bus to RC and RC to endpoint).
c) device uses data to send MSI-X message to interrupt 'controller'
d) interrupt controller in L3 sees interrupt
{to hand waving accuracy}
So, we have a dozen ns up the PCIe tree, a dozen ns over the interconnect,
50ns in DRAM, dozens of ns over the interconnect, dozens of ns down the
PCIe tree, 1ns at the device, a dozen ns up the PCIe tree, dozens across
the interconnect, arriving at the interrupt service port after 122 ns or
about the equivalent of 600± clocks to log the interrupt into the
table.
The Priority broadcast is going to take another dozen ns, core
request for interrupt will be another dozen to service controller,
even if the service port request is serviced instantaneously,
the MSI-X message does not arrive at core until 72ns after arriving
at service port--for a best case latency on the order of 200 ns
(or 1000 CPU cycles or ~ 2,000 instructions worth of execution.)
And that is under the assumption that no traffic interference
is encountered up or down the PCIe trees.
whereas::
if the device DRAM read request was known to contain an MSI-X
message,
You can't know that a priori,
it's just another memory write
(or read if you need to fetch the address and data from DRAM)
TLP as part of the inbound DMA. Which needs to hit the IOMMU
first to translate the PCI memory space address to the host
physical address space address.
If the MSI-X tables were kept in DRAM, you also need to include
the IOMMU translation latency in the inbound path that fetches
the vector address and vector data (96-bits, so that's two
round trips from the device to memory). For a virtual function,
the MSI-X table is owned and managed by the guest, and all
transaction addresses from the device must be translated from
guest physical addresses to host physical addresses.
A miss in the IOMMU adds a _lot_ of latency to the request.
So, that's three round trips from the device to the
Uncore/RoC just to send a single interrupt from the device.
The latency overhead of fetching the vector from DRAM is
prohibitive for high-speed devices such as network controllers.
Here we have the situation where one can context switch in a lower
number of clock cycles than one can deliver an interrupt from
a device to a servicing core.
Device needs to send an interrupt when vectors stored in host DRAM
instead of internal SRAM or flops:
- send non-posted MRD TLP to RC to fetch MSI-X address
- receiver (pcie controller (RC), for example) passes
MRD address to IOMMU for translation (assuming
the device and host don't implement ATS),
IOMMU translates (table walk latency) the
address from the TLP to a host physical
address (which could involve two levels of
translation, so up to 22 DRAM accesses (intel/amd/aarch64)
on IOMMU TLB miss). The latency is dependent
upon the IOMMU table format - Intel has EPT
while ARM and AMD use the same format as the CPU
page tables for the IOMMU tables.
(this leaves out any further latency hit when
using the PCI Page Request Interface (PRI) to make
the target page resident).
- LLC/DRAM satisfies the MRD and returns data to
PCIe controller, which sends a completion TLP
to device. LLC (minimum), DRAM (maximum) latency added.
- RC/host sends response with address to device
- Device sends non-posted MRD TLP to RC to fetch MSI-X Data
(32-bit). Again with the IOMMU, but will likely
hit TLB. Lesser latency than a miss, but nonzero.
- RC returns completion TLP to device.
- Device sends MWR TLP (data payload) with the translated
address to the root complex, which passes it to the
internal bus structure for routing to the final address
(interrupt controller).
- Then add the latency from the interrupt controller to
the target core (which may include making the target guest
resident).
That's a whole pile of latency to send an interrupt.
Generally when I write queue-code, I use a dummy Node front/rear
such that the checks for Null are unnecessary (at the cost of
following 1 more ->next or ->prev). That is Q->head and Q->tail
are never NULL and when the queue is empty there is a Node which
carries the fact the queue is empty (not using NULL as a pointer).
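(A minimal sketch of that convention, for readers following along; the
names and layout are illustrative, not Mitch's actual queue code.)
typedef struct Node Node;
struct Node {
    Node *next;
    Node *prev;
};

typedef struct Queue {
    Node head;     // dummy front node, never removed
    Node tail;     // dummy rear node, never removed
} Queue;

void QueueInit(Queue *q)
{
    q->head.next = &q->tail;  q->head.prev = NULL;
    q->tail.prev = &q->head;  q->tail.next = NULL;
}

// Insert at the rear: no NULL checks are needed because the dummy nodes
// guarantee every real node always has non-NULL next and prev neighbors.
void Enqueue(Queue *q, Node *n)
{
    Node *last = q->tail.prev;
    n->next = &q->tail;
    n->prev = last;
    last->next = n;
    q->tail.prev = n;
}

// Remove from the front; the empty case is detected by the dummies
// pointing at each other, not by a NULL pointer inside the list.
Node *Dequeue(Queue *q)
{
    Node *first = q->head.next;
    if (first == &q->tail)
        return NULL;
    q->head.next = first->next;
    first->next->prev = &q->head;
    return first;
}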
In article <34434320650f5844b18b1c0b684acf43@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Thu, 27 Mar 2025 17:19:21 +0000, Dan Cross wrote:
-------------------
In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
Or maybe even just swapping places between two elements in a
linked list. For example:
void
swap_places(Node **head, Node *a, Node *b)
{
Node *hp, *an, *ap, *bn, *bp;
assert(head != NULL);
assert(a != NULL);
assert(b != NULL);
if (a == b)
return;
esmLOCKprefetch(*head);
This should be a load not prefetch--you want the value of *head
I wondered about that. I ASSumed that `esmLOCKprefetch` was
enough to set up a cache line address monitor on `*head`, but if
you tell me I need `esmLOCKload`, then I need `esmLOCKload`:
hp = esmLOCKload(*head);
if (hp == a)
*head = b; // Note I need to set whatever head points to, not hp.
else if (hp == b)
*head = a;
(You can see a vestige of me trying that earlier by the `hp`
temporary that I left in when I posted that initially, but I
didn't like how it looked, so I changed it to what was posted.)
if (*head == a) // see !
*head = b;
else if (*head == b)
*head = a;
There is an ESM rule that states:: all participating cache lines
must be touched before any participating cache lines can be
modified.
Are these rules in a reference somewhere I can see? I kinda
reckoned `esmLOCKprefetch` would be enough to "touch" the line,
but I fully admit I was just guessing. Looks like `prefetch` is
mostly for the case where you'll store, but don't need to read?
Also note: Participating cache lines are checked for write permission
at touch time, and on cache miss, read with intent to modify.
Ok.
an = esmLOCKload(a->next);
ap = esmLOCKload(a->prev);
bn = esmLOCKload(b->next);
bp = esmLOCKload(b->prev);
b->next = an;
if (an != NULL) {
esmLOCKprefetch(an->prev);
an->prev = b;
}
b->prev = ap;
if (ap != NULL) {
esmLOCKprefetch(ap->next);
ap->next = b;
}
a->next = bn;
if (bn != NULL) {
esmLOCKprefetch(bn->prev);
bn->prev = a;
}
if (bp != NULL) {
esmLOCKprefetch(bp->next);
bp->next = a;
}
esmLOCKstore(a->prev, bp);
}
What I think you want:: (ignoring the 9 participants limit)
You mean 8 participants limit? That counting thing again. ;-}
What I really wanted was an example that exceeded that limit as
an expository vehicle for understanding what happens when the
limit is exceeded. What does the hardware do in that case?
void
swap_places(Node **head, Node *a, Node *b)
{
Node *hp, *an, *ap, *bn, *bp;
assert(head != NULL);
assert(a != NULL);
assert(b != NULL);
if (a == b)
return;
top_of_ATOMIC_event:
// this is the recovery point if you don't use esmINTERFERENCE()
// the very next instruction begins the event.
Yup. Tracking.
esmLOCKprefetch( an = esmLOCKload(a->next) );
esmLOCKprefetch( ap = esmLOCKload(a->prev) );
esmLOCKprefetch( bn = esmLOCKload(b->next) );
esmLOCKprefetch( bp = esmLOCKload(b->prev) );
What exactly does this do? This "touches" whatever lines the
things that `an`, `ap`, `bn`, and `bp` are all in, as well as
the lines that the `a` and `b` prev and next pointers point
into as well, right?
Node *Ehead = esmLOCKload(*head);
You shoulda just used the `hp` local I left for you for exactly
this. :-D
// by placing all the touching before any manifestation, you put
// all the touch latency* in the code before it has tried to damage any
// participating memory location. (*) and TLB latency and 2nd party
// observation of your event.
// this would be the point where you would insert if( esmINTERFERENCE() )
// if you wanted control at a known failure point rather than at the
// top of the event on failure.
if (Ehead == a)
*head = b;
else if (Ehead == b)
*head = a;
b->next = an;
if (an != NULL) {
an->prev = b;
}
b->prev = ap;
if (ap != NULL) {
ap->next = b;
}
a->next = bn;
if (bn != NULL) {
bn->prev = a;
}
if (bp != NULL) {
bp->next = a;
}
esmLOCKstore(a->prev, bp);
}
// now manifestation has lowest possible latency (as seen by this core alone)
Doesn't this code now assume that `an->prev` is in the same
cache line as `an`, and similarly that `bn` is in the same line
as `bn->prev`? Absent an alignment constraint on the `Node`
type, that's not guaranteed given the definition posted earlier.
That may be an odd and ill-advised thing for a programmer to do
if they want their list type to work with atomic events, but it
is possible.
The node pointers and their respective `next` pointers are okay,
so I wonder if perhaps this might have been written as:
void
swap_places(Node **head, Node *a, Node *b)
{
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);
    if (a == b)
        return;

    hp = esmLOCKload(*head);
    esmLOCKprefetch(an = esmLOCKload(a->next));
    ap = esmLOCKload(a->prev);
    esmLOCKprefetch(bn = esmLOCKload(b->next));
    bp = esmLOCKload(b->prev);
    if (an != NULL)                 // I see what you did
        esmLOCKprefetch(an->prev);
    if (bn != NULL) {
        esmLOCKprefetch(bn->prev);
        bn->prev = a;
    }

    if (hp == a)
        *head = b;
    else if (hp == b)
        *head = a;

    b->next = an;
    if (an != NULL)
        an->prev = b;
    b->prev = ap;
    if (ap != NULL)
        ap->next = b;

    // illustrative code
    a->next = bn;                   // ST Rbp,[Ra,#next]
    if (bp != NULL)                 // PNE0 Rbp,T
        bp->next = a;               // ST Ra,[Rbp,#next]
    esmLOCKstore(a->prev, bp);
}
But now the conditional testing whether or not `an` is `NULL` is
repeated. Is the additional branch overhead worth it here?
- Dan C.
In article <8uzEP.115550$Xq5f.66883@fx38.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
Sometimes you really don't want to be interrupted.
And sometimes you don't want to be interrupted unless the
"house is on fire"; I cannot see a time when "the house is
on fire" that you would not want to take the interrupt.
Is there one ?!?
Consider a thread that takes a spinlock; suppose some
high-priority interrupt comes in while the thread is holding
that lock. In response to the interrupt, software decides to
suspend the thread and switch some other thread; that thread
wants to lock the spin lock that the now-descheduled thread is
holding: a classic deadlock scenario.
Terminology: mutexes coordinate mutual exclusion between threads,
spinlocks coordinate mutual exclusion between cpu cores.
Windows "critical sections" are mutexes with a fast path.
A spin lock is simply any lock where you spin trying to acquire
the lock, as opposed to a blocking synchronization protocol.
Here I'm using the terminology of Herlihy and Shavit [Her08].
I'm using terminology in common use on multiprocessor systems like
VMS since 1980's and later WinNT.
Each OS data structure has different reentrancy requirements.
- Ones accessed by threads are guarded by mutexes.
Example on Windows would be the page tables as paging is done by threads.
- Ones accessed by OS software interrupt level are guarded by spinlocks.
Example on Windows is the Scheduler tables.
- Ones accessed by HW interrupts are guarded by interrupt spinlocks.
Example on Windows is device interrupt service routines.
I'm not sure what you're saying here, or rather, how it is
relevant.
Your earlier statement was that "mutexes coordinate mutual
exclusion between threads, spinlocks coordinate mutual exclusion
between cpu cores." But this is not the case.
Taking VMS as an example, from [Gol94], sec 9.5 (page 255):
|A thread of execution running on one processor acquires a
|spinlock to serialize access to the data with threads of
|execution running on other processors. Before acquiring
|the spinlock, the thread of executions raises IPL to block
|accesses by other threads of execution running on the same
|processor. The IPL value is determined by the spinlock being
|locked.
Goldberg and Saravanan seems quite explicit that threads can
acquire and hold spinlocks on OpenVMS, which seems to be
something that you were objecting to?
Turning to Windows, [Rus12] says about spin locks on page 179:
|Testing and acquiring the lock in one instruction prevents a
|second thread from grabbing the lock between the time the first
|thread tests the variable and the time it acquires the lock.
Note that, again, the language here is thread-centric.
By the way, regarding terminology, on page 177 [Rus12] also
says,
|Sections of code that access a nonshareable resource are called
|_critical_ sections. To ensure correct code, only one thread
|at a time can execute in a critical section.
This is the generic definition of a critical section and note
that this is the context in which we are using the term
throughout this thread.
What you referred to earlier as a "critical section" in Windows
is a _userspace_ synchronization primitive with a kernel assist
and mostly unrelated to concerns around interruptibility. From
the description on page 201 of [Rus12], they seem similar to
futexes on Linux, though somewhat more limited: Russinovich et
al states that they cannot be used between processes, while a
Linux futex can if it's in a region of memory shared between
cooperating processes.
Traditional Unix "sleep" and "wakeup" are an example of aI didn't.
blocking protocol, where a thread may "sleep" on some "channel",
yielding the locking thread while the lock cannot be acquired,
presumably scheduling something else until the thread is later
marked runnable by virtue of something else calling "wakeup" on
the sleep channel.
But note that I am deliberately simplifying in order to
construct a particular scenario in which a light-weight
synchronization primitive is being used for mutual exclusion
between concurrent entities, hence mentioning spinlocks
specifically.
Suggesting that spin locks are only applicable to
multiprocessing scenarios is flatly incorrect.
I didn't.
Your exact words were, "spinlocks coordinate mutual exclusion
between cpu cores." See the text quoted above. This is
incorrect in that it ignores that threads are, in context, the
software primitive for the computational entities that are doing
the acquiring, and while the "between cpu core" part is the
usual case, it is not the only case.
I am saying the OS by design guards *ITS* different data structures
with different mechanisms, and those mechanisms must only be used in
a very controlled manner. And if you want to do something that requires
access to one of *ITS* data structures then you must obey *ITS* rules.
This is axiomatic.
But I fail to see how it's relevant to the topic at hand. In
context we're discussing scenarios where a system may want to
disable interrupts entirely, and whether a particular aspect of
a hardware design is sufficient to implement critical sections
generally.
In Windows and VMS, and from what I understand of *nix, spinlocks guard most
of the core OS data structures, with spinlocks acquired at the software
interrupt level. If you are not at that level then the spinlock will not
provide the necessary synchronization for access to that data.
Precisely.
Software may use
the technique to spin on a "busy" bit on some IP in a device for
example, even on a uniprocessor system.
Yes, I'm aware of the technique. That wasn't the point.
A valid response here might be, "don't context switch from the
interrupt handler; use a DPC instead". That may be valid, but
it puts a constraint on the software designer that may be onerous
in practice: suppose the interrupt happens in the scheduler,
while examining a run queue or something. A DPC object must be
available, etc.
That is exactly what DPC/SoftIrq are designed to do - allow the bulk of
the kernel, like the scheduler, non-paged heap management, IO pre and
post processing, to be interrupted without causing reentrancy.
The higher level device interrupt enqueues a request to the lower software
interrupt level, which is processed when it will not cause reentrancy.
That constraint is by design and any good OS will crash if violated.
Further, software must now consider the complexity of
potentially interruptable critical sections. From the
standpoint of reasoning about already-complex concurrency issues
it's simpler to be able to assert that (almost) all interrupt
delivery can be cheaply disabled entirely, save for very
special, specific, scenarios like NMIs. Potentially switching
away from a thread holding a spinlock sort of defeats the
purpose of a spinlock in the first place, which is a mutex
primitive designed to avoid the overhead of switching.
(It occurs to me that perhaps this is why you are going on about
how VMS and windows uses spin locks, and how they interact with
interrupt priority levels.)
This is why I mentioned the terminology thing: threads do not hold
spinlocks, they hold mutexes.
See above. Threads can certainly "hold" a spin lock, as they
can hold any kind of lock. To quote from sec 7.6.1 of [Val96],
page 202:
|On a uniprocessor, if a thread tries to acquire a spin lock
|that is already held, it will loop forever. Multiprocessor
|algorithms, however, must operate correctly regardless of the
|number of processors, which means that they should handle the
|uniprocessor case as well. This requires strict adherence to
|the rule that threads not relinquish control of the CPU while
|holding a spin lock.
When a particular OS defined spinlock is used to guard a particular OS
defined data structure, exactly WHEN the OS allows this decides how
the OS avoids all the nasty interrupt pitfalls you identified.
How is this relevant to what you wrote earlier and my response,
quoted just above? My statement was that threads can hold spin
locks. This is obviously true.
Yes, which is what I was saying.
I am describing HOW that is accomplished inside an OS.
Why? First, it's well-known, and second, that's not relevant to
the discussion at hand.
But it also doesn't seem like that's what you're saying; you
asserted that a thread (a software thread in this context) does
not "hold" a spin lock; that's incorrect.
The way you prevent a thread from relinquishing control while holding a
spinlock is to first switch hats from thread level to software interrupt
level, which blocks the scheduler because it also runs at SWIL,
then acquire the spinlock.
Raising the interrupt priority level from passive (aka thread) level
to SWI level (called dispatch level on Windows) blocks the thread switch.
This ensures non-reentrant algorithms are serialized.
The core may then acquire spinlocks guarding data structures.
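In driver-level code the convention being described looks roughly like
this (a sketch using the documented Windows DDK calls, not code from
either OS: KeAcquireSpinLock raises IRQL to DISPATCH_LEVEL before taking
the lock, and KeReleaseSpinLock drops back afterwards):

#include <ntddk.h>

static KSPIN_LOCK g_lock;   /* assume KeInitializeSpinLock(&g_lock) ran once */

VOID touch_guarded_data(VOID)
{
    KIRQL oldIrql;

    /* Raise this core to DISPATCH_LEVEL (no thread switch can occur here),
       then take the lock to exclude the other cores. */
    KeAcquireSpinLock(&g_lock, &oldIrql);

    /* ... access the structure guarded at dispatch level ... */

    /* Drop the lock, then restore the previous IRQL, which may let a
       pending DPC or a rescheduling pass run. */
    KeReleaseSpinLock(&g_lock, oldIrql);
}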
Just so. Earlier you said that threads cannot hold spin locks;
in the text quoted just above, you said that they do.
As for this specific scenario of specific interrupt priority
levels, I would go further and say that is just a detail. In
the simplest systems, ie, the kind we might reach for as an
expository example when discussing these kinds of issues, one
just disables interrupts completely when holding a spinlock.
And I am describing HOW the OS accomplishes this.
Threads and mutexes are a fiction created by the OS scheduler, and
switching threads while waiting for a mutex is exactly what they are
supposed to do.
Spinlocks are used by cores to coordinate access to shared memory by
things like an OS scheduler looking at a list of threads waiting for a
mutex.
Spinlocks, of the form I was describing, are an optimization for
cases where the critical section guarded by the lock is short,
and the overhead of invoking a scheduler and context switching
to some other thread is higher than just spinning until the lock
becomes available again. The whole point is to avoid blocking
and switching.
E.g., as [Vaj11] puts it (page 98):
|Spin locks have the advantage of preventing pre-emption by the
|OS (due to blocking while waiting for the lock)
That is, the advantage is because the thread does not block and
yield.
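Stripped to its essentials, the sort of spinlock being talked about is no
more than this minimal sketch (C11 atomics; the flag is assumed to start
out as ATOMIC_FLAG_INIT):

#include <stdatomic.h>

typedef struct { atomic_flag f; } spinlock_t;

static inline void spin_lock(spinlock_t *l)
{
    /* Never blocks or yields: just retry until the flag is released. */
    while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
        ;                                   /* spin */
}

static inline void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->f, memory_order_release);
}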
Well, what you are describing is how _a_ specific family of
operating systems that you are familiar with accomplishes this,
and that's all well and good, but it's not all that relevant to
discussing the pitfalls of this "ATOMIC event" proposal or
addressing the OP's questions about scenarios in which interrupt
disable is used.
Disabling the scheduler is the same as raising the priority to SWI level.
Once the scheduler is disabled THEN you may acquire a spinlock
that guards a shared data structure accessed at SWI level.
But you may NOT access a shared data structure accessed at interrupt level.
priority interrupt level. And note that the ISR interrupt priority level
is assigned by the OS itself, usually when the device is mounted.
You are presuming a very specific system architecture and going
into great detail about its implementation. That's fine, and
perhaps useful I suppose, but again, I don't think it's very
relevant to the context of this discussion.
I like to think of it as all having to do with hats.
The cpu is wearing one of three hats: a thread hat when it pretends to be
a time shared execution machine; a core hat when it acts as a single
execution machine running non-reentrant things like a scheduler which
creates the fiction of threads and processes (multiple virtual spaces);
and an interrupt hat when executing nestable reentrant interrupts.
The interrupt mechanism coordinates exactly when and how we switch hats.
I suppose you are lumping system calls into the overall bucket
of "interrupts" in this model, regardless of whether one might
use a dedicated supervisor call instruction (e.g., something
like x86 SYSENTER or SYSCALL or ARM SVC or RISC-V's ECALL)?
Syscall switches a thread from user mode to supervisor mode.
Your statement was that "the interrupt mechanism coordinates
exactly when and how we switch hats". But that's incomplete; if
a system call blocks, for example, the typical behavior is to
yield to the scheduler to pick something else to do. Clearly,
that is another way that coordinates this business of changing
one's hat.
The hat analogy feels strained to me, and too colloquial to be
useful.
I offered it because it answers your prior questions about how
OSes avoid the kinds of deadlocks and race conditions you described.
I'm afraid you answered a question that I did not ask, then.
I am discussing the specifics of this architectural proposal,
and specifically two aspects of it: one is addressing the
proposer's questions about scenarios where one might want to
disable interrupts entirely.
Another, which has emerged somewhat more recently in the thread,
is this mechanism which seems to be a semi-transactional
elaboration of LL-SC primitives.
Any questions I did ask were Socratic in nature, and in the
context of these architectural discussions. I did not ask any
questions about basic design or implementation of
synchronization primitives.
- Dan C.
References:
[Gol94] Ruth E. Goldberg and Saro Saravanan. 1994. _OpenVMS AXP
Internals and Data Structures_ (Vers. 1.5). Digital Press,
Newton, MA.
[Rus12] Mark Russinovich, David A. Solomon and Alex Ionescu.
2012. _Windows Internals: Part 1_. Microsoft Press, Redmond, WA.
On 3/27/2025 7:02 PM, Stefan Monnier wrote:
Thinking about Haskell's STM (Software Transactional Memory) library:
Oh shit. I have had some bad experiences with STM, way back. Damn
live-lock scenarios, etc... Issues with LL/SC live lock wrt false
sharing on the reservation granules, etc...
On Fri, 28 Mar 2025 2:53:57 +0000, Dan Cross wrote:
[snip]
What I really wanted was an example that exceeded that limit as
an expository vehicle for understanding what happens when the
limit is exceeded. What does the hardware do in that case?
Raise OPERATION (instruction) Fault.
[snip]
// by placing all the touching before any manifestation, you put
// all the touch latency* in the code before it has tried to damage any
// participating memory location. (*) and TLB latency and 2nd party
// observation of your event.
// this would be the point where you would insert if( esmINTERFERENCE() )
// if you wanted control at a known failure point rather than at the
// top of the event on failure.
if (Ehead == a)
*head = b;
else if (Ehead == b)
*head = a;
b->next = an;
if (an != NULL) {
an->prev = b;
}
b->prev = ap;
if (ap != NULL) {
ap->next = b;
}
a->next = bn;
if (bn != NULL) {
bn->prev = a;
}
if (bp != NULL) {
bp->next = a;
}
esmLOCKstore(a->prev, bp);
}
// now manifestation has lowest possible latency (as seen by this core alone)
Doesn't this code now assume that `an->prev` is in the same
cache line as `an`, and similarly that `bn` is in the same line
as `bn->prev`? Absent an alignment constraint on the `Node`
type, that's not guaranteed given the definition posted earlier.
Everything you state is true, I just tried to move the esmLOCKxxxx
up to the setup phase to obey ESM rules.
That may be an odd and ill-advised thing for a programmer to do
if they want their list type to work with atomic events, but it
is possible.
The node pointers and their respective `next` pointers are okay,
so I wonder if perhaps this might have been written as:
void
swap_places(Node **head, Node *a, Node *b)
{
Node *hp, *an, *ap, *bn, *bp;
assert(head != NULL);
assert(a != NULL);
assert(b != NULL);
if (a == b)
return;
hp = esmLOCKload(*head);
esmLOCKprefetch(an = esmLOCKload(a->next));
ap = esmLOCKload(a->prev);
esmLOCKprefetch(bn = esmLOCKload(b->next));
bp = esmLOCKload(b->prev);
if (an != NULL) // I see what you did
esmLOCKprefetch(an->prev);
if (bn != NULL) {
esmLOCKprefetch(bn->prev);
bn->prev = a;
}
if (hp == a)
*head = b;
else if (hp == b)
*head = a;
b->next = an;
if (an != NULL)
an->prev = b;
b->prev = ap;
if (ap != NULL)
ap->next = b;
// illustrative code
a->next = bn; // ST Rbp,[Ra,#next]
if (bp != NULL) // PNE0 Rbp,T
bp->next = a; // ST Ra,[Rbp,#next]
esmLOCKstore(a->prev, bp);
}
But now the conditional testing whether or not `an` is `NULL` is
repeated. Is the additional branch overhead worth it here?
In My 66000 ISA, a compare against zero (or NULL) is just a branch
instruction, so the CMP zero is performed twice, but each use is
but a single Branch-on-condition instruction (or Predicate-on-
Condition instruction).
In the case of using predicates, FETCH/DECODE will simply issue
both then and else clauses into the execution window (else-clause
is empty here) and let the reservation stations handle execution
order. And the condition latency is purely the register dependence
chain. A 6-wide machine should have no trouble in inserting two
copies of the code commented by "illustrative code" above--in
this case 6-instructions or 2 sets of {ST, PNE0, ST}.
In the case of using a real branch, latency per use is likely to
be 2-cycles, moderated by typical branch prediction. The condition
will have resolved early, so we are simply FETCH/DECODE/TAKE bound.
{{That is: PRED should be faster in almost all conceivable cases.}}
Generally when I write queue-code, I use a dummy Node front/rear
such that the checks for Null are unnecessary (at the cost of
following 1 more ->next or ->prev). That is Q->head and Q->tail
are never NULL and when the queue is empty there is a Node which
carries the fact the queue is empty (not using NULL as a pointer).
But that is just my style.
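For anyone who has not used that style, a sketch of the dummy-node
arrangement (the Node and Queue layouts here are just assumptions for the
illustration):

typedef struct Node Node;
struct Node { Node *next, *prev; /* payload elided */ };

typedef struct { Node head, tail; } Queue;      /* two embedded sentinels */

void q_init(Queue *q)
{
    q->head.prev = NULL;      q->head.next = &q->tail;
    q->tail.prev = &q->head;  q->tail.next = NULL;
}

int q_empty(const Queue *q) { return q->head.next == &q->tail; }

void q_insert_tail(Queue *q, Node *n)   /* no NULL checks anywhere */
{
    n->prev = q->tail.prev;
    n->next = &q->tail;
    q->tail.prev->next = n;
    q->tail.prev = n;
}

void q_remove(Node *n)                  /* n is a real node, never a sentinel */
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}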
In article <cb049d5490b541878e264cedf95168e1@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Fri, 28 Mar 2025 2:53:57 +0000, Dan Cross wrote:
[snip]
What I really wanted was an example that exceeded that limit as
an expository vehicle for understanding what happens when the
limit is exceeded. What does the hardware do in that case?
Raise OPERATION (instruction) Fault.
Ok, very good. The software implications could be profound.
[snip]
Doesn't this code now assume that `an->prev` is in the same
cache line as `an`, and similarly that `bn` is in the same line
as `bn->prev`? Absent an alignment constraint on the `Node`
type, that's not guaranteed given the definition posted earlier.
Everything you state is true, I just tried to move the esmLOCKxxxx
up to the setup phase to obey ESM rules.
Yeah, that makes sense. Describing it as a rule, however,
raises the question of whether one must touch a line before a
store to a different line, or just before a store to that line?
That may be an odd and ill-advised thing for a programmer to do
if they want their list type to work with atomic events, but it
is possible.
The node pointers and their respective `next` pointers are okay,
so I wonder if perhaps this might have been written as:
void
swap_places(Node **head, Node *a, Node *b)
{
Node *hp, *an, *ap, *bn, *bp;
assert(head != NULL);
assert(a != NULL);
assert(b != NULL);
if (a == b)
return;
hp = esmLOCKload(*head);
esmLOCKprefetch(an = esmLOCKload(a->next));
ap = esmLOCKload(a->prev);
esmLOCKprefetch(bn = esmLOCKload(b->next));
bp = esmLOCKload(b->prev);
if (an != NULL) // I see what you did
esmLOCKprefetch(an->prev);
if (bn != NULL) {
esmLOCKprefetch(bn->prev);
bn->prev = a;
}
if (hp == a)
*head = b;
else if (hp == b)
*head = a;
b->next = an;
if (an != NULL)
an->prev = b;
b->prev = ap;
if (ap != NULL)
ap->next = b;
// illustrative code
a->next = bn; // ST Rbp,[Ra,#next]
if (bp != NULL) // PNE0 Rbp,T
bp->next = a; // ST Ra,[Rbp,#next]
esmLOCKstore(a->prev, bp);
}
But now the conditional testing whether or not `an` is `NULL` is
repeated. Is the additional branch overhead worth it here?
In My 66000 ISA, a compare against zero (or NULL) is just a branch
instruction, so the CMP zero is performed twice, but each use is
but a single Branch-on-condition instruction (or Predicate-on-
Condition instruction).
In the case of using predicates, FETCH/DECODE will simply issue
both then and else clauses into the execution window (else-clause
is empty here) and let the reservation stations handle execution
order. And the condition latency is purely the register dependence
chain. A 6-wide machine should have no trouble in inserting two
copies of the code commented by "illustrative code" above--in
this case 6-instructions or 2 sets of {ST, PNE0, ST}.
In the case of using a real branch, latency per use is likely to
be 2-cycles, moderated by typical branch prediction. The condition
will have resolved early, so we are simply FETCH/DECODE/TAKE bound.
{{That is: PRED should be faster in almost all conceivable cases.}}
Okay, that all makes sense. Thanks for the detailed
explanation. I agree it's very slick; is this documented
somewhere publicly? If not, why not?
Generally when I write queue-code, I use a dummy Node front/rear
such that the checks for Null are unnecessary (at the cost of
following 1 more ->next or ->prev). That is Q->head and Q->tail
are never NULL and when the queue is empty there is a Node which
carries the fact the queue is empty (not using NULL as a pointer).
But that is just my style.
Ok.
- Dan C.
On Sat, 29 Mar 2025 14:28:02 +0000, Dan Cross wrote:
In article <cb049d5490b541878e264cedf95168e1@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Fri, 28 Mar 2025 2:53:57 +0000, Dan Cross wrote:
[snip]
What I really wanted was an example that exceeded that limit as
an expository vehicle for understanding what happens when the
limit is exceeded. What does the hardware do in that case?
Raise OPERATION (instruction) Fault.
Ok, very good. The software implications could be profound.
[snip]
Doesn't this code now assume that `an->prev` is in the same
cache line as `an`, and similarly that `bn` is in the same line
as `bn->prev`? Absent an alignment constraint on the `Node`
type, that's not guaranteed given the definition posted earlier.
Everything you state is true, I just tried to move the esmLOCKxxxx
up to the setup phase to obey ESM rules.
Yeah, that makes sense. Describing it as a rule, however,
raises the question of whether one must touch a line before a
store to a different line, or just before a store to that line?
Right from the spec::
"There is an order of instructions imposed upon software <i.e.,
compiler> where:
• all participating inbound memory reference instructions shall be
performed prior to manifestation;
• non-participating inbound memory reference instructions should be
performed prior to manifestation;
• the <optional> query instruction denotes the boundary between setup
and manifestation;
• the first Outbound to a participating cache line <also> begins
manifestation;
• the only Outbound with Lock bit set (L ≡ 1) completes the event.
The processor monitors the instruction sequence of the ATOMIC event and
will raise the OPERATION exception when the imposed order is violated."
Basically, all participating page-faults happen prior to any
modification attempts.
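To make the ordering rule concrete, here is a smaller event written with
the same esmLOCK* names used in the code above; whether this is exactly
the idiom the spec expects is my reading of the rule just quoted (touch
every participating line with a load or prefetch first, do the plain
stores after, and let the single esmLOCKstore complete the event):

// Atomically pop the first node off one singly linked list and push it
// onto another.  Three participating lines: *from, the node n, and *to.
typedef struct SNode SNode;
struct SNode { SNode *next; };

void move_head(SNode **from, SNode **to)
{
    SNode *n, *rest, *old_to;

    n = esmLOCKload(*from);             // participating: source head
    if (n == NULL)
        return;                         // nothing to move
    rest   = esmLOCKload(n->next);      // participating: the node's line
    old_to = esmLOCKload(*to);          // participating: target head

    // manifestation: stores only after every participating touch
    *from   = rest;
    n->next = old_to;
    esmLOCKstore(*to, n);               // locked store completes the event
}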
That may be an odd and ill-advised thing for a programmer to do
if they want their list type to work with atomic events, but it
is possible.
The node pointers and their respective `next` pointers are okay,
so I wonder if perhaps this might have been written as:
void
swap_places(Node **head, Node *a, Node *b)
{
Node *hp, *an, *ap, *bn, *bp;
assert(head != NULL);
assert(a != NULL);
assert(b != NULL);
if (a == b)
return;
hp = esmLOCKload(*head);
esmLOCKprefetch(an = esmLOCKload(a->next));
ap = esmLOCKload(a->prev);
esmLOCKprefetch(bn = esmLOCKload(b->next));
bp = esmLOCKload(b->prev);
if (an != NULL) // I see what you did
esmLOCKprefetch(an->prev);
if (bn != NULL) {
esmLOCKprefetch(bn->prev);
bn->prev = a;
}
if (hp == a)
*head = b;
else if (hp == b)
*head = a;
b->next = an;
if (an != NULL)
an->prev = b;
b->prev = ap;
if (ap != NULL)
ap->next = b;
// illustrative code
a->next = bn; // ST Rbp,[Ra,#next]
if (bp != NULL) // PNE0 Rbp,T
bp->next = a; // ST Ra,[Rbp,#next]
esmLOCKstore(a->prev, bp);
}
But now the conditional testing whether or not `an` is `NULL` is
repeated. Is the additional branch overhead worth it here?
In My 66000 ISA, a compare against zero (or NULL) is just a branch
instruction, so the CMP zero is performed twice, but each use is
but a single Branch-on-condition instruction (or Predicate-on-
Condition instruction).
In the case of using predicates, FETCH/DECODE will simply issue
both then and else clauses into the execution window (else-clause
is empty here) and let the reservation stations handle execution
order. And the condition latency is purely the register dependence
chain. A 6-wide machine should have no trouble in inserting two
copies of the code commented by "illustrative code" above--in
this case 6-instructions or 2 sets of {ST, PNE0, ST}.
In the case of using a real branch, latency per use is likely to
be 2-cycles, moderated by typical branch prediction. The condition
will have resolved early, so we are simply FETCH/DECODE/TAKE bound.
{{That is: PRED should be faster in almost all conceivable cases.}}
Okay, that all makes sense. Thanks for the detailed
explanation. I agree it's very slick; is this documented
somewhere publicly? If not, why not?
Documented: Yes 21 pages.
Public: I keep holding back as if I were to attempt to patent certain
aspects. But I don't seem to have the energy to do such. I as if you
wanted to see it.
In article <1b9a9644c3f5cbd2985b89443041e01a@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
Okay, that all makes sense. Thanks for the detailed
explanation. I agree it's very slick; is this documented
somewhere publicly? If not, why not?
Documented: Yes 21 pages.
Public: I keep holding back as if I were to attempt to patent certain
aspects. But I don't seem to have the energy to do such. I as if you
wanted to see it.
(I assume you meant "ask if you want to see it"?)
Sure, I think it would be interesting. Should I send you an
email?
- Dan C.
In article <t3CFP.529522$f81.217510@fx48.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
In article <8uzEP.115550$Xq5f.66883@fx38.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
[snip]
Consider a thread that takes a spinlock; suppose some
high-priority interrupt comes in while the thread is holding
that lock. In response to the interrupt, software decides to
suspend the thread and switch some other thread; that thread
wants to lock the spin lock that the now-descheduled thread is
holding: a classic deadlock scenario.
Terminology: mutexes coordinate mutual exclusion between threads,
spinlocks coordinate mutual exclusion between cpu cores.
Windows "critical sections" are mutexes with a fast path.
A spin lock is simply any lock where you spin trying to acquire
the lock, as opposed to a blocking synchronization protocol.
Here I'm using the terminology of Herlihy and Shavit [Her08].
I'm using terminology in common use on multiprocessor systems like
VMS since 1980's and later WinNT.
Each OS data structure has different reentrancy requirements.
- Ones accessed by threads are guarded by mutexes.
Example on Windows would be the page tables as paging is done by threads.
- Ones accessed by OS software interrupt level are guarded by spinlocks.
Example on Windows is the Scheduler tables.
- Ones accessed by HW interrupts are guarded by interrupt spinlocks.
Example on Windows is device interrupt service routines.
I'm not sure what you're saying here, or rather, how it is
relevant.
Your earlier statement was that "mutexes coordinate mutual
exclusion between threads, spinlocks coordinate mutual exclusion
between cpu cores." But this is not the case.
Taking VMS as an example, from [Gol94], sec 9.5 (page 255):
|A thread of execution running on one processor acquires a
|spinlock to serialize access to the data with threads of
|execution running on other processors. Before acquiring
|the spinlock, the thread of executions raises IPL to block
|accesses by other threads of execution running on the same
|processor. The IPL value is determined by the spinlock being
|locked.
Goldberg and Saravanan seem quite explicit that threads can
acquire and hold spinlocks on OpenVMS, which seems to be
something that you were objecting to?
It took me a while to find a copy of that book online.
That 1994 book relates to VMS before they added kernel threads to the OS.
Before you said you were using terminology commonly used on
systems "since [the] 1980s". The systems you cited were VMS and
Windows. I quoted references about both systems from the mid
1990s and mid 2000s, respectively. If that terminology were
common in the 1980s, as you suggested, surely it would be in a
book from 1994? Regardless, see below.
At that point VMS scheduled processes, an execution context plus virtual
space, like most *nix (except I think Solaris which had kernel threads).
It says on pg 3: "The term thread of execution as used in this book
refers to a computational entity, an agent of execution."
This refers generally to the various code sequences from different contexts
that may be intertwined on a processor core for interrupts, exceptions,
Dpc (called driver forks in VMS), process AST's (software interrupts),
and applications.
In a later book on VMS after they added kernel scheduled threads they
consistently use the two terms "thread of execution" and "kernel thread"
to distinguish the two meanings.
Are you referring to the 1997 revision of Goldenberg et al,
that covers VMS 7.0? [Gol97] In that context, VMS "kernel
threads" refer to threads of execution within a program that
are managed and scheduled by the VMS executive (aka kernel).
This is described on page 151, section 5.1:
|A thread of execution is a sequential flow of instruction
|execution. A traditional program is single-threaded,
|consisting of one main thread of execution. In contrast, a
|multithreaded program divides its work among multiple threads
|that can run concurrently. OpenVMS supplies DECthreads, a
|run-time library, to facilitate the creation, synchronization,
|and deletion of multiple user-mode threads.
|
|OpenVMS Alpha Version 7.0 supports multiple execution contexts
|within a process: each execution context has its own hardware
|context and stacks, and can execute independently of the others;
|all execution contexts within a process share the same virtual
|address space. The term _kernel thread_ refers to one of
|those execution contexts and the thread of execution running
|within it. This volume uses the term _multithreaded_ to refer
|to a process with multiple kernel threads.
|
|The kernel thread is now the basis for OpenVMS scheduling.
|[...]
It continues on page 152:
|The term _kernel thread_ is a somewhat confusing name, in that
|instructions executing in a kernel thread can run in any of the
|four access modes, not only kernel mode. The term arose to
|distinguish the current multithreaded model, supported within
|the OpenVMS executive (or kernel), from the previous model,
|supported only by DECthreads.
Sadly, the later books on OpenVMS internals are more focused on
specific subsystems (scheduling and process management; memory
management) than the earlier volumes. Details of the
implementation of synchronization primitives are almost entirely
missing.
There is discussion about how those primitives are _used_,
however. [Gol03] mentions spinlocks, such as the SCHED
spinlock, in the context of memory management data structures.
E.g., on page 51:
|The swappability of the PHD [Process Header] results in several
|different methods for synchronizing access to fields within it.
|Because a PHD can be inswapped to a different balance set slot
|than it last occupied, accesses to a PHD that use its system
|space address must be synchronized against swapper
|interference. Accesses from a kernel thread to its own PHD can
|be made with the SCHED spinlock held to block any rescheduling
|and possible swapping of the process. Holding the MMG spinlock
|is another way to block swapping.
Notably here, that the language is again in terms of (software)
threads. It seems clear enough that in the VMS context,
spinlocks are conceptually held by threads, not cores.
Dan Cross wrote:
In article <t3CFP.529522$f81.217510@fx48.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
In article <8uzEP.115550$Xq5f.66883@fx38.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Dan Cross wrote:
[snip]
Consider a thread that takes a spinlock; suppose some
high-priority interrupt comes in while the thread is holding
that lock. In response to the interrupt, software decides to
suspend the thread and switch some other thread; that thread
wants to lock the spin lock that the now-descheduled thread is
holding: a classic deadlock scenario.
Terminology: mutexes coordinate mutual exclusion between threads,
spinlocks coordinate mutual exclusion between cpu cores.
Windows "critical sections" are mutexes with a fast path.
A spin lock is simply any lock where you spin trying to acquire
the lock, as opposed to a blocking synchronization protocol.
Here I'm using the terminology of Herlihy and Shavit [Her08].
I'm using terminology in common use on multiprocessor systems like
VMS since 1980's and later WinNT.
Each OS data structure has different reentrancy requirements.
- Ones accessed by threads are guarded by mutexes.
Example on Windows would be the page tables as paging is done by threads.
- Ones accessed by OS software interrupt level are guarded by spinlocks.
Example on Windows is the Scheduler tables.
- Ones accessed by HW interrupts are guarded by interrupt spinlocks.
Example on Windows is device interrupt service routines.
I'm not sure what you're saying here, or rather, how it is
relevant.
Your earlier statement was that "mutexes coordinate mutual
exclusion between threads, spinlocks coordinate mutual exclusion
between cpu cores." But this is not the case.
Taking VMS as an example, from [Gol94], sec 9.5 (page 255):
|A thread of execution running on one processor acquires a
|spinlock to serialize access to the data with threads of
|execution running on other processors. Before acquiring
|the spinlock, the thread of execution raises IPL to block
|accesses by other threads of execution running on the same
|processor. The IPL value is determined by the spinlock being
|locked.
Goldberg and Saravanan seem quite explicit that threads can
acquire and hold spinlocks on OpenVMS, which seems to be
something that you were objecting to?
It took me a while to find a copy of that book online.
That 1994 book relates to VMS before they added kernel threads to the OS.
Before you said you were using terminology commonly used on
systems "since [the] 1980s". The systems you cited were VMS and
Windows. I quoted references about both systems from the mid
1990s and mid 2000s, respectively. If that terminology were
common in the 1980s, as you suggested, surely it would be in a
book from 1994? Regardless, see below.
SMP OS (synchronized with spinlocks) was added to VMS in 1988.
VMS clusters were released in 1983 with its locking system and
cluster read-write locks can be used to build other synchronization
structures like mutexes, semaphores, quorums.
I don't know when the term "thread" meaning "scheduled units of execution
within a common address space" first showed up. I found a reference to it on
Unix in 1989. Previously the term "task" was used to mean the same thing,
such as Ada tasks which on DEC Ada in 1986 were similar to what are now
called user mode threads. It looks to me that the DEC Ada runtime library
was later morphed into the DECThreads user mode library released in 1991.
At that point VMS scheduled processes, an execution context plus virtual
space, like most *nix (except I think Solaris which had kernel threads).
It says on pg 3: "The term thread of execution as used in this book
refers to a computational entity, an agent of execution."
This refers generally to the various code sequences from different contexts
that may be intertwined on a processor core for interrupts, exceptions,
Dpc (called driver forks in VMS), process AST's (software interrupts),
and applications.
In a later book on VMS after they added kernel scheduled threads they
consistently use the two terms "thread of execution" and "kernel thread"
to distinguish the two meanings.
Are you referring to the 1997 revision of Goldenberg et al,
that covers VMS 7.0? [Gol97] In that context, VMS "kernel
threads" refer to threads of execution within a program that
are managed and scheduled by the VMS executive (aka kernel).
I saw reference to the 1997 internals book and that kernel threads were added.
However the only other OpenVMS internals document I could find was
OpenVMS Alpha Internals and Data Structures - Memory Management,
Goldenberg et al, 2003.
This is described on page 151, section 5.1:
|A thread of execution is a sequential flow of instruction
|execution. A traditional program is single-threaded,
|consisting of one main thread of execution. In contrast, a
|multithreaded program divides its work among multiple threads
|that can run concurrently. OpenVMS supplies DECthreads, a
|run-time library, to facilitate the creation, synchronization,
|and deletion of multiple user-mode threads.
|
|OpenVMS Alpha Version 7.0 supports multiple execution contexts
|within a process: each execution context has its own hardware
|context and stacks, and can execute independently of the others;
|all execution contexts within a process share the same virtual
|address space. The term _kernel thread_ refers to one of
|those execution contexts and the thread of execution running
|within it. This volume uses the term _multithreaded_ to refer
|to a process with multiple kernel threads.
|
|The kernel thread is now the basis for OpenVMS scheduling.
|[...]
This "kernel thread" is the definition I have been using:
unit of execution *scheduled* by the operating system.
It continues on page 152:
|The term _kernel thread_ is a somewhat confusing name, in that
|instructions executing in a kernel thread can run in any of the
|four access modes, not only kernel mode. The term arose to
|distinguish the current multithreaded model, supported within
|the OpenVMS executive (or kernel), from the previous model,
|supported only by DECthreads.
Sadly, the later books on OpenVMS internals are more focused on
specific subsystems (scheduling and process management; memory
management) than the earlier volumes. Details of the
implementation of synchronization primitives are almost entirely
missing.
There is discussion about how those primitives are _used_,
however. [Gol03] mentions spinlocks, such as the SCHED
spinlock, in the context of memory management data structures.
E.g., on page 51:
|The swappability of the PHD [Process Header] results in several
|different methods for synchronizing access to fields within it.
|Because a PHD can be inswapped to a different balance set slot
|than it last occupied, accesses to a PHD that use its system
|space address must be synchronized against swapper
|interference. Accesses from a kernel thread to its own PHD can
|be made with the SCHED spinlock held to block any rescheduling
|and possible swapping of the process. Holding the MMG spinlock
|is another way to block swapping.
Notably here, that the language is again in terms of (software)
threads. It seems clear enough that in the VMS context,
spinlocks are conceptually held by threads, not cores.
Also Gol91 VAX VMS Internals and Data Structures v5.2 1991
All three, Gol91, Gol94, Rus12 have a section on spinlocks and
explicitly say spinlocks synchronize processors, not threads.
Gol91 pg 172, Gol94 pg 254: "A spinlock is acquired by a processor to
synchronize access to data shared by members of an SMP system."
Rus12 pg 179: "The mechanism the kernel uses to achieve multiprocessor
mutual exclusion is called a spinlock."
Furthermore a search for the word "spinlock " in all three documents
finds repeated references to "processor spinlock". There are NO usages
of "process spinlock" on VMS or "thread spinlock" on WinNT.
It is mutexes that synchronize VMS processes or WNT threads.
The difference is spinlocks do not invoke the scheduler to trigger
an executable context switch, whereas mutexes may invoke the scheduler.
There is a section on mutexes in
Gol91 pg 196, Gol94 pg 277, Rus12 pg 183 Low-IRQL Synchronization
Gol91 pg 25, Gol94 pg 29
"Thus, the VMS executive requires a third synchronization tool to allow >synchronized access to pageable data structures. This tool must also allow
a process to be removed from execution while it maintains ownership of the >structure in question. One synchronization tool that fulfills these >requirements is called a mutual exclusion semaphore (mutex)."
Gol91 pg 196, Gol94 pg 277
"However, in some situations requiring synchronization, elevated IPL is an >unacceptable technique."
...
"Typically, the threads of execution whose accesses are synchronized
through a mutex are process context threads."
Rus12 pg 183 Low-IRQL Synchronization
"Because waiting for a spinlock literally stalls a processor, spinlocks
can be used only under the following strictly limited circumstances"
"There are several additional synchronization mechanisms for use when >spinlocks are not suitable:
- Kernel dispatcher objects
- Fast mutexes and guarded mutexes
- Pushlocks
- Executive resources
Additionally, user-mode code, which also executes at low IRQL, must be able
to have its own locking primitives. Windows supports various
user-mode-specific primitives:
- Condition variables (CondVars)
- Slim Reader-Writer Locks (SRW Locks)
- Run-once initialization (InitOnce)
- Critical sections"
These books consistently say that spinlocks synchronize *processors*,
aka cores, at raised IPL (>= dispatch), and mutexes synchronize scheduled
processes/threads at lower IPL (< dispatch).
The reason is that when a kernel thread, executing a system call,
raises the processor >= dispatch, which is required to acquire a spinlock,
it stops being a scheduled thread, and may have temporarily switched
kernel stacks as processor interrupt levels require their own stacks.
At raised IPL many of the OS data structures associated with a kernel thread
cannot be touched as they might trigger a page fault which can only be
serviced by a kernel thread which can sleep, and not by processors/cores
which can only spin-wait. (Both VMS and WNT will crash/panic if code
page faults at IPL >= dispatch for exactly this reason.)
All of this is for controlling access to non-reentrant shared data.
Non-reentrant data shared between processors/cores is controlled by
IPL and blocked by spinlocks. Non-reentrant data shared between kernel
threads is controlled with scheduler synchronization and thread switching.
In article <tWhGP.1790040$TBhc.304536@fx16.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Also Gol91 VAX VMS Internals and Data Structures v5.2 1991
All three, Gol91, Gol94, Rus12 have a section on spinlocks and
explicitly say spinlocks synchronize processors, not threads.
Gol91 pg 172, Gol94 pg 254: "A spinlock is acquired by a processor to
synchronize access to data shared by members of an SMP system."
Also [Gol91], page 173:
|A resource synchronized through elevated IPL on a uniprocessor
|is synchronized through a combination of spinlock and elevated
|IPL on an SMP system. *A thread of execution* running on one
|processor acquires a spinlock to serialize access to the data
|with *threads of execution* running on other processors.
|Before acquiring, the _thread of execution_ raises IPL to block
|accesses *by other threads of execution* running on the *same*
|processor.
Emphasis added. Note that the primitives that do the acquiring
and holding are threads.
(Note, I already quoted this text from [Gol94] in an earlier
response; page 255.)
Rus12 pg 179: "The mechanism the kernel uses to achieve multiprocessor
mutual exclusion is called a spinlock."
Yes, I already pointed out that this is the usual case. But you
are still ignoring the uniprocessor case: spinlocks must work
there, as well.
But more relevantly, this does not imply that _processors_ "own"
spinlocks in the way you have asserted. That is done by
threads. A thread does not magically "turn into" a processor
when it acquires a spinlock; it remains a thread, albeit one
that is now holding a spinlock that it wasn't before.
Dan Cross wrote:
In article <tWhGP.1790040$TBhc.304536@fx16.iad>,
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Also Gol91 VAX VMS Internals and Data Structures v5.2 1991
All three, Gol91, Gol94, Rus12 have a section on spinlocks and
explicitly say spinlocks synchronize processors, not threads.
Gol91 pg 172, Gol94 pg 254: "A spinlock is acquired by a processor to
synchronize access to data shared by members of an SMP system."
Also [Gol91], page 173:
|A resource synchronized through elevated IPL on a uniprocessor
|is synchronized through a combination of spinlock and elevated
|IPL on an SMP system. *A thread of execution* running on one
|processor acquires a spinlock to serialize access to the data
|with *threads of execution* running on other processors.
|Before acquiring, the _thread of execution_ raises IPL to block
|accesses *by other threads of execution* running on the *same*
|processor.
Emphasis added. Note that the primitives that do the acquiring
and holding are threads.
(Note, I already quoted this text from [Gol94] in an earlier
response; page 255.)
And I've tried to get you to stop treating the terms "thread of execution"
and "kernel thread" as equivalent. They are not.
It says "thread of execution" which as I have explained is not a
necessarily a kernel thread (or in this case a VMS process).
It could be an interrupt or a Dpc, or code triggered as a consequence
of actions of any of those which has no connection to whichever
random kernel thread was running on a processor at that time.
All of those "threads of execution" could acquire a spinlock at their
associated IPL but only one of those is due to the current kernel thread.
Even if that was a kernel thread that raised its IPL and performed
some kernel action, when it tries to lower its IPL < dispatch,
that can trigger the Dispatcher which may invoke the Scheduler
and ThreadSwitcher. As that transition into Dispatcher might trigger
a thread switch, it conceptually[1] dumps its thread register context
in case a switch is needed.
And all of those consequences which were not directly invoked by the
current kernel thread but were triggered by it could acquire spinlocks.
Those triggered "threads of execution" were set up by other
"threads of execution", some of which may be kernel threads,
but are unrelated to the currently executing kernel thread.
On entry to the Dispatcher, the register set, the thread of execution,
conceptually[1] is no longer connected to whichever kernel thread
was running on the processor. The private processor state information
maintains a reference to its current kernel thread whose context will be
restored when it exits the Dispatcher.
[1] I say conceptually because with careful layering of the code for
entering and leaving the Dispatcher/Scheduler/ThreadSwitcher the OS
can defer saving some of the kernel thread's registers into the kernel
thread context until the actual thread switch occurs, as long as it
does not modify those registers. x64 FPU or SIMD registers might be examples.
Rus12 pg 179: "The mechanism the kernel uses to achieve multiprocessor
mutual exclusion is called a spinlock."
Yes, I already pointed out that this is the usual case. But you
are still ignoring the uniprocessor case: spinlocks must work
there, as well.
But more relevantly, this does not imply that _processors_ "own"
spinlocks in the way you have asserted. That is done by
threads. A thread does not magically "turn into" a processor
when it acquires a spinlock; it remains a thread, albeit one
that is now holding a spinlock that it wasn't before.
I've pointed you at the sections of those books which repeatedly refer to
them as "processor spinlocks".
Rus12 on pg 181 even shows a WNT kernel
debugger command in "EXPERIMENT: Viewing Global Queued Spinlocks" which
*displays a list of queued spinlocks held by each processor*.
That list is stored in the Processor Control Region PCR,
a per-processor status block WNT maintains.
If that is not enough to convince you, nothing will.
I've tried my best to explain it and pointed you to the text that supports
this but you remain unconvinced and this is just going around in circles
and using up both of our time. As your interpretation works for you and
mine works for me, we'll just have to agree to disagree.