• Re: MSI interrupts

    From MitchAlsup1@21:1/5 to Robert Finch on Sun Mar 16 22:00:41 2025
    On Sun, 16 Mar 2025 21:35:29 +0000, Robert Finch wrote:

    On 2025-03-15 6:28 p.m., MitchAlsup1 wrote:
    On Sat, 15 Mar 2025 20:46:41 +0000, Robert Finch wrote:

    On 2025-03-15 1:46 p.m., MitchAlsup1 wrote:
    On Sat, 15 Mar 2025 2:10:48 +0000, Robert Finch wrote:
    ------------
    The SoC system is limited to 64 CPU cores by the use of a 6-bit core
    number. So, a 64-bit vector containing which cores to send the interrupt
    to could be used. A find-first-one across the responding cores could be
    used to mask things so that only one CPU sees the interrupt.

    Consider that you have a pool of 4 cores setup to receive interrupts.
    That those 4 cores are running at differing priorities and the interrupt
    is at a still different priority. You want the core operating at the
    lowest
    priority (with the right software stack) to accept the interrupt !!

    Okay, I wrote a naive hardware filter for this which should work okay
    for small numbers of CPUs but does not scale up very well.
    What happens if there is no core ready? Place the IRQ back into the
    queue (regenerate the IRQ as an IRQ miss IRQ)? Or just wait for an
    available core. I do not like the idea of waiting as it could stall the system.
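
    A minimal C sketch of that kind of naive filter (all names here are
    illustrative, not Q+ signals): it scans the pool for the enabled core
    running at the lowest priority below the interrupt's priority, and
    returns -1 when no core is ready, in which case the IRQ would be
    re-queued (e.g. as an "IRQ miss") rather than stalling the system.

    #include <stdint.h>
    #include <stdbool.h>

    #define NCORES 64

    /* Per-core state the filter samples. */
    typedef struct {
        uint8_t running_prio;   /* priority the core is currently running at */
        bool    irq_enabled;    /* core is willing to take interrupts        */
    } core_state_t;

    /* Pick the enabled core in 'pool' (bit i = core i) running at the lowest
     * priority that is still below the interrupt's priority; -1 if none. */
    int select_target_core(const core_state_t cores[NCORES],
                           uint64_t pool, uint8_t irq_prio)
    {
        int best = -1;
        uint8_t best_prio = 0xFF;
        for (int i = 0; i < NCORES; i++) {
            if (!(pool & (1ULL << i)) || !cores[i].irq_enabled)
                continue;
            if (cores[i].running_prio < irq_prio &&
                cores[i].running_prio < best_prio) {
                best = i;
                best_prio = cores[i].running_prio;
            }
        }
        return best;
    }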

    I can't talk about this for another week or so.

    To identify the CPU pool, with a max of 63 CPUs per interrupt controller,

    Terminology:: My 66000 has 1 interrupt controller (per chip) that can
    process MSI-X messages for an unbounded number of interrupt tables and
    an unbounded number of cores. Each interrupt table is associated with a
    Guest OS (VM) or HyperVisor (VMM) or ...

    The only limitation is the size of the interrupt aperture--

    OH, and BTW, I use the word aperture rather than range as the range
    may be sparse, with an unbounded number of holes of unbounded length in
    the range {begin..end}. And the service port only services things
    in the aperture, not necessarily everything in the range.

    the pool could be represented directly as a 64-bit word in the interrupt
    vector table. I have been thinking of using a 256-bit vector entry
    instead of 128 bits. I allocated 96 bits for an instruction / address
    field.

    I think you are drifting towards "affinity" (and a flat version of affinity itself). Not a wrong way to go, but fraught with peril.

    You might be better off heading into a vector of 64-bit containers
    which specify participation in one or more affinity groups per core.
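
    A small C sketch of that shape (the layout is illustrative, not My 66000's):
    one 64-bit container per core, with bit g set when the core participates
    in affinity group g; the pool for a group is then just the cores whose
    container has that bit set.

    #include <stdint.h>

    #define NCORES  64
    #define NGROUPS 64

    uint64_t core_groups[NCORES];   /* bit g = core participates in group g */

    /* Build the pool of cores in a given affinity group; the result can be
     * fed to a lowest-priority filter like the one sketched earlier. */
    uint64_t cores_in_group(unsigned group)
    {
        uint64_t pool = 0;
        for (int i = 0; i < NCORES; i++)
            if (core_groups[i] & (1ULL << group))
                pool |= 1ULL << i;
        return pool;
    }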

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sun Mar 23 16:48:58 2025
    On Sun, 23 Mar 2025 16:16:16 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 20 Mar 2025 12:50:09 +0000, Dan Cross wrote:

    In article <19ad12aadf9a22b760487418c871b3c6@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    But let me turn it around, what real-world problem does this
    interrupt prioritization scheme, coupled with the inability
    to disable interrupts, solve?

    The OS does not have to "walk" its schedule queues in order to
    context switch.

    The SVR (SuperVisor Return) instruction checks the priority of
    the thread that had been interrupted, and if it is of lower
    priority than what is pending in the interrupt table, the
    one on the interrupt table receives control while the lower
    priority thread remains interrupted.

    Thus, the return path from an ISR, just does an SVR. If a
    DPC/softIRQ has been setup, or another interrupt has become
    pending, HW chooses the path--acting as if HW returned control
    to the interrupted thread, and then immediately took the inter-
    rupt of highest priority still pending--without having executed
    a single instruction in the interrupted thread.

    Is anyone complaining that the
    ability to disable interrupts is somehow a serious burden?

    This thread started with the notion of MSI interrupts and the
    inherent need to DI upon arrival at ISR dispatcher. This sub-
    thread is investigating whether one can perform ISRs without
    disabling interrupts. Thus, making IDisablement an unnecessary
    part of interrupt servicing itself.

    Your (and EricP's) contribution is to annotate those instances
    where one can arrive at an ISR IE and still want/need to DI and EI
    independent of the control transfer from and back to interrupted
    thread (even if control arrives reentrantly from interrupted thread).

    Supervisor return, whatever it is called - sysret, rei, svr -
    is traditionally another point where a number of OS race conditions
    can force it to briefly disable interrupts.

    The classic case is when a DPC/SoftIrq may be posted by an Interrupt
    Service Routine (ISR) and causes the First Level Interrupt Handler
    (FLIH),
    which is the ISR nested first on the stack of priority interrupts,
    to exit into the OS rather than return to its prior thread.

    The race condition occurs between the point when the FLIH ISR epilogue
    checks for a pending DPC, and not finding one decides to return
    to the prior thread. Between that check and the SVR instruction a higher
    priority interrupt occurs and posts a DPC.

    I remove that race condition by removing the need for the FLIH (which I
    call the dispatcher) to check. All the checks are done AT (and during)
    the SVR instruction.

    Another race condition can be how does an epilogue know if it is a FLIH
    or nested one? A counter for each core incremented by the ISR prologue
    and checked by the epilogue would do but what if the nested interrupt
    catches a lower level prologue just before it increments the counter.

    How to know:: It COULD look at the value in its stack pointer. An
    empty stack implies it is leaving, a non-empty stack implies it
    has more to do.

    But in practice, it does not need to know as interrupts are properly
    nested and serviced in priority order.

    There are a number of non-interruptible checks and control flag changes
    that supervisor return must perform.

    To do this I have a single instruction, CSVR (Conditional Supervisor
    Return), which handles all interrupt, exception, and SysCall returns.
    It performs a model-specific test on an interrupt control register and,
    if True, changes the ICR state and returns to the prior context. If
    False, it makes other ICR state changes and falls through the CSVR
    instruction, where code can determine what just changed. This would
    typically be a jump into the OS Dispatcher to flush any DPC/SoftIrq and
    check for thread rescheduling. This whole return sequence is performed
    without having to disable interrupts.

    I, too, have a single instruction (SVR) that delivers control to the
    highest priority still pending thread, regardless of whether it is
    a new interrupt, a scheduled DPC/softIRQ, a local timer, an IPI, or
    the thread the interrupt stack was placed upon.

    SVR looks at the priority of the thread to receive control, and the
    priorities of pending 'interrupts' (of which DPC/softIRQ and IPIs
    are part) and transfers control to the highest priority pending thread--occasionally that thread is the interrupted one.

    Should a higher priority thread become pending while SVR is in progress,
    the higher priority thread will receive control before the first inst-
    ruction of the mis-receiving thread is performed.

    As far as you have explained these things to me, I see no race
    conditions
    in the architected means.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Sun Mar 23 12:16:16 2025
    MitchAlsup1 wrote:
    On Thu, 20 Mar 2025 12:50:09 +0000, Dan Cross wrote:

    In article <19ad12aadf9a22b760487418c871b3c6@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    But let me turn it around, what real-world problem does this
    interrupt prioritization scheme, coupled with the inability
    to disable interrupts, solve?

    The OS does not have to "walk" its schedule queues in order to
    context switch.

    The SVR (SuperVisor Return) instruction checks the priority of
    the thread that had been interrupted, and if it is of lower
    priority than what is pending in the interrupt table, the
    one on the interrupt table receives control while the lower
    priority thread remains interrupted.

    Thus, the return path from an ISR, just does an SVR. If a
    DPC/softIRQ has been setup, or another interrupt has become
    pending, HW chooses the path--acting as if HW returned control
    to the interrupted thread, and then immediately took the inter-
    rupt of highest priority still pending--without having executed
    a single instruction in the interrupted thread.

    Is anyone complaining that the
    ability to disable interrupts is somehow a serious burden?

    This thread started with the notion of MSI interrupts and the
    inherent need to DI upon arrival at ISR dispatcher. This sub-
    thread is investigating whether one can perform ISRs without
    disabling interrupts. Thus, making IDisablement an unnecessary
    part of interrupt servicing itself.

    Your (and EricP's) contribution is to annotate those instances
    where one can arrive at an ISR IE and still want/need to DI and EI independent of the control transfer from and back to interrupted
    thread (even if control arrives reentrantly from interrupted thread).

    Supervisor return, whatever it is called - sysret, rei, svr -
    is traditionally another point where a number of OS race conditions
    can force it to briefly disable interrupts.

    The classic case is when a DPC/SoftIrq may be posted by an Interrupt
    Service Routine (ISR) and causes the First Level Interrupt Handler (FLIH), which is the ISR nested first on the stack of priority interrupts,
    to exit into the OS rather than return to its prior thread.

    The race condition occurs between the point when the FLIH ISR epilogue
    checks for a pending DPC, and not finding one decides to return
    to the prior thread. Between that check and the SVR instruction a higher
    priority interrupt occurs and posts a DPC.
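
    In code, the window looks something like this (a generic sketch, not any
    particular OS's epilogue; dpc_pending and the two calls are stand-ins):

    extern volatile int dpc_pending;   /* set by a higher-priority ISR        */
    void os_dispatcher(void);          /* exit path into the OS               */
    void supervisor_return(void);      /* stand-in for SVR/IRET/REI           */

    void flih_epilogue(void)
    {
        if (dpc_pending) {             /* (1) check for a pending DPC         */
            os_dispatcher();
            return;
        }
        /* <-- race window: a higher-priority interrupt taken right here can
         *     post a DPC that this path never notices ...                    */
        supervisor_return();           /* (2) ... yet we return to the prior
                                        *     thread anyway.  The classic fix
                                        *     is to disable interrupts around
                                        *     (1) and (2).                    */
    }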

    Another race condition can be how does an epilogue know if it is a FLIH
    or nested one? A counter for each core incremented by the ISR prologue
    and checked by the epilogue would do but what if the nested interrupt
    catches a lower level prologue just before it increments the counter.

    There are a number of non-interruptible checks and control flag changes
    that supervisor return must perform.

    To do this I have a single instruction, CSVR (Conditional Supervisor
    Return), which handles all interrupt, exception, and SysCall returns.
    It performs a model-specific test on an interrupt control register and,
    if True, changes the ICR state and returns to the prior context. If
    False, it makes other ICR state changes and falls through the CSVR
    instruction, where code can determine what just changed. This would
    typically be a jump into the OS Dispatcher to flush any DPC/SoftIrq and
    check for thread rescheduling. This whole return sequence is performed
    without having to disable interrupts.
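
    As a sketch, the return path built around such an instruction could look
    like this (the __csvr() intrinsic and the helper functions are made-up
    stand-ins, not a real API):

    void restore_registers(void);   /* reload the saved register set          */
    int  __csvr(void);              /* stand-in for the CSVR instruction      */
    void os_dispatcher(void);       /* flush DPC/SoftIrq, check rescheduling  */

    void interrupt_exit(void)
    {
        restore_registers();
        __csvr();            /* if the ICR test passes, control goes back to
                                the prior context and never reaches the next
                                line; otherwise the ICR is updated and we
                                fall through ...                              */
        os_dispatcher();     /* ... into the OS Dispatcher                    */
    }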

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Mon Mar 17 13:38:12 2025
    Robert Finch <robfi680@gmail.com> writes:

    <please trim posts>


    Consider that you have a pool of 4 cores setup to receive interrupts.
    That those 4 cores are running at differing priorities and the interrupt
    is at a still different priority. You want the core operating at the
    lowest
    priority (with the right software stack) to accept the interrupt !!

    Okay, I wrote a naive hardware filter for this which should work okay
    for small numbers of CPUs but does not scale up very well.
    What happens if there is no core ready? Place the IRQ back into the
    queue (regenerate the IRQ as an IRQ miss IRQ)? Or just wait for an
    available core. I do not like the idea of waiting as it could stall the system.

    Use a FIFO to hold up to N pending IRQs. Define a signal that asserts when
    the FIFO is non-empty. The CPU can mask the signal to prevent
    interruption; when the signal is unmasked the CPU pops the
    first IRQ from the FIFO. Or use a bitmap in flops or SRAM
    (prioritization happens in the logic that asserts the fact
    that an interrupt is pending to the CPU).

    Choose whether you want a FIFO/bitmap per target CPU, or global
    pending data with the logic to select highest priority pending
    IRQ when the CPU signals that interrupts are not masked.
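
    A small C model of the FIFO variant (purely illustrative; the real thing
    would be flops or SRAM plus a little logic):

    #include <stdint.h>
    #include <stdbool.h>

    #define FIFO_DEPTH 64              /* N pending IRQs                      */

    typedef struct {
        uint16_t irq[FIFO_DEPTH];
        unsigned head, tail, count;
        bool     masked;               /* CPU has masked the non-empty signal */
    } irq_fifo_t;

    bool fifo_nonempty(const irq_fifo_t *f) { return f->count != 0; }

    /* Source side: post an IRQ; returns false when the FIFO is full. */
    bool fifo_post(irq_fifo_t *f, uint16_t irq)
    {
        if (f->count == FIFO_DEPTH) return false;
        f->irq[f->tail] = irq;
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        f->count++;
        return true;
    }

    /* CPU side: when unmasked and the non-empty signal is asserted, pop the
     * first pending IRQ; returns -1 if masked or empty. */
    int fifo_take(irq_fifo_t *f)
    {
        if (f->masked || !fifo_nonempty(f)) return -1;
        int irq = f->irq[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        f->count--;
        return irq;
    }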

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Mon Mar 17 16:11:32 2025
    On Mon, 17 Mar 2025 13:38:12 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Robert Finch <robfi680@gmail.com> writes:

    <please trim posts>


    Consider that you have a pool of 4 cores setup to receive
    interrupts. That those 4 cores are running at differing priorities
    and the interrupt is at a still different priority. You want the
    core operating at the lowest
    priority (with the right software stack) to accept the interrupt
    !!

    Okay, I wrote a naive hardware filter for this which should work
    okay for small numbers of CPUs but does not scale up very well.
    What happens if there is no core ready? Place the IRQ back into the
    queue (regenerate the IRQ as an IRQ miss IRQ)? Or just wait for an
    available core. I do not like the idea of waiting as it could stall
    the system.

    Use a FIFO to hold up to N pending IRQs. Define a signal that asserts
    when the FIFO is non-empty. The CPU can mask the signal to prevent
    interruption; when the signal is unmasked the CPU pops the
    first IRQ from the FIFO. Or use a bitmap in flops or SRAM
    (prioritization happens in the logic that asserts the fact
    that an interrupt is pending to the CPU).

    Choose whether you want a FIFO/bitmap per target CPU, or global
    pending data with the logic to select highest priority pending
    IRQ when the CPU signals that interrupts are not masked.

    The problem Robert is talking about arises when there are many
    interrupt sources and many target CPUs.
    The required routing/prioritization/acknowledgment logic (at least the
    naive logic I have in mind) would be either non-scalable or
    relatively complicated. The selection process in the second case would
    take multiple cycles (I am thinking of a ring).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Mon Mar 17 15:44:30 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 17 Mar 2025 13:38:12 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Robert Finch <robfi680@gmail.com> writes:

    <please trim posts>


    Consider that you have a pool of 4 cores setup to receive
    interrupts. That those 4 cores are running at differing priorities
    and the interrupt is at a still different priority. You want the
    core operating at the lowest
    priority (with the right software stack) to accept the interrupt
    !!

    Okay, I wrote a naive hardware filter for this which should work
    okay for small numbers of CPUs but does not scale up very well.
    What happens if there is no core ready? Place the IRQ back into the
    queue (regenerate the IRQ as an IRQ miss IRQ)? Or just wait for an
    available core. I do not like the idea of waiting as it could stall
    the system.

    Use a FIFO to hold up to N pending IRQs. Define a signal that asserts
    when the FIFO is non-empty. The CPU can mask the signal to prevent
    interruption; when the signal is unmasked the CPU pops the
    first IRQ from the FIFO. Or use a bitmap in flops or SRAM
    (prioritization happens in the logic that asserts the fact
    that an interrupt is pending to the CPU).

    Choose whether you want a FIFO/bitmap per target CPU, or global
    pending data with the logic to select highest priority pending
    IRQ when the CPU signals that interrupts are not masked.

    The problem Robert is talking about arises when there are many
    interrupt sources and many target CPUs.

    I work with SoCs with many (> 64) CPUs and several hundred
    interrupt sources, with 10s to 100s of individual vectors
    per source. We use the ARM GICv3 as the interrupt
    controller, which provides the necessary logic to support
    individual enables, priorities, security state and target
    routing information for each IRQ. It's a fairly complex IP block when
    support for virtualization is included. Many routing
    and configuration tables are kept in DRAM allocated by the
    hypervisor or bare-metal OS and cached as necessary. There
    is a class of interrupt for which the configuration information
    is kept in flops (SPI) that can be used for a limited number
    of latency sensitive interrupts and for level-sensitive interrupts.

    The required routing/prioritization/acknowledgment logic (at least a
    naive logic I am having in mind) would be either non-scalable or
    relatively complicated.

    Indeed, to provide the full set of interrupt routing capabilities
    is not something that can be done with naive logic.

    For a scalable design, it's useful to consider how the logic
    will scale as the core counts and device counts increase with
    time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Mon Mar 17 15:53:42 2025
    Robert Finch <robfi680@gmail.com> writes:
    On 2025-03-17 9:38 a.m., Scott Lurndal wrote:
    Robert Finch <robfi680@gmail.com> writes:

    Been reading up on the RISCV AIA.

    For RISCV the CPU core handling the interrupt is indirectly specified by
    the I/O address set in the device as there is an IMSIC for each RISCV
    hart. So only a single IMSIC would be responding at that address. It
    would be up to software to set the I/O address to pick a particular CPU
    core. It seems to me that MSI interrupt handling using AIA would be
    heavy-weight. Having a controller for every hart could be a lot of logic
    if there are a lot of harts. There may be more than one hart per CPU.

    Reading that the I/O addresses need to be translated by an IOMMU. In Q+
    since an interrupt controller number is being used instead of an I/O
    address there is no need to translate the address. Software needs to
    determine which group of cores will be handling the interrupt, and
    select an interrupt controller appropriately.

    The Q+ controller looks a bit like an IMSIC controller as it has pending
    and enable bits. I note that there is an enable bit in the vector table
    and an enable bit array. So, I assume both bits must be enabled for an
    interrupt to occur.
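
    A minimal C model of that "both enables" reading (field names are guesses,
    not the actual Q+ or IMSIC layout): an interrupt can be taken only when it
    is pending, its bit in the enable array is set, and the enable bit in its
    vector table entry is set.

    #include <stdint.h>
    #include <stdbool.h>

    #define NVEC 256

    typedef struct {
        bool     enable;            /* enable bit in the vector table entry */
        uint8_t  priority;
        uint64_t handler;           /* instruction / address field          */
    } vec_entry_t;

    uint64_t    pending[NVEC / 64];
    uint64_t    enabled[NVEC / 64]; /* enable bit array                     */
    vec_entry_t vtab[NVEC];

    static bool bit(const uint64_t *a, unsigned v)
    {
        return (a[v / 64] >> (v % 64)) & 1;
    }

    bool can_deliver(unsigned v)
    {
        return bit(pending, v) && bit(enabled, v) && vtab[v].enable;
    }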



    Here's a copy of a slide I did a decade ago for ARM64 (GICv3) interrupts,
    not much has changed since.

    Interrupt Gates

    There are a number of gates that must be open to pass an interrupt
    from the source to the destination Core

    • Device Interrupt Enable Register (e.g. A NIC receive interrupt enable)
    • MSI-X Vector Enable (mask bit in each MSI-X vector control register)
    • MSI-X PCI-Express Capability Enable (enable bit and function mask in MSI-X
      Capability message control field)
    • The PCI CMD register must have bus mastering enabled
    • GICv3 Distributor/Redistributor/LPI Properties per-interrupt enable bit
    • GICv3 Group Enable in distributor (GICD_CTLR)
    • GICv3 Group Enable in Target core (ICC_IGRPEN{01}_EL1)
    • Priority Mask register and current running priority on target core
    • ARM64 core interrupt signal Mask (PSTATE.A/I/F)

    Interrupt group, in this context, defines the security state of
    the interrupt (group 0: secure, group 1: non-secure) and is a
    configuration bit associated with each interrupt.
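
    In software-model terms the whole path is just a conjunction of those
    gates; a sketch with illustrative field names (the real state lives in
    the device, the GIC and the core, not in one struct):

    #include <stdbool.h>

    struct irq_path {
        bool dev_int_enable;        /* device interrupt enable register       */
        bool msix_vector_unmasked;  /* per-vector mask bit clear              */
        bool msix_cap_enabled;      /* MSI-X capability enabled, not masked   */
        bool bus_master_enabled;    /* PCI CMD register                       */
        bool gic_per_irq_enable;    /* distributor/redistributor/LPI enable   */
        bool gicd_group_enable;     /* GICD_CTLR group enable                 */
        bool core_group_enable;     /* ICC_IGRPEN{0,1}_EL1                    */
        bool above_priority_mask;   /* beats PMR and current running priority */
        bool pstate_unmasked;       /* PSTATE.A/I/F not masking this class    */
    };

    bool interrupt_delivered(const struct irq_path *p)
    {
        return p->dev_int_enable && p->msix_vector_unmasked
            && p->msix_cap_enabled && p->bus_master_enabled
            && p->gic_per_irq_enable && p->gicd_group_enable
            && p->core_group_enable && p->above_priority_mask
            && p->pstate_unmasked;
    }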

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Michael S on Mon Mar 17 14:33:09 2025
    Michael S wrote:
    On Mon, 17 Mar 2025 13:38:12 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Robert Finch <robfi680@gmail.com> writes:

    <please trim posts>

    Consider that you have a pool of 4 cores setup to receive
    interrupts. That those 4 cores are running at differing priorities
    and the interrupt is at a still different priority. You want the
    core operating at the lowest
    priority (with the right software stack) to accept the interrupt
    !!
    Okay, I wrote a naive hardware filter for this which should work
    okay for small numbers of CPUs but does not scale up very well.
    What happens if there is no core ready? Place the IRQ back into the
    queue (regenerate the IRQ as an IRQ miss IRQ)? Or just wait for an
    available core. I do not like the idea of waiting as it could stall
    the system.
    Use a FIFO to hold up to N pending IRQs. Define a signal that asserts
    when the FIFO is non-empty. The CPU can mask the signal to prevent
    interruption; when the signal is unmasked the CPU pops the
    first IRQ from the FIFO. Or use a bitmap in flops or SRAM
    (prioritization happens in the logic that asserts the fact
    that an interrupt is pending to the CPU).

    Choose whether you want a FIFO/bitmap per target CPU, or global
    pending data with the logic to select highest priority pending
    IRQ when the CPU signals that interrupts are not masked.

    The problem Robert is talking about arises when there are many
    interrupt sources and many target CPUs.
    The required routing/prioritization/acknowledgment logic (at least the
    naive logic I have in mind) would be either non-scalable or
    relatively complicated. The selection process in the second case would
    take multiple cycles (I am thinking of a ring).

    Another problem is what does the core do with the in flight instructions.

    Method 1 is simplest, it injects the interrupt request at Retire
    as that's where the state of everything is synchronized.
    The consequence is that, like exceptions, the in flight instructions all
    get purged, and we save the committed RIP, RSP and interrupt control word. While that might be acceptable for a 5 stage in-order pipeline,
    it could be pretty expensive for an OoO 200+ instruction queue
    potentially tossing hundreds of cycles of near finished work.

    Method 2 pipelines the switch by injecting the interrupt request at Fetch. Decode converts the request to a special uOp that travels down the IQ
    to Retire and allows all the older work to complete.
    This is more complex as it requires a two phase hand-off from the
    Interrupt Control Unit (ICU) to the core as a branch mispredict in the
    in flight instructions might cause a tentative interrupt acceptance to later
    be withdrawn.

    The ICU believes the core is in a state to accept a higher priority
    interrupt. It sends a request to the core, which checks its current state and
    sends back an immediate INT_ACK if it _might_ accept and stalls Fetch, or a
    NAK. When the special uOp reaches Retire, it sends a signal to Fetch which
    then sends an INT_ACCEPT signal to ICU to complete the handoff.
    If a branch mispredict occurs that causes interrupts to be disabled,
    then Fetch sends an INT_REJECT to ICU, and unstalls its fetching.
    (Yes that is not optimal - make it work first, make it work well second.)
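
    A rough state-machine sketch of that two-phase handoff, from the core's
    side (the message names follow the text; everything else is illustrative):

    typedef enum { INT_ACK, INT_NAK, INT_ACCEPT, INT_REJECT } icu_msg_t;
    typedef enum { CORE_IDLE, CORE_TENTATIVE } core_int_state_t;

    /* ICU asks whether the core can take an interrupt of priority 'prio'. */
    icu_msg_t core_on_int_req(core_int_state_t *st, int cur_prio, int prio,
                              int ints_enabled)
    {
        if (!ints_enabled || prio <= cur_prio)
            return INT_NAK;          /* cannot take it right now            */
        *st = CORE_TENTATIVE;        /* stall Fetch, let older work retire  */
        return INT_ACK;              /* "might accept"                      */
    }

    /* Either the special uOp reached Retire, or a mispredict re-disabled
     * interrupts before it got there.                                      */
    icu_msg_t core_on_resolve(core_int_state_t *st, int still_enabled)
    {
        *st = CORE_IDLE;             /* unstall Fetch either way            */
        return still_enabled ? INT_ACCEPT : INT_REJECT;
    }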

    This also raises a question about what the ICU is doing during this
    long latency handoff. One wouldn't want ICU to sit idle so it might
    have to manage the handoff of multiple interrupts to multiple cores
    at the same time, each as its own little state machine.

    One should see that this decision on how the core handles the
    handoff has a large impact on the design complexity of the ICU.

    Personally, I would use method 1 first to get something working
    then if this is OoO think about getting fancy with method 2.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Mar 17 19:06:37 2025
    On Mon, 17 Mar 2025 15:53:42 +0000, Scott Lurndal wrote:

    Robert Finch <robfi680@gmail.com> writes:

    Here's a copy of a slide I did a decade ago for ARM64 (GICv3)
    interrupts, not much has changed since.

    Interrupt Gates

    There are a number of gates that must be open to pass an interrupt
    from the source to the destination Core

    Here are the enablements in My 66000 {borrowing Scott's list}

    • Device Interrupt Enable Register (e.g. A NIC receive interrupt enable)
    • MSI-X Vector Enable (mask bit in each MSI-X vector control register)
    • MSI-X PCI-Express Capability Enable (enable bit and function mask in
      MSI-X Capability message control field)
    • The PCI CMD register must have bus mastering enabled
    • Interrupt table must be enabled to receive this priority
    • Interrupt table must be enabled to deliver this priority
    • Core priority must be lower than interrupt priority

    Although there is an interrupt enable bit in each core, when control
    arrives at the dispatcher the core is already re-entrant, at the proper SW
    stack level and priority, with the old register file saved in memory
    and the register file of that SW stack level loaded. So, many/most
    taken interrupts do not need to disable interrupts even for a short
    period of time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Mon Mar 17 21:49:18 2025
    On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:

    Michael S wrote:
    On Mon, 17 Mar 2025 13:38:12 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:
    ------------------

    The problem Robert is talking about arises when there are many
    interrupt sources and many target CPUs.
    The required routing/prioritization/acknowledgment logic (at least the
    naive logic I have in mind) would be either non-scalable or
    relatively complicated. The selection process in the second case would
    take multiple cycles (I am thinking of a ring).

    Another problem is what does the core do with the in flight
    instructions.

    Method 1 is simplest, it injects the interrupt request at Retire
    as that's where the state of everything is synchronized.
    The consequence is that, like exceptions, the in flight instructions all
    get purged, and we save the committed RIP, RSP and interrupt control
    word.
    While that might be acceptable for a 5 stage in-order pipeline,
    it could be pretty expensive for an OoO 200+ instruction queue
    potentially tossing hundreds of cycles of near finished work.

    Lowest interrupt Latency
    Highest waste of power (i.e., work)

    Method 2 pipelines the switch by injecting the interrupt request at
    Fetch.
    Decode converts the request to a special uOp that travels down the IQ
    to Retire and allows all the older work to complete.
    This is more complex as it requires a two phase hand-off from the
    Interrupt Control Unit (ICU) to the core as a branch mispredict in the
    in flight instructions might cause a tentative interrupt acceptance to
    later be withdrawn.

    Interrupt latency is dependent on executing instructions,
    Lowest waste of power

    But note: In most cases, it already took the interrupt ~150 nanoseconds
    to arrive at the interrupt service port. 1 trip from device to DRAM
    (possibly serviced by L3), 1 trip from DRAM back to device, 1 trip from
    device to interrupt service port; and 4 DRAM (or L3) accesses to log
    the interrupt into the table.

    Also, in most cases, the 200-odd instructions in the window will finish
    in 100 cycles or as little as 20ns--but if the FDIV unit is saturated,
    interrupt latency could be as high as 640 cycles and as long as 640ns.

    The ICU believes the core is in a state to accept a higher priority
    interrupt. It sends a request to the core, which checks its current state
    and sends back an immediate INT_ACK if it _might_ accept and stalls Fetch,
    or a NAK.

    In My 66000, ICU knows nothing about the priority level (or state)
    of any core in the system. Instead, when a new higher priority
    interrupt is raised, the ISP broadcasts a 64-bit mask indicating
    which priority levels in the interrupt table have pending inter-
    rupts with an MMI/O message to the address of the interrupt table.

    All cores monitoring that interrupt table capture the broadcast,
    and each core decides whether to negotiate for an interrupt (not that
    particular one) by requesting the highest priority interrupt from the table.

    When the request returns, and it is still at a higher priority
    than the core is running, core performs interrupt control transfer.
    If the interrupt is below the core's priority it is returned to
    ISP as if NAKed.

    Prior to interrupt control transfer, core remains running what-
    ever it was running--and all the interrupt stuff is done by state
    machines at the edge of the core and the L3/DRAM controller.
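
    A software model of the core-side negotiation described above (the real
    thing is a small state machine at the edge of the core; the isp_* calls
    here are hypothetical stand-ins for its messages):

    #include <stdint.h>

    extern int core_priority;                /* priority the core runs at      */
    int  isp_claim_highest(int level);       /* ask the table for its highest
                                                pending interrupt; <0 = none   */
    void isp_return_unclaimed(int handle);   /* hand it back, as if NAKed      */
    void take_interrupt(int handle);         /* interrupt control transfer     */

    void on_priority_broadcast(uint64_t pending_levels)
    {
        int highest = 63;
        while (highest >= 0 && !(pending_levels & (1ULL << highest)))
            highest--;
        if (highest <= core_priority)
            return;                          /* nothing outranks what we run   */

        int handle = isp_claim_highest(highest);
        if (handle < 0)
            return;                          /* someone else got there first   */
        /* Re-check: the core's priority may have risen while the request
         * was in flight. */
        if (highest > core_priority)
            take_interrupt(handle);
        else
            isp_return_unclaimed(handle);
    }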

    When the special uOp reaches Retire, it sends a signal to Fetch which
    then sends an INT_ACCEPT signal to ICU to complete the handoff.
    If a branch mispredict occurs that causes interrupts to be disabled,
    then Fetch sends an INT_REJECT to ICU, and unstalls its fetching.
    (Yes that is not optimal - make it work first, make it work well
    second.)

    This also raises a question about what the ICU is doing during this
    long latency handoff. One wouldn't want ICU to sit idle so it might
    have to manage the handoff of multiple interrupts to multiple cores
    at the same time, each as its own little state machine.

    One must assume that ISP is capable of taking a new interrupt
    from a device every 5-ish cycles and interrupt handoff is in the
    range of 50 cycles, and that each interrupt could be to a different
    interrupt table.

    My 66000 ISP treats successive requests to any one table as strongly
    ordered, and requests to different tables as completely unordered.

    One should see that this decision on how the core handles the
    handoff has a large impact on the design complexity of the ICU.

    I did not "see" that in My 66000's interrupt architecture. The ISP
    complexity is fixed, and the core's interrupt negotiator is a small
    state machine (~10-states).

    ISP essentially performs 4-5 64-bit memory accesses, and possibly
    1 MMI/O 64-bit broadcast on arrival of MSI-X interrupt. Then if
    a core negotiates, it performs 3 more memory accesses per negotiator.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Mar 18 01:34:40 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:

    Michael S wrote:
    On Mon, 17 Mar 2025 13:38:12 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:
    ------------------

    The problem Robert is talking about arises when there are many
    interrupt sources and many target CPUs.
    The required routing/prioritization/acknowledgment logic (at least the
    naive logic I have in mind) would be either non-scalable or
    relatively complicated. The selection process in the second case would
    take multiple cycles (I am thinking of a ring).

    Another problem is what does the core do with the in flight
    instructions.

    Method 1 is simplest, it injects the interrupt request at Retire
    as that's where the state of everything is synchronized.
    The consequence is that, like exceptions, the in flight instructions all
    get purged, and we save the committed RIP, RSP and interrupt control
    word.
    While that might be acceptable for a 5 stage in-order pipeline,
    it could be pretty expensive for an OoO 200+ instruction queue
    potentially tossing hundreds of cycles of near finished work.

    Lowest interrupt Latency
    Highest waste of power (i.e., work)

    Method 2 pipelines the switch by injecting the interrupt request at
    Fetch.
    Decode converts the request to a special uOp that travels down the IQ
    to Retire and allows all the older work to complete.
    This is more complex as it requires a two phase hand-off from the
    Interrupt Control Unit (ICU) to the core as a branch mispredict in the
    in flight instructions might cause a tentative interrupt acceptance to
    later be withdrawn.

    Interrupt latency is dependent on executing instructions,
    Lowest waste of power

    But note: In most cases, it already took the interrupt ~150 nanoseconds
    to arrive at the interrupt service port. 1 trip from device to DRAM
    (possibly serviced by L3), 1 trip from DRAM back to device, 1 trip from
    device to interrupt service port; and 4 DRAM (or L3) accesses to log
    the interrupt into the table.

    How do you handle timer interrupts? Those are generally not
    message based, but rather internal signals within the core that
    raise an interrupt (perhaps coordinating with the interrupt controller
    for priority, security state, and enable bit). These are generally
    wired (and the enables/state/priority are flops) for low latency.

    There will generally be at least one general purpose timer per
    core plus any watchdog timers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Mon Mar 24 12:37:13 2025
    In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 20 Mar 2025 20:25:59 +0000, Dan Cross wrote:
    In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
    In article <af2d54a7c6c694bf18bcca6e6876a72b@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:
    In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    So that you can have some mechanism for building critical
    sections that span multiple instructions, where just having
    access to atomics and spin locks isn't enough.

    You can do this with priority, too.

    In the limit, disabling interrupts is just setting the priority
    mask to infinity, so in some sense this is de facto true. But I
    don't see how having a prioritization scheme is all that useful.
    Consider that when you're starting the system up, you need to
    do things like set up an interrupt handler vector, and so on;

    The Interrupt vector is completely software. HW delivers the MSI-X
    message and the service this message is for, and SW decides what to do
    with this message to that service provider. No HW vectoring, but HW does
    deliver all the bits needed to guide SW to its goal.

    Maybe I'm missing the context here, and apologies if I am. But
    which interrupt disable are we talking about here, and where? I

    We are talking about the core's Interrupt Disablement and nothing
    of the gates farther out in the system that control interrupt
    deliveries (or not).

    Ok, that's what I thought.

    am coming at this from the perspective of systems software
    running on some CPU core somewhere; if I want that software to
    enter a critical section that cannot be interrupted, then it
    seems to me that MSI-X delivery is just a detail at that point.
    I might want to prevent IPIs, or delivery or local interrupts
    attached directly to the CPU (or something like an IOAPIC or
    GIC distributor).

    All of those flow through the same mechanism; even a locally
    generated timer interrupt.

    Right. So whether those interrupts come from MSI-X or anything
    else is mostly irrelevant.

    [snip/]
    Sometimes you really don't want to be interrupted.

    And sometimes you don't want to be interrupted unless the
    "house is on fire"; I cannot see a time when "the house is
    on fire" that you would not want to take the interrupt.

    Is there one ?!?

    Consider a thread that takes a spinlock; suppose some
    high-priority interrupt comes in while the thread is holding
    that lock. In response to the interrupt, software decides to
    suspend the thread and switch some other thread; that thread
    wants to lock the spin lock that the now-descheduled thread is
    holding: a classic deadlock scenario.

    If we can s/taken/attempts/ in the first line of that paragraph::

    If it has merely attempted, then it's a completely different
    scenario. Something that is more transactional in nature is no
    longer just a simple CAS-based spinlock. It may be _better_,
    but it is definitely different.

    My architecture has a mechanism to perform ATOMIC stuff over multiple
    instruction time frames that has the property that when a higher priority
    thread interferes with a lower priority thread, the HPT wins
    and the LPT fails its ATOMIC event. It is for exactly this reason
    that I drag priority through the memory hierarchy--so that real-time
    things remain closer to real-time (no or limited priority inversions).

    Without being able to see this in practice, it's difficult to
    speculate as to how well it will actually work in real-world
    scenarios. What is the scope of what's covered by this atomic
    thing?

    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    In this case, I'm probably going to want to take a lock on the
    queue, then lock the list, then pop the first element off of the
    queue by taking a pointer to the head. Then I'll set the head
    to the next element of that item, then walk the list, finding
    the place to insert (make sure I'm keeping track of the "prev"
    pointer somehow), then do the dance to insert the item: set the
    item's next to the thing after where I'm inserting, set the
    prev's next to the item if it's not nil, or set the list head
    to the item. Then I unlock the list and then the queue (suppose
    for some reason that it's important that I hold the lock on the
    queue until the element is fully added to the list).
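
    A fairly literal rendering of that sequence in C, using simple
    test-and-set spinlocks (illustrative, not any particular kernel's code):

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct node { int key; struct node *next; } node_t;
    typedef atomic_flag spinlock_t;

    static void lock(spinlock_t *l)
    {
        while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
            ;                                /* spin                          */
    }
    static void unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(l, memory_order_release);
    }

    spinlock_t q_lock = ATOMIC_FLAG_INIT, list_lock = ATOMIC_FLAG_INIT;
    node_t *queue_head;                      /* queue of items                */
    node_t *list_head;                       /* list ordered by key           */

    void move_one(void)
    {
        lock(&q_lock);
        lock(&list_lock);

        node_t *item = queue_head;           /* pop the head of the queue     */
        if (item) {
            queue_head = item->next;

            node_t *prev = NULL, *cur = list_head;
            while (cur && cur->key < item->key) {  /* find insertion point    */
                prev = cur;
                cur = cur->next;
            }
            item->next = cur;                /* splice the item in            */
            if (prev) prev->next = item;
            else      list_head = item;
        }

        unlock(&list_lock);                  /* queue lock is held until the  */
        unlock(&q_lock);                     /* item is fully on the list     */
    }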

    This isn't complicated, and for a relatively small list it's not
    expensive. Depending on the circumstances, I'd feel comfortable
    doing this with only a spinlocks protecting both the queue and
    the list. But it's more than a two or three instructions, and
    doesn't feel like something that would fit well into the "may
    fail atomic" model.

    Mandating to SW that typical SW must use this mechanism will not fly,
    at least in the first several iterations of an OS for the system.
    So the std. methods must remain available.

    AND to a large extent this sub-thread is attempting to get a
    usefully large number of them annotated.

    It sounds like the whole model is fairly different, and would
    require software writers to consider interrupts occurring in
    places that wouldn't have to in a more conventional design. I
    suspect you'd have a fairly serious porting issue with existing
    systems, but maybe I misunderstand.

    A valid response here might be, "don't context switch from the
    interrupt handler; use a DPC instead". That may be valid, but
    it puts a constraint on the software designer that may be onerous
    in practice: suppose the interrupt happens in the scheduler,
    while examining a run queue or something. A DPC object must be
    available, etc.

    This seems to be onerous on SW mainly because of too many unknowns
    to track down and deal with.

    Yes.

    Further, software must now consider the complexity of
    potentially interruptable critical sections. From the
    standpoint of reasoning about already-complex concurrency issues
    it's simpler to be able to assert that (almost) all interrupt
    delivery can be cheaply disabled entirely, save for very
    special, specific, scenarios like NMIs. Potentially switching
    away from a thread holding a spinlock sort of defeats the
    purpose of a spinlock in the first place, which is a mutex
    primitive designed to avoid the overhead of switching.

    Agreed.

    The idea of priority as described above seems to prize latency
    above other considerations, but it's not clear to me that that
    is the right tradeoff. Exactly what problem is being solved?

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Dan Cross on Mon Mar 24 13:59:49 2025
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:


    My architecture has a mechanism to perform ATOMIC stuff over multiple
    instruction time frames that has the property that when a higher priority
    thread interferes with a lower priority thread, the HPT wins
    and the LPT fails its ATOMIC event. It is for exactly this reason
    that I drag priority through the memory hierarchy--so that real-time
    things remain closer to real-time (no or limited priority inversions).

    Without being able to see this in practice, it's difficult to
    speculate as to how well it will actually work in real-world
    scenarios. What is the scope of what's covered by this atomic
    thing?

    Sounds a lot like transactional memory. Something that has
    yet to prove to be usable in the general case.


    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    In this case, I'm probably going to want to take a lock on the
    queue, then lock the list, then pop the first element off of the
    queue by taking a pointer to the head.

    Which can be done with compare and exchange, atomically;
    a lock may not be necessary.

    The insert into an ordered list will likely require a
    spin lock (or a reader-writer lock of some form).
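
    The queue-pop half can indeed be a compare-exchange loop; a C11 sketch
    (note that a bare CAS pop like this is exposed to the classic ABA problem
    if nodes can be freed and reused concurrently, so real implementations
    add a generation count or hazard pointers):

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct node { int key; struct node *next; } node_t;

    _Atomic(node_t *) queue_head;

    node_t *queue_pop(void)
    {
        node_t *head = atomic_load(&queue_head);
        while (head &&
               !atomic_compare_exchange_weak(&queue_head, &head, head->next))
            ;                      /* 'head' is refreshed on failure          */
        return head;               /* NULL if the queue was empty             */
    }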

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Mon Mar 24 11:55:33 2025
    MitchAlsup1 wrote:
    On Sun, 23 Mar 2025 16:16:16 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 20 Mar 2025 12:50:09 +0000, Dan Cross wrote:

    In article <19ad12aadf9a22b760487418c871b3c6@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    But let me turn it around, what real-world problem does this
    interrupt prioritization scheme, coupled with the inability
    to disable interrupts, solve?

    The OS does not have to "walk" its schedule queues in order to
    context switch.

    The SVR (SuperVisor Return) instruction checks the priority of
    the thread that had been interrupted, and if it is of lower
    priority than what is pending in the interrupt table, the
    one on the interrupt table receives control while the lower
    priority thread remains interrupted.

    Thus, the return path from an ISR, just does an SVR. If a
    DPC/softIRQ has been setup, or another interrupt has become
    pending, HW chooses the path--acting as if HW returned control
    to the interrupted thread, and then immediately took the inter-
    rupt of highest priority still pending--without having executed
    a single instruction in the interrupted thread.

    Is anyone complaining that the
    ability to disable interrupts is somehow a serious burden?

    This thread started with the notion of MSI interrupts and the
    inherent need to DI upon arrival at ISR dispatcher. This sub-
    thread is investigating whether one can perform ISRs without
    disabling interrupts. Thus, making IDisablement an unnecessary
    part of interrupt servicing itself.

    Your (and EricP's) contribution is to annotate those instances
    where one can arrive at an ISR IE and still want/need to DI and EI
    independent of the control transfer from and back to interrupted
    thread (even if control arrives reentrantly from interrupted thread).

    Supervisor return, whatever it is called - sysret, rei, svr -
    is traditionally another point where a number of OS race conditions
    can force it to briefly disable interrupts.

    The classic case is when a DPC/SoftIrq may be posted by an Interrupt
    Service Routine (ISR) and causes the First Level Interrupt Handler
    (FLIH),
    which is the ISR nested first on the stack of priority interrupts,
    to exit into the OS rather than return to its prior thread.

    The race condition occurs between the point when the FLIH ISR epilogue
    checks for a pending DPC, and not finding one decides to return
    to the prior thread. Between that check and the SVR instruction a higher
    priority interrupt occurs and posts a DPC.

    I remove that race condition by removing the need for the FLIH (which I
    call the dispatcher) to check. All the checks are done AT (and during)
    the SVR instruction.

    Ok you have that but other platforms must deal with this in code.
    VAX used a real priority-1 (lowest) interrupt looped back to itself
    to request a software interrupt, eliminating this check.

    Intel APIC allows this kind of self interrupt loop back too but
    I don't know if any OS uses it for its internal scheduling.

    On platforms without something like this the OS has to deal
    with this race condition, which requires disabling interrupts.

    Also on PC's the 8259 PIC has an End Of Interrupt (EOI) race condition,
    and it looks like the APIC has the same one, which requires briefly
    disabling interrupts in the return path.

    Another race condition can be how does an epilogue know if it is a FLIH
    or nested one? A counter for each core incremented by the ISR prologue
    and checked by the epilogue would do but what if the nested interrupt
    catches a lower level prologue just before it increments the counter.

    How to know:: It COULD look at the value in its stack pointer. An
    empty stack implies it is leaving, a non-empty stack implies it
    has more to do.

    The stack pointer might be too dynamic for this.
    It can test if the return RIP is in a certain range.
    If it is inside a prologue you know it is nested.
    If inside the epilogue then it might have just caught the exit race
    so reset the return RIP to the start of the epilogue exit sequence.

    By arranging for all the prologues and epilogues to be at the highest
    addresses, a CMP and BGEU filters out the usual 99.999% of cases.

    But all of this adds to the return from interrupt code path length.

    But in practice, it does not need to know as interrupts are properly
    nested and serviced in priority order.

    It is not checking whether they are properly nested.
    Only the FLIH does the check whether to return or exit through the OS.
    That is to prevent Dispatcher reentrancy.

    And this check can be replaced by a priority-1 self interrupt capability
    to trap into the software interrupt dispatcher.

    There are a number of non-interruptible checks and control flag changes
    that supervisor return must perform.

    To do this I have a single instruction, CSVR (Conditional Supervisor
    Return), which handles all interrupt, exception, and SysCall returns.
    It performs a model-specific test on an interrupt control register and,
    if True, changes the ICR state and returns to the prior context. If
    False, it makes other ICR state changes and falls through the CSVR
    instruction, where code can determine what just changed. This would
    typically be a jump into the OS Dispatcher to flush any DPC/SoftIrq and
    check for thread rescheduling. This whole return sequence is performed
    without having to disable interrupts.

    I, too, have a single instruction (SVR) that delivers control to the
    highest priority still pending thread, regardless of whether it is
    a new interrupt, a scheduled DPC/softIRQ, a local timer, an IPI, or
    the thread the interrupt stack was placed upon.

    SVR looks at the priority of the thread to receive control, and the priorities of pending 'interrupts' (of which DPC/softIRQ and IPIs
    are part) and transfers control to the highest priority pending thread--occasionally that thread is the interrupted one.

    The new idea I'm proposing is CSVR *Conditional* Supervisor Return
    which can fall through in the unlikely event something disturbs the non-interruptible return sequence.

    This can be used to optimize the return path by avoiding returning from
    an interrupt only to immediately take a priority-1 software interrupt.

    Should a higher priority thread become pending while SVR is in progress,
    the higher priority thread will receive control before the first inst- ruction of the mis-receiving thread is performed.

    As far as you have explained these things to me, I see no race
    conditions
    in the architected means.

    Perhaps not for your approach but I'm looking at other optimizations
    that might introduce them. One was optimizing the common return path:

    Before a supervisor return starts to reload the saved register set
    it tests if a DPC is pending and about to generate a software interrupt,
    and if so jumps back into the Dispatcher. This saves reloading the registers only to immediately take another interrupt, a pipeline drain and fill,
    and resave the same registers.

    To do this optimization, before it starts to reload the registers
    it tests the Interrupt Control Register (ICR) to check if it is about to trigger a SWI and if so adjusts the ICR and jumps into the Dispatcher. Otherwise it restores registers and does a Conditional Supervisor Return
    which reperforms the same test and can fall through if the test fails.

    This also requires that kernel/interrupt stacks be switched a certain way, which just so happens to be the same as the x86 way. That is, hardware only switches stacks if interrupted mode is User. This leaves it up to OS
    software to decide which stack to use at what points inside the kernel.

    Also I'd save the interrupt context, the old RIP, old RSP and old interrupt
    status in hardware registers, not on the stack, and the interrupt vector
    RIP address is calculated based on the priority. This eliminates all memory
    access from the exception and interrupt entry and exit transition sequences,
    and thereby eliminates the possibility of page faults and memory errors,
    as well as cache access latency.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Mar 24 14:02:25 2025
    MitchAlsup1 [2025-03-24 17:36:28] wrote:
    On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:
    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    [...]
    I have almost that in the documentation for my ATOMIC stuff::
    The following code moves a doubly linked element from one place
    in a concurrent data structure to another without any interested
    3rd party being able to see that the element is now within the
    CDS at any time::

    Would your "ATOMIC stuff" really handle Dan's example, tho?
    A key element in Dan's example is the "ordered singly-linked list",
    i.e. you need to walk the list an unbounded number of steps until you
    can find the insertion point, so I don't think you can do the whole
    thing inside your ATOMIC stuff.

    IOW, you'd have to somehow move the list-walk outside of the atomic section. E.g. if you know elements are only ever removed from the tail of the
    list, you can presumably walk the list outside of the ATOMIC stuff and
    then do the "remove from the queue and insert into the list" atomically
    since it's a small&finite number of steps (with a test to loop back to
    the list-walk if someone else inserted or removed anything at that point
    in the mean time).
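
    A sketch of that shape in C (assuming, per the above, that nothing is
    removed from the part of the list being walked; short_atomic_begin/end
    are hypothetical stand-ins for the small bounded atomic event):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct node { int key; struct node *next; } node_t;

    extern node_t *queue_head, *list_head;
    void short_atomic_begin(void);
    void short_atomic_end(void);

    void move_one_optimistic(void)
    {
        for (;;) {
            node_t *item = queue_head;       /* snapshot, outside the event   */
            if (!item) return;

            /* Unbounded walk, done with no atomicity at all. */
            node_t *prev = NULL, *cur = list_head;
            while (cur && cur->key < item->key) { prev = cur; cur = cur->next; }

            bool ok;
            short_atomic_begin();
            /* Re-validate the snapshot inside the small, bounded event. */
            ok = (queue_head == item) &&
                 ((prev == NULL) ? (list_head == cur) : (prev->next == cur));
            if (ok) {
                queue_head = item->next;     /* pop from the queue            */
                item->next = cur;            /* splice into the list          */
                if (prev) prev->next = item; else list_head = item;
            }
            short_atomic_end();
            if (ok) return;                  /* otherwise loop back and
                                                re-walk, as suggested above   */
        }
    }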


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Mar 24 17:47:48 2025
    On Mon, 24 Mar 2025 13:59:49 +0000, Scott Lurndal wrote:

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:


    My architecture has a mechanism to perform ATOMIC stuff over multiple
    instruction time frames that has the property that when a higher priority
    thread interferes with a lower priority thread, the HPT wins
    and the LPT fails its ATOMIC event. It is for exactly this reason
    that I drag priority through the memory hierarchy--so that real-time
    things remain closer to real-time (no or limited priority inversions).

    Without being able to see this in practice, it's difficult to
    speculate as to how well it will actually work in real-world
    scenarios. What is the scope of what's covered by this atomic
    thing?

    Sounds a lot like transactional memory. Something that has
    yet to prove to be usable in the general case.

    It is not TM, but can be used to implement TM should you like.
    It is a set of instructions that can implement any (known)
    ATOMIC process {TestAndSet, TestAndTestAndSet; lock, unlock,
    LD-locked, ST-conditional, CompareSwap, CompareDouble, SwapDouble,
    CompareTriple, SwapTriple, RemoveElement, InsertElement,
    MoveElement, ...}

    It is basically a recipe book so SW can implement whatever kinds
    of ATOMIC things SW wants, leaving HW out of the loop.

    you can also consider it a fully pipelined version of LDL and STC
    with up to 8 cache lines of data available for the event.
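
    For concreteness, two of the listed operations written with C11 atomics
    (not the My 66000 encoding, which is not shown in this thread):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* TestAndTestAndSet-style lock acquire */
    void lock_acquire(atomic_int *lk)
    {
        for (;;) {
            while (atomic_load_explicit(lk, memory_order_relaxed) != 0)
                ;                            /* test before test-and-set      */
            int expected = 0;
            if (atomic_compare_exchange_weak_explicit(lk, &expected, 1,
                    memory_order_acquire, memory_order_relaxed))
                return;
        }
    }

    /* CompareSwap (single-word compare-and-swap) */
    bool compare_swap(atomic_long *p, long expected, long desired)
    {
        return atomic_compare_exchange_strong(p, &expected, desired);
    }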


    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    In this case, I'm probably going to want to take a lock on the
    queue, then lock the list, then pop the first element off of the
    queue by taking a pointer to the head.

    Which can be done with compare and exchange, atomically;
    a lock may not be necessary.

    The insert into an ordered list will likely require a
    spin lock (or a reader-writer lock of some form).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Dan Cross on Mon Mar 24 13:46:41 2025
    Dan Cross wrote:
    In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:

    Sometimes you really don't want to be interrupted.
    And sometimes you don't want to be interrupted unless the
    "house is on fire"; I cannot see a time when "the house is
    on fire" that you would not want to take the interrupt.

    Is there one ?!?

    Consider a thread that takes a spinlock; suppose some
    high-priority interrupt comes in while the thread is holding
    that lock. In response to the interrupt, software decides to
    suspend the thread and switch some other thread; that thread
    wants to lock the spin lock that the now-descheduled thread is
    holding: a classic deadlock scenario.

    Terminology: mutexes coordinate mutual exclusion between threads,
    spinlocks coordinate mutual exclusion between cpu cores.
    Windows "critical sections" are mutexes with a fast path.

    A valid response here might be, "don't context switch from the
    interrupt handler; use a DPC instead". That may be valid, but
    it puts a constraint on the software designer that may be onerous
    in practice: suppose the interrupt happens in the scheduler,
    while examining a run queue or something. A DPC object must be
    available, etc.

    That is exactly what DPC/SoftIrq are designed to do - allow the bulk of
    the kernel, like the scheduler, non-paged heap management, IO pre and
    post processing, to be interrupted without causing reentrancy.
    The higher level device interrupt enqueues a request to the lower software interrupt level, which is processed when it will not cause reentrancy.

    That constraint is by design and any good OS will crash if violated.

    Further, software must now consider the complexity of
    potentially interruptable critical sections. From the
    standpoint of reasoning about already-complex concurrency issues
    it's simpler to be able to assert that (almost) all interrupt
    delivery can be cheaply disabled entirely, save for very
    special, specific, scenarios like NMIs. Potentially switching
    away from a thread holding a spinlock sort of defeats the
    purpose of a spinlock in the first place, which is a mutex
    primitive designed to avoid the overhead of switching.

    - Dan C.

    This is why I mentioned the terminology thing: threads do not hold
    spinlocks, they hold mutexes. Threads and mutexes are a fiction created
    by the OS scheduler, and switching threads while waiting for a mutex
    is exactly what they are supposed to do.

    Spinlocks are used by cores to coordinate access to shared memory by
    things like an OS scheduler looking at a list of threads waiting for a mutex.

    I like to think of it as all having to do with hats.
    The cpu is wearing one of three hats: a thread hat when it pretends to be
    a time shared execution machine; a core hat when it acts as a single
    execution machine running non-reentrant things like a scheduler which
    creates the fiction of threads and processes (multiple virtual spaces);
    and an interrupt hat when executing nestable reentrant interrupts.

    The interrupt mechanism coordinates exactly when and how we switch hats.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Mon Mar 24 17:36:28 2025
    On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:

    In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    ---------------------
    My architecture has a mechanism to perform ATOMIC stuff over multiple
    instruction time frames that has the property that a higher priority
    thread which interferes with a lower priority thread, the HPT wins
    and the LPT fails its ATOMIC event. It is for exactly this reason
    that I drag priority through the memory hierarchy--so that real-time
    things remain closer to real-time (no or limited priority inversions).

    Without being able to see this in practice, it's difficult to
    speculate as to how well it will actually work in real-world
    scenarios. What is the scope of what's covered by this atomic
    thing?

    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    In this case, I'm probably going to want to take a lock on the
    queue, then lock the list, then pop the first element off of the
    queue by taking a pointer to the head. Then I'll set the head
    to the next element of that item, then walk the list, finding
    the place to insert (make sure I'm keeping track of the "prev"
    pointer somehow), then do the dance to insert the item: set the
    item's next to the thing after where I'm inserting, set the
    prev's next to the item if it's not nil, or set the list head
    to the item. Then I unlock the list and then the queue (suppose
    for some reason that it's important that I hold the lock on the
    queue until the element is fully added to the list).
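
    For reference, a sketch of the conventional lock-based version of that
    sequence: pop the queue head, then insert it into the sorted list,
    holding both locks throughout. The spin_lock/spin_unlock helpers and
    the types are assumptions for illustration, not any particular
    kernel's API.

    #include <stddef.h>

    typedef struct Item {
        struct Item *next;
        int          key;               /* the list is kept ordered by key */
    } Item;

    typedef struct { volatile int locked; } Spinlock;
    extern void spin_lock( Spinlock * );
    extern void spin_unlock( Spinlock * );

    typedef struct { Spinlock lock; Item *head; } Queue;
    typedef struct { Spinlock lock; Item *head; } SortedList;

    void move_first_to_sorted( Queue *q, SortedList *l )
    {
        spin_lock( &q->lock );
        spin_lock( &l->lock );          /* fixed order: queue, then list   */

        Item *it = q->head;
        if( it != NULL ) {
            q->head = it->next;         /* pop from the queue              */

            Item **pp = &l->head;       /* walk to the insertion point     */
            while( *pp != NULL && (*pp)->key < it->key )
                pp = &(*pp)->next;
            it->next = *pp;             /* splice into the sorted list     */
            *pp = it;
        }

        spin_unlock( &l->lock );        /* release in reverse order        */
        spin_unlock( &q->lock );
    }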

    I have almost that in the documentation for my ATOMIC stuff::
    The following code moves a doubly linked element from one place
    in a concurrent data structure to another without any interested
    3rd party being able to see that the element is now within the
    CDS at any time::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        // marked loads join the ATOMIC event
        Element *fn = esmLOCKload( fr->next );
        Element *fp = esmLOCKload( fr->prev );
        Element *tn = esmLOCKload( to->next );
        // touch every participating cache line before the update
        esmLOCKprefetch( fn );
        esmLOCKprefetch( fp );
        esmLOCKprefetch( tn );
        if( !esmINTERFERENCE() )
        {
            // unlink fr from its old position
            fp->next = fn;
            fn->prev = fp;
            // link fr in after to
            to->next = fr;
            tn->prev = fr;
            fr->prev = to;
            // the marked store commits the whole event
            esmLOCKstore( fr->next, tn );
            return TRUE;
        }
        return FALSE;   // interfered with: no update became visible
    }

    The ESM macros simply tell the compiler to set a bit in the LD/ST
    instructions. esmINTERFERENCE is a query on the state of the event
    and whether a 2nd party has performed a memory action that kills
    the event.

    It is required to touch every cache line participating in the event
    prior to attempting the ATOMIC update--this is what esmLOCKprefetch
    does. One can use up to 8 cache lines in a single ATOMIC event.

    Either everyone in the system sees the update and at the same
    instant in time, or no one sees any update at all.

    Technically, there is no one holding a lock ...


    This isn't complicated, and for a relatively small list it's not
    expensive. Depending on the circumstances, I'd feel comfortable
    doing this with only spinlocks protecting both the queue and
    the list. But it's more than two or three instructions, and
    doesn't feel like something that would fit well into the "may
    fail atomic" model.

    Should be obvious how to change the insertion into a singly linked
    list in a different data structure.
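
    A hedged sketch of that singly linked variant, reusing the same esm*
    macros as the example above; here the insertion point 'prev' is
    assumed to have been found beforehand, outside the event, which only
    revalidates and performs the two-pointer splice.

    BOOLEAN InsertAfter( Element *prev, Element *item )
    {
        Element *next = esmLOCKload( prev->next );  // join the event
        esmLOCKprefetch( item );                    // touch item's line too
        if( !esmINTERFERENCE() )
        {
            item->next = next;                      // point item at successor
            esmLOCKstore( prev->next, item );       // commit: publish item
            return TRUE;
        }
        return FALSE;                               // interfered: caller retries
    }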

    Mandating to SW that typical SW must use this mechanism will not fly,
    at least in the first several iterations of an OS for the system.
    So the std. methods must remain available.

    AND to a large extent this sub-thread is attempting to get a
    usefully large number of them annotated.

    It sounds like the whole model is fairly different, and would
    require software writers to consider interrupts occurring in
    places that wouldn't have to in a more conventional design. I
    suspect you'd have a fairly serious porting issue with existing
    systems, but maybe I misunderstand.

    Which is why ID is available.

    A valid response here might be, "don't context switch from the
    interrupt handler; use a DPC instead". That may be valid, but
    it puts a constraint on the software designer that may be onerous
    in practice: suppose the interrupt happens in the scheduler,
    while examining a run queue or something. A DPC object must be
    available, etc.

    This seems to be onerous on SW mainly because of too many unknowns
    to track down and deal with.

    Yes.

    Further, software must now consider the complexity of
    potentially interruptable critical sections. From the
    standpoint of reasoning about already-complex concurrency issues
    it's simpler to be able to assert that (almost) all interrupt
    delivery can be cheaply disabled entirely, save for very
    special, specific, scenarios like NMIs. Potentially switching
    away from a thread holding a spinlock sort of defeats the
    purpose of a spinlock in the first place, which is a mutex
    primitive designed to avoid the overhead of switching.

    Agreed.

    The idea of priority as described above seems to prize latency
    above other considerations, but it's not clear to me that that
    is the right tradeoff. Exactly what problem is being solved?

    The instruction path into and out of handlers is shorter. Basically
    nobody has to look at the schedule queues because HW will do it for
    you. For example: GuestOS interrupt dispatcher is approximately::

    -- +-------------+-----+------------------------+
    R0 | ----------- | ISR | --- MSI-X message ---- |
    -- +-------------+-----+------------------------+

    Dispatcher:
    // at this point we are reentrant
    SR R7,R0,<39:32>
    CMP R2,R7,MaxTable
    PLO R2,TTTF
    SR R1,R0,<32:0>
    CALX [IP,R7<<3,Table-.]
    SVR
    // if we get here R0<37:32>
    // was out of bounds

    This does all of the register file saving and restoring, all of
    the DPC/softIRQ stuff, all of the checking for additional ISRs,
    IPIs, ... Changes priority, privilege, ROOT pointers and associated
    ASIDs from and back to.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to Scott Lurndal on Mon Mar 24 18:49:45 2025
    In article <pzdEP.126362$f5K3.26821@fx36.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    My architecture has a mechanism to perform ATOMIC stuff over multiple
    instruction time frames that has the property that a higher priority
    thread which interferes with a lower priority thread, the HPT wins
    and the LPT fails its ATOMIC event. It is for exactly this reason
    that I drag priority through the memory hierarchy--so that real-time
    things remain closer to real-time (no or limited priority inversions).

    Without being able to see this in practice, it's difficult to
    speculate as to how well it will actually work in real-world
    scenarios. What is the scope of what's covered by this atomic
    thing?

    Sounds a lot like transactional memory. Something that has
    yet to prove to be usable in the general case.

    I understand that some mallocs have done it; McRT-Malloc from
    Intel, if I'm not mistaken.

    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    In this case, I'm probably going to want to take a lock on the
    queue, then lock the list, then pop the first element off of the
    queue by taking a pointer to the head.

    Which can be done with compare and exchange, atomically;
    a lock may not be necessary.

    The insert into an ordered list will likely require a
    spin lock (or a reader-writer lock of some form).

    I'm sure one could do a lockless pop using a cmp exchange, but I
    wanted a scenario where one would want to hold onto both
    locks throughout the critical section for some reason.
    I just don't see how this works in the proposed scenario.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Mon Mar 24 18:39:25 2025
    In article <536efe4d9fd254ac1dfd37f9d9a43838@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 20 Mar 2025 12:50:09 +0000, Dan Cross wrote:

    In article <19ad12aadf9a22b760487418c871b3c6@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    But let me turn it around, what real-world problem does this
    interrupt prioritization scheme solve coupled with the inability
    to disable interrupts solve?

    The OS does not have to "walk" its schedule queues in order to
    context switch.

    Of course it does. The OS is the entity that selects the next
    thing to run and assigns priorities. Even in a scheme where the
    hardware has a notion of priority that the operating system can
    rely on directly, priorities can change over time and the OS has
    to be able to account for that.

    The SVR (SuperVisor Return) instruction checks the priority of
    the thread that had been interrupted, and if it is of lower
    priority than what is pending in the interrupt table, the
    one on the interrupt table receives control while the lower
    priority thread remains interrupted.

    First, this sounds completely non-portable. Second, how does
    one handle scheduling policies around threads that are all at
    the same priority level? Third, as an OS person, this is a pet
    peeve of mine, from both hardware and language perspectives: how
    does the hardware guarantee that its notion of a "thread" is
    the same as mine? I don't like it when systems force a
    particular "shape" of a solution on me. POSIX threads in
    hardware is something that I explicitly do not want, because any
    impedance mismatch between the hardware idea and what I need in
    the OS creates friction, and limits my ability to build the
    system that I want.

    Thus, the return path from an ISR, just does an SVR. If a
    DPC/softIRQ has been setup, or another interrupt has become
    pending, HW chooses the path--acting as if HW returned control
    to the interrupted thread, and then immediately took the inter-
    rupt of highest priority still pending--without having executed
    a single instruction in the interrupted thread.

    See above. As an OS person, I find this level of intimacy with
    the hardware horrifying. It reminds me of the VAX queue
    instructions; neat idea, VMS used them, but awful in practice
    and Unix largely ignored them. As a result, it made VMS harder
    to port than was necessary.

    Is anyone complaining that the
    ability to disable interrupts is somehow a serious burden?

    This thread started with the notion of MSI interrupts and the
    inherent need to DI upon arrival at ISR dispatcher. This sub-
    thread is investigating whether one can perform ISRs without
    disabling interrupts. Thus, making IDisablement an unnecessary
    part of interrupt servicing itself.

    Yes, but why?

    I can see some neat symmetry with, say, replacing NMIs with
    "infinite priority" interrupts, and perhaps using an interrupt
    priority mask to de facto disable interrupt delivery at the
    local core; maybe that's useful internally. But I'm not seeing
    the use case beyond that.

    Your (and EricP)'s contribution is to annotate those instances
    where one can arrive at an ISR IE and still want/need to DI and EI
    independent of the control transfer from and back to interrupted
    thread (even if control arrives reentrantly from interrupted thread).

    Many computer designers copy/modify their interrupt architectures
    from machines architected prior to the virtualization of Operating
    Systems, virtualization of interrupts, HyperVisors, Paranoid threads
    (like banking applications) that have survived into the present.

    I went the other way, trying to develop a system whereby the path
    across context switching domains is cheap and easy*. Not needing
    to DI at entry to ISR or exception handlers (more or less) fell
    out for free due to the way "the rest of the system" evolved.

    (*) cheap for HW, and easy expressly for SW

    Not being employed, I do not have an OS guy to bounce ideas off
    and discuss consequences of going this way or that way,
    so I lean on a few people here to "beat me up" when
    my ideas lead me astray.

    18 months ago I had a real-time version of the architecture,
    with the property that at any instant in time, each core was
    executing the highest priority thread, cores had a means to
    message back and forth as if it were a remote procedure call
    which could return (synchronous; or not--asynchronous) without
    an OS mediating these things. I abandoned it not because it
    was not good for real time, but because there was no way to
    make it Linux Friendly.

    Where I am now, is my attempt at being Linux friendly.

    This sounds like it may be interesting for a hard real-time
    controller type application with a very thin RTOS, but it's not
    all that useful for a general-purpose system.

    Here's the problem, though. Being "Linux friendly" is going to
    impose design constraints on you that will force you in the
    direction of producing an architecture that is far more
    conventional than what you are proposing above. If that's what
    you want, that's totally fine of course; but if you want to
    experiment with a substantially different system architecture,
    you're going to have to accept that your software story is going
    to be largely bespoke and unportable. That's not to say your
    _userspace_ story has to be like that, but your OS will.

    Moreover, there's the issue of, how do you expose these
    primitives through a high-level language? Most programmers are
    not going to be willing to drop down to assembler for these
    atomic sequences; maybe to build a mutex primitive or something,
    but not for much larger things.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Mon Mar 24 18:55:31 2025
    In article <33d525dc9450a654173f9313f0564224@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:
    In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    ---------------------
    My architecture has a mechanism to perform ATOMIC stuff over multiple
    instruction time frames that has the property that a higher priority
    thread which interferes with a lower priority thread, the HPT wins
    and the LPT fails its ATOMIC event. It is for exactly this reason
    that I drag priority through the memory hierarchy--so that real-time
    things remain closer to real-time (no or limited priority inversions).

    Without being able to see this in practice, it's difficult to
    speculate as to how well it will actually work in real-world
    scenarios. What is the scope of what's covered by this atomic
    thing?

    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    In this case, I'm probably going to want to take a lock on the
    queue, then lock the list, then pop the first element off of the
    queue by taking a pointer to the head. Then I'll set the head
    to the next element of that item, then walk the list, finding
    the place to insert (make sure I'm keeping track of the "prev"
    pointer somehow), then do the dance to insert the item: set the
    item's next to the thing after where I'm inserting, set the
    prev's next to the item if it's not nil, or set the list head
    to the item. Then I unlock the list and then the queue (suppose
    for some reason that it's important that I hold the lock on the
    queue until the element is fully added to the list).

    I have almost that in the documentation for my ATOMIC stuff::
    The following code moves a doubly linked element from one place
    in a concurrent data structure to another without any interested
    3rd party being able to see that the element is now within the
    CDS at any time::

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        Element *fn = esmLOCKload( fr->next );
        Element *fp = esmLOCKload( fr->prev );
        Element *tn = esmLOCKload( to->next );
        esmLOCKprefetch( fn );
        esmLOCKprefetch( fp );
        esmLOCKprefetch( tn );
        if( !esmINTERFERENCE() )
        {
            fp->next = fn;
            fn->prev = fp;
            to->next = fr;
            tn->prev = fr;
            fr->prev = to;
            esmLOCKstore( fr->next, tn );
            return TRUE;
        }
        return FALSE;
    }

    The ESM macros simply tell the compiler to set a bit in the LD/ST
    instructions. esmINTERFERENCE is a query on the state of the event
    and whether a 2nd party has performed a memory action that kills
    the event.

    It is required to touch every cache line participating in the event
    prior to attempting the ATOMIC update--this is what esmLOCKprefetch
    does. One can use up to 8 cache lines in a single ATOMIC event.

    Either everyone in the system sees the update and at the same
    instant in time, or no one sees any update at all.

    Technically, there is no one holding a lock ...

    Suppose I need to get the first two elements of the list.

    This isn't complicated, and for a relatively small list it's not
    expensive. Depending on the circumstances, I'd feel comfortable
    doing this with only spinlocks protecting both the queue and
    the list. But it's more than two or three instructions, and
    doesn't feel like something that would fit well into the "may
    fail atomic" model.

    Should be obvious how to change the insertion into a singly linked
    list in a different data structure.

    Mandating to SW that typical SW must use this mechanism will not fly,
    at least in the first several iterations of an OS for the system.
    So the std. methods must remain available.

    AND to a large extent this sub-thread is attempting to get a
    usefully large number of them annotated.

    It sounds like the whole model is fairly different, and would
    require software writers to consider interrupts occurring in
    places that wouldn't have to in a more conventional design. I
    suspect you'd have a fairly serious porting issue with existing
    systems, but maybe I misunderstand.

    Which is why ID is available.

    A valid response here might be, "don't context switch from the
    interrupt handler; use a DPC instead". That may be valid, but
    it puts a constraint on the software designer that may be onerous
    in practice: suppose the interrupt happens in the scheduler,
    while examining a run queue or something. A DPC object must be
    available, etc.

    This seems to be onerous on SW mainly because of too many unknowns
    to track down and deal with.

    Yes.

    Further, software must now consider the complexity of
    potentially interruptable critical sections. From the
    standpoint of reasoning about already-complex concurrency issues
    it's simpler to be able to assert that (almost) all interrupt
    delivery can be cheaply disabled entirely, save for very
    special, specific, scenarios like NMIs. Potentially switching
    away from a thread holding a spinlock sort of defeats the
    purpose of a spinlock in the first place, which is a mutex
    primitive designed to avoid the overhead of switching.

    Agreed.

    The idea of priority as described above seems to prize latency
    above other considerations, but it's not clear to me that that
    is the right tradeoff. Exactly what problem is being solved?

    The instruction path into and out of handlers is shorter. Basically
    nobody has to look at the schedule queues because HW will do it for
    you. For example: GuestOS interrupt dispatcher is approximately::

    -- +-------------+-----+------------------------+
    R0 | ----------- | ISR | --- MSI-X message ---- |
    -- +-------------+-----+------------------------+

    Dispatcher:
    // at this point we are reentrant
    SR R7,R0,<39:32>
    CMP R2,R7,MaxTable
    PLO R2,TTTF
    SR R1,R0,<32:0>
    CALX [IP,R7<<3,Table-.]
    SVR
    // if we get here R0<37:32>
    // was out of bounds

    This does all of the register file saving and restoring, all of
    the DPC/softIRQ stuff, all of the checking for additional ISRs,
    IPIs, ... Changes priority, privilege, ROOT pointers and associated
    ASIDs from and back to.

    But again, why is that important? Is that a significant
    bottleneck for systems software? Is it particularly burdensome
    for software for some reason?

    I explicitly don't want the hardware to have that level of
    control in my scheduling subsystem.

    This sounds like a solution looking for a problem.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Mon Mar 24 18:56:50 2025
    On Mon, 24 Mar 2025 15:55:33 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Sun, 23 Mar 2025 16:16:16 +0000, EricP wrote:

    I may come back to the snipped stuff later


    But all of this adds to the return from interrupt code path length.

    As noted in my response to Dan C, the return from interrupt code path
    length is the RET instruction from the ISR back to Dispatcher and the
    SVR instruction at the Dispatcher. Is this short enough?

    ------------------
    I, too, have a single instruction (SVR) that delivers control to the
    highest priority still pending thread, regardless of whether it is
    a new interrupt, a scheduled DPC/softIRQ, a local timer, an IPI, or
    the thread the interrupt stack was placed upon.

    SVR looks at the priority of the thread to receive control, and the
    priorities of pending 'interrupts' (of which DPC/softIRQ and IPIs
    are part) and transfers control to the highest priority pending
    thread--occasionally that thread is the interrupted one.

    The new idea I'm proposing is CSVR *Conditional* Supervisor Return
    which can fall through in the unlikely event something disturbs the non-interruptible return sequence.

    I detected that. My SVR has most of those properties (can do several
    different return sequences including take the next interrupt) and does
    not necessarily return whence the interrupt was taken.

    This can be used to optimize the return path by avoiding returning from
    an interrupt only to immediately take a priority-1 software interrupt.

    My 66000 SVR exits a current handler, but instead of returning to the interrupted thread, it checks for any other interrupt disturbances
    and if it finds one of higher priority than the interrupted thread, it
    does the higher priority one(s) first, leaving the interrupted thread
    for later.

    SW does not have to look. SW just has to send DPCs/softIRQs as MSI-X
    messages to the interrupt table of its choice. So, setting up a DPC/
    softIRQ requires storing some data in a SW controlled place and manner,
    and performing a single ST to the interrupt aperture to enQueue the
    message at chosen priority and privilege, to be serviced by chosen
    handler.

    Should a higher priority thread become pending while SVR is in progress,
    the higher priority thread will receive control before the first inst-
    ruction of the mis-receiving thread is performed.

    As far as you have explained these things to me, I see no race
    conditions in the architected means.

    Perhaps not for your approach but I'm looking at other optimizations
    that might introduce them. One was optimizing the common return path:

    It is currently standing at 2 instructions--an RET from ISR back to its dispatcher, followed immediately by a SVR.

    Before a supervisor return starts to reload the saved register set
    it tests if a DPC is pending and about to generate a software interrupt,
    and if so jumps back into the Dispatcher. This saves reloading the
    registers only to immediately take another interrupt, a pipeline
    drain and fill, and resaving the same registers.

    Yes, yes, my SVR does that.

    To do this optimization, before it starts to reload the registers
    it tests the Interrupt Control Register (ICR) to check if it is about to trigger a SWI and if so adjusts the ICR and jumps into the Dispatcher. Otherwise it restores registers and does a Conditional Supervisor Return which reperforms the same test and can fall through if the test fails.

    Because I do things in HW, I can start the RF reload (bring in the 5
    cache lines of the thread) and while waiting for them to arrive, look
    at the pending interrupts and abandon the return, ignoring the fetched
    thread state; and proceed back to the top of Dispatcher. So, if SVR
    finds more stuff to do, it can continue to use the RF it has.

    But, because the next interrupt may be Hyper instead of Super (or
    even vice versa) architecture states that HW uses the RF of the
    current privilege level. This may require doing nothing (super->super)
    or swapping RF (hyper->super or super->secure, from != to)

    Architecturally, HW is always using the thread-state of core SW-stack
    context pointers indexed by the privilege level.

    This also requires that kernel/interrupt stacks be switched a certain
    way, which just so happens to be the same as the x86 way. That is,
    hardware only switches stacks if interrupted mode is User. This leaves
    it up to OS software to decide which stack to use at what points
    inside the kernel.

    Does x86 HW switch stacks when going from super to hyper or hyper to
    Secure ?? like it does from user to super ??
    --------
    My 66000 has the property that a core runs a SW stack, each element
    on the stack points at thread-state. Thread-state has all the control
    registers needed to run a thread and a register file for each level
    on the stack.

    It is possible to program the system such that a (semi-trusted) user
    (no privilege) is allowed to "take" his own interrupts. When doing
    so, he really wants to use his own (non-privileged) stack.

    It is also possible for a single interrupt table to hold interrupts
    for any privilege level at any priority.
    ---------
    You may ask why ? One of the goals here is to take the same number of
    cycles to context switch between user processes as it does to switch
    between different HyperVisors. Currently these are in the 1000 cycle
    range (user->user) and 10,000 range for (Guest OS[1]->Guest OS[3]).
    I want both of these down in the sub-100 cycle range, close to 20 cycles
    plus memory latency.

    For example, one can context switch the user process by overwriting
    its core software stack CR with the new process' Thread-state pointer
    (and then when SVR happens the new process runs instead of the old
    process) -- context switch is essentially 1 instruction. This happens
    at Guest OS, and Hypervisor levels of the SW stack, too.

    Supervisory software has the same path to figure out what to do
    (next) and then once decided the code-path to "make it happen"
    is tiny.

    Also I'd save the interrupt context, the old RIP, old RSP and old
    interrupt status in hardware registers, not on the stack, and the
    interrupt vector RIP address is calculated based on the priority.
    This eliminates all memory access from the exception and interrupt
    entry and exit transition sequences, and thereby eliminates page
    faults and memory error possibility and cache access latency.

    As you remain aware, all thread control registers have a known memory
    address from which they are read and written, these do not go on any
    stack. Along with the RF they occupy 5 sequential cache lines. HW
    treats them (essentially) as a write back cache.

    The only things on the supervisor stack are the original value in R0
    (which I had to clobber to put the MSI-X message in), a 2-bit field
    holding the interrupted privilege level, and a 4-bit token telling SVR
    what kind of stack it is dealing with {Interrupt, exception, SVC}.
    {{Looking at this now, I could probably get rid of R0 on the stack}}

    One of the reasons I want HW doing these things, is HW strips off
    bits SW is not supposed to be looking at. It is not clear if even
    supervisory SW can be so trusted--witness we have (essentially
    unprivileged) units of SW that do not even want GuestOS to look at
    their data (banking applications, video, rented audio, ...). They
    want GuestOS to perform the I/O but not look at any of the data.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Mon Mar 24 19:09:17 2025
    On Mon, 24 Mar 2025 18:02:25 +0000, Stefan Monnier wrote:

    MitchAlsup1 [2025-03-24 17:36:28] wrote:
    On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:
    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    [...]
    I have almost that in the documentation for my ATOMIC stuff::
    The following code moves a doubly linked element from one place
    in a concurrent data structure to another without any interested
    3rd party being able to see that the element is now within the
    CDS at any time::

    Would your "ATOMIC stuff" really handle Dan's example, tho?
    A key element in Dan's example if the "ordered singly-linked list",
    i.e. you need to walk the list an unbounded number of steps until you
    can find the insertion point, so I don't think you can do the whole
    thing inside your ATOMIC stuff.

    Technically, yes, you can put a search loop in an ATOMIC event.
    Whether you want to or not is a different story. My example
    illustrated the ATOMIC manipulation of the event without a search.

    In addition, if the search is long, the code developer should either
    do the search before the event, or grab a Mutex to prevent
    interference during the search and rest of the critical region.

    IOW, you'd have to somehow move the list-walk outside of the atomic
    section.

    Yes,

    E.g. if you know elements are only ever removed from the tail of the
    list, you can presumably walk the list outside of the ATOMIC stuff and
    then do the "remove from the queue and insert into the list" atomically
    since it's a small&finite number of steps (with a test to loop back to
    the list-walk if someone else inserted or removed anything at that point
    in the mean time).

    Or maintain a tail pointer.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Mon Mar 24 19:46:08 2025
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 3/24/2025 11:49 AM, Dan Cross wrote:
    In article <pzdEP.126362$f5K3.26821@fx36.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <4603ec2d5082f16ab0588b4b9d6f96c7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    My architecture has a mechanism to perform ATOMIC stuff over multiple
    instruction time frames that has the property that a higher priority
    thread which interferes with a lower priority thread, the HPT wins
    and the LPT fails its ATOMIC event. It is for exactly this reason
    that I drag priority through the memory hierarchy--so that real-time
    things remain closer to real-time (no or limited priority inversions).
    Without being able to see this in practice, it's difficult to
    speculate as to how well it will actually work in real-world
    scenarios. What is the scope of what's covered by this atomic
    thing?

    Sounds a lot like transactional memory. Something that has
    yet to prove to be usable in the general case.

    I understand that some mallocs have done it; McRT-Malloc from
    Intel, if I'm not mistaken.

    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    In this case, I'm probably going to want to take a lock on the
    queue, then lock the list, then pop the first element off of the
    queue by taking a pointer to the head.

    Which can be done with compare and exchange, atomically;
    a lock may not be necessary.

    The insert into an ordered list will likely require a
    spin lock (or a reader-writer lock of some form).

    I'm sure one could do a lockless pop using a cmp exchange, but I
    wanted a scenario where one would want to hold onto both
    locks throughout the critical section for some reason.
    I just don't see how this works in the proposed scenario.

    You want to hold lock A while lock B is being used? That can be tricky.
    Well, there is a locking order.

    Architecturally, we implemented locks in the hardware in the Burroughs
    V-Series mainframes. Each lock was assigned (by the programmer) a 'canonical lock number' (CLN) which was a four digit value between 0 and 9999. The
    CPU would track the currently active CLN and fault any lock instruction
    that attempted to acquire a lock with a CLN less than or equal to
    the current CLN. Unlocks would fault if the locks were not released
    in the reverse order. This guaranteed deadlock-free locking, but
    complicated the programming - with lots of locks, assigning
    the CLN to a particular lock was tricky. We used it in the operating
    system (four-cpu system) very successfully.
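
    A software analogue of that check might look like the sketch below
    (the V-Series did this in hardware; the names and limits here are
    hypothetical): each lock carries a CLN, acquiring a lock whose CLN is
    not strictly greater than the highest CLN already held is treated as
    a fatal ordering violation, and releases must come in reverse order.

    #include <assert.h>
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t mtx;
        int             cln;        /* canonical lock number, 0..9999 */
    } OrderedLock;

    static _Thread_local int held[16];   /* CLNs currently held, in order */
    static _Thread_local int depth;

    void ordered_lock( OrderedLock *lk )
    {
        /* the real machine faulted here instead of asserting */
        assert( depth < 16 && (depth == 0 || lk->cln > held[depth - 1]) );
        pthread_mutex_lock( &lk->mtx );
        held[depth++] = lk->cln;
    }

    void ordered_unlock( OrderedLock *lk )
    {
        /* locks must be released in reverse acquisition order */
        assert( depth > 0 && held[depth - 1] == lk->cln );
        depth--;
        pthread_mutex_unlock( &lk->mtx );
    }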

    There were similar instructions to implement condition variables
    with wait and signal semantics.

    A small microkernel was responsible for scheduling; if a lock
    was already held, the thread would be added to a waiting list
    and a runnable thread dispatched. microkernel interrupts were
    relatively low cost compared with normal interrupts. A single
    instruction was used to dispatch the next task (BRV - Branch
    Virtual Reinstate).

    When a lock was unlocked, and there was a pending waiter, the
    instruction would trap to the microkernel to handle the case
    and place the waiting thread on a runnable queue (by priority).

    When a lock acquisition failed because another thread owned
    the lock, the instruction would trap to the microkernel which
    would dispatch the next runnable task.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to ThatWouldBeTelling@thevillage.com on Mon Mar 24 20:11:46 2025
    In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:

    Sometimes you really don't want to be interrupted.
    And sometimes you don't want to be interrupted unless the
    "house is on fire"; I cannot see a time when "the house is
    on fire" that you would not want to take the interrupt.

    Is there one ?!?

    Consider a thread that takes a spinlock; suppose some
    high-priority interrupt comes in while the thread is holding
    that lock. In response to the interrupt, software decides to
    suspend the thread and switch some other thread; that thread
    wants to lock the spin lock that the now-descheduled thread is
    holding: a classic deadlock scenario.

    Terminology: mutexes coordinate mutual exclusion between threads,
    spinlocks coordinate mutual exclusion between cpu cores.
    Windows "critical sections" are mutexes with a fast path.

    A spin lock is simply any lock where you spin trying to acquire
    the lock, as opposed to a blocking synchronization protocol.
    Here I'm using the terminology of Herlihy and Shavit [Her08].

    Traditional Unix "sleep" and "wakeup" are an example of a
    blocking protocol, where a thread may "sleep" on some "channel",
    yielding the locking thread while the lock cannot be acquired,
    presumably scheduling something else until the thread is later
    marked runnable by virtue of something else calling "wakeup" on
    the sleep channel.
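
    A toy sketch of that blocking protocol, in the style of the classic
    Unix kernel's sleep()/wakeup() on a channel; sleep() and wakeup() are
    assumed kernel-internal primitives here, not a portable API, and the
    lost-wakeup race is assumed to be handled by the usual brief
    interrupt-priority raise around the test.

    struct sleeplock {
        int locked;
    };

    extern void sleep( void *chan, int pri );  /* block; run someone else */
    extern void wakeup( void *chan );          /* mark sleepers runnable  */

    void sleeplock_acquire( struct sleeplock *lk )
    {
        while( lk->locked )        /* cannot have it now:                 */
            sleep( lk, 0 );        /* yield the CPU until it is released  */
        lk->locked = 1;
    }

    void sleeplock_release( struct sleeplock *lk )
    {
        lk->locked = 0;
        wakeup( lk );              /* let any sleeper retry the acquire   */
    }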

    But note that I am deliberately simplifying in order to
    construct a particular scenario in which a light-weight
    synchronization primitive is being used for mutual exclusion
    between concurrent entities, hence mentioning spinlocks
    specifically.

    Suggesting that spin locks are only applicable to
    multiprocessing scenarios is flatly incorrect. Software may use
    the technique to spin on a "busy" bit on some IP in a device for
    example, even on a uniprocessor system.

    A valid response here might be, "don't context switch from the
    interrupt handler; use a DPC instead". That may be valid, but
    it puts a constraint on the software designer that may be onerous
    in practice: suppose the interrupt happens in the scheduler,
    while examining a run queue or something. A DPC object must be
    available, etc.

    That is exactly what DPC/SoftIrq are designed to do - allow the bulk of
    the kernel, like the scheduler, non-paged heap management, IO pre and
    post processing, to be interrupted without causing reentrancy.
    The higher level device interrupt enqueues a request to the lower software
    interrupt level, which is processed when it will not cause reentrancy.

    That constraint is by design and any good OS will crash if violated.

    Yes, I'm aware of the technique. That wasn't the point.

    Further, software must now consider the complexity of
    potentially interruptable critical sections. From the
    standpoint of reasoning about already-complex concurrency issues
    it's simpler to be able to assert that (almost) all interrupt
    delivery can be cheaply disabled entirely, save for very
    special, specific, scenarios like NMIs. Potentially switching
    away from a thread holding a spinlock sort of defeats the
    purpose of a spinlock in the first place, which is a mutex
    primitive designed to avoid the overhead of switching.

    This is why I mentioned the terminology thing: threads do not hold
    spinlocks, they hold mutexes.

    See above. Threads can certainly "hold" a spin lock, as they
    can hold any kind of lock. To quote from sec 7.6.1 of [Vah96],
    page 202:

    |On a uniprocessor, if a thread tries to acquire a spin lock
    |that is already held, it will loop forever. Multiprocessor
    |algorithms, however, must operate correctly regardless of the
    |number of processors, which means that they should handle the
    |uniprocessor case as well. This requires strict adherence to
    |the rule that threads not relinquish control of the CPU while
    |holding a spin lock.

    Threads and mutexes are a fiction created
    by the OS scheduler, and switching threads while waiting for a mutex
    is exactly what they are supposed to do.

    Spinlocks are used by cores to coordinate access to shared memory by
    things like an OS scheduler looking at a list of threads waiting for a mutex.

    Spinlocks, of the form I was describing, are an optimization for
    cases where the critical section guarded by the lock is short,
    and the overhead of invoking a scheduler and context switching
    to some other thread is higher than just spinning until the lock
    becomes available again. The whole point is to avoid blocking
    and switching.

    E.g., as [Vaj11] puts it (page 98):

    |Spin locks have the advantage of preventing pre-emption by the
    |OS (due to blocking while waiting for the lock)

    That is, the advantage is because the thread does not block and
    yield.
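
    A minimal C11 spinlock illustrating exactly that point: acquisition
    spins in place rather than blocking and yielding the CPU.

    #include <stdatomic.h>

    typedef struct {
        atomic_flag held;
    } spinlock_t;

    #define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

    static inline void spin_acquire( spinlock_t *sl )
    {
        /* test-and-set until the flag was previously clear; no scheduler */
        while( atomic_flag_test_and_set_explicit( &sl->held,
                                                  memory_order_acquire ) )
            ;   /* spin */
    }

    static inline void spin_release( spinlock_t *sl )
    {
        atomic_flag_clear_explicit( &sl->held, memory_order_release );
    }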

    I like to think of it as all having to do with hats.
    The cpu is wearing one of three hats: a thread hat when it pretends to be
    a time shared execution machine; a core hat when it acts as a single
    execution machine running non-reentrant things like a scheduler which
    creates the fiction of threads and processes (multiple virtual spaces);
    and an interrupt hat when executing nestable reentrant interrupts.

    The interrupt mechanism coordinates exactly when and how we switch hats.

    I suppose you are lumping system calls into the overall bucket
    of "interrupts" in this model, regardless of whether one might
    use a dedicated supervisor call instruction (e.g., something
    like x86 SYSENTER or SYSCALL or ARM SVC or RISC-V's ECALL)?

    The hat analogy feels strained to me, and too colloquial to be
    useful.

    - Dan C.

    References:

    [Her08] Maurice Herlihy and Nir Shavit. 2008. _The Art of
    Multiprocessor Programming_. Morgan Kaufmann, Burlington, MA.

    [Vah96] Uresh Vahalia. 1996. _Unix Internals: The New
    Frontiers_. Prentice Hall, Upper Saddle River, NJ.

    [Vaj11] Andras Vajda. 2011. _Programming Many-Core Chips_.
    Springer, New York, NY.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to Chris M. Thomasson on Mon Mar 24 20:14:47 2025
    In article <vrsaro$1faj3$9@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 3/24/2025 11:49 AM, Dan Cross wrote:
    [snip]
    I'm sure one could do a lockless pop using a cmp exchange, but I
    wanted a scenario where one would want to hold onto both
    locks throughout the critical section for some reason.
    I just don't see how this works in the proposed scenario.

    You want to hold lock A while lock B is being used? That can be tricky.

    Really? This happens all the time. Of course it can be tricky,
    as concurrent programming always is, but it's incredibly normal.

    Well, there is a locking order. Check out this older work of mine, multex:

    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

    Not sure if this is relevant or not.

    It doesn't appear to be; sorry.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Mon Mar 24 20:28:40 2025
    In article <96987d473656f296ec63c30626a46730@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 24 Mar 2025 18:02:25 +0000, Stefan Monnier wrote:

    MitchAlsup1 [2025-03-24 17:36:28] wrote:
    On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:
    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    [...]
    I have almost that in the documentation for my ATOMIC stuff::
    The following code moves a doubly linked element from one place
    in a concurrent data structure to another without any interested
    3rd party being able to see that the element is now within the
    CDS at any time::

    Would your "ATOMIC stuff" really handle Dan's example, tho?
    A key element in Dan's example if the "ordered singly-linked list",
    i.e. you need to walk the list an unbounded number of steps until you
    can find the insertion point, so I don't think you can do the whole
    thing inside your ATOMIC stuff.

    Technically, yes, you can put a search loop in an ATOMIC event.
    Whether you want to or not is a different story. My example
    illustrated the ATOMIC manipulation of the event without a search.

    In addition, if the search is long, the code developer should either
    do the search before the event, or grab a Mutex to prevent
    interference during the search and rest of the critical region.

    The problem, as specified, excluded this idea that the list was
    long.

    The point was simply to show a scenario where your atomic
    operation proposal breaks down, and where you might want to
    disable interrupt delivery. The point wasn't to ask whether
    the specific problem could be recast in a different way; it
    probably can. The question was to ask whether your proposal
    could support that specific problem. I don't see how it can.

    IOW, you'd have to somehow move the list-walk outside of the atomic
    section.

    Yes,

    E.g. if you know elements are only ever removed from the tail of the
    list, you can presumably walk the list outside of the ATOMIC stuff and
    then do the "remove from the queue and insert into the list" atomically
    since it's a small&finite number of steps (with a test to loop back to
    the list-walk if someone else inserted or removed anything at that point
    in the mean time).

    Or maintain a tail pointer.

    There's no guarantee that the tail is the correct place to
    insert the element, vis the order of the elements.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Mon Mar 24 22:29:23 2025
    On Mon, 24 Mar 2025 20:28:40 +0000, Dan Cross wrote:

    In article <96987d473656f296ec63c30626a46730@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 24 Mar 2025 18:02:25 +0000, Stefan Monnier wrote:

    MitchAlsup1 [2025-03-24 17:36:28] wrote:
    On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:
    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    [...]
    I have almost that in the documentation for my ATOMIC stuff::
    The following code moves a doubly linked element from one place
    in a concurrent data structure to another without any interested
    3rd party being able to see that the element is now within the
    CDS at any time::

    Would your "ATOMIC stuff" really handle Dan's example, tho?
    A key element in Dan's example if the "ordered singly-linked list",
    i.e. you need to walk the list an unbounded number of steps until you
    can find the insertion point, so I don't think you can do the whole
    thing inside your ATOMIC stuff.

    Technically, yes, you can put a search loop in an ATOMIC event.
    Whether you want to or not is a different story. My example
    illustrated the ATOMIC manipulation of the event without a search.

    In addition, if the search is long, the code developer should either
    do the search before the event, or grab a Mutex to prevent
    interference during the search and rest of the critical region.

    The problem, as specified, excluded this idea that the list was
    long.

    The point was simply to show a scenario where your atomic
    operation proposal breaks down, and where you might want to
    disable interrupt delivery.

    My ATOMIC stuff allows for the construction, in SW, of pretty
    much any ATOMIC primitive anyone would like. It is, in general,
    not suited to guard critical regions, but to grab the lock which
    does.

    Most uses of My ATOMIC stuff are to grab/change/update data
    that then guards various critical regions. More like a multi-
    valued version of LDL-STC.

    The point wasn't to ask whether
    the specific problem could be recast in a different way; it
    probably can. The question was to ask whether your proposal
    could support that specific problem. I don't see how it can.

    To the question posed, you should just grab the lock/mutex.

    IOW, you'd have to somehow move the list-walk outside of the atomic
    section.

    Yes,

    E.g. if you know elements are only ever removed from the tail of the
    list, you can presumably walk the list outside of the ATOMIC stuff and
    then do the "remove from the queue and insert into the list" atomically
    since it's a small&finite number of steps (with a test to loop back to
    the list-walk if someone else inserted or removed anything at that point
    in the mean time).

    Or maintain a tail pointer.

    There's no guarantee that the tail is the correct place to
    insert the element, vis the order of the elements.

    If the element pointer updates and the tail pointer update are inside
    one of my ATOMIC events, then you are sure that the tail pointer is
    always pointing at the last element in the list (if there is one).

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Mon Mar 24 22:50:09 2025
    On Mon, 24 Mar 2025 20:11:46 +0000, Dan Cross wrote:

    In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    -------------------
    This is why I mentioned the terminology thing: threads do not hold
    spinlocks, they hold mutexes.

    See above. Threads can certainly "hold" a spin lock, as they
    can hold any kind of lock. To quote from sec 7.6.1 of [Vah96],
    page 202:

    |On a uniprocessor, if a thread tries to acquire a spin lock
    |that is already held, it will loop forever. Multiprocessor
    |algorithms, however, must operate correctly regardless of the
    |number of processors, which means that they should handle the
    |uniprocessor case as well. This requires strict adherence to
    |the rule that threads not relinquish control of the CPU while
    |holding a spin lock.

    My 66000 ATOMIC stuff is designed such that if control has to leave
    the ATOMIC event, all HW knowledge of being in that event disappears
    and IP is modified (prior to control transfer) such that upon return
    SW knows the event failed (or never started), and that no update of participating data in the event ever becomes visible.

    HW can do this with address-based events, but not with data-based
    events. LDL-STC are address-based events used to manipulate data-
    based events. Mine just allow more LDs and STs in the event than 1.

    So, if you want the property whereby the lock disappears on any
    control transfer out of the event {exception, interrupt, SVC, SVR, ...};
    then you want to use my ATOMIC stuff; otherwise, you can use the
    normal ATOMIC primitives everyone and his brother provide.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Mon Mar 24 23:45:49 2025
    In article <a86295425dfb043a1b44e550a1c05659@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 24 Mar 2025 20:11:46 +0000, Dan Cross wrote:

    In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    -------------------
    This is why I mentioned the terminology thing: threads do not hold
    spinlocks, they hold mutexes.

    See above. Threads can certainly "hold" a spin lock, as they
    can hold any kind of lock. To quote from sec 7.6.1 of [Vah96],
    page 202:

    |On a uniprocessor, if a thread tries to acquire a spin lock
    |that is already held, it will loop forever. Multiprocessor
    |algorithms, however, must operate correctly regardless of the
    |number of processors, which means that they should handle the
    |uniprocessor case as well. This requires strict adherence to
    |the rule that threads not relinquish control of the CPU while
    |holding a spin lock.

    My 66000 ATOMIC stuff is designed such that if control has to leave
    the ATOMIC event, all HW knowledge of being in that event disappears
    and IP is modified (prior to control transfer) such that upon return
    SW knows the event failed (or never started), and that no update of
    participating data in the event ever becomes visible.

    HW can do this with address-based events, but not with data-based
    events. LDL-STC are address-based events used to manipulate data-
    based events. Mine just allow more LDs and STs in the event than 1.

    So, if you want the property whereby the lock disappears on any
    control transfer out of the event {exception, interrupt, SVC, SVR, ...};
    then you want to use my ATOMIC stuff; otherwise, you can use the
    normal ATOMIC primitives everyone and his brother provide.

    I definitely do Not want that property. Having a lock
    "disappear" on delivery of an interrupt seems like it would be a
    great way to introduce silent corruption.

    If your atomic primitives are not sufficient to cover an entire
    critical section, as I believe we have established, and are
    mostly intended to be used for building concurrency primitives
    like locks, then I expect that software would take the lock, see
    that it succeeded, and enter the critical section. If at that
    point the lock were silently discarded on receipt of an
    interrupt, then potentially some other thread could subsequently
    take the lock, enter the critical section, and observe a data
    structure in an inconsistent state (leading to a crash in the
    best case). This is precisely the sort of thing we build mutual
    exclusion primitives to wrap around critical sections to avoid.

    If, on the other hand, the atomic primitives are just meant for
    the spin loop, then I don't really see how that's all that
    useful compared to LL/SC. And you still need some way to ensure
    that the critical section is not interrupted.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Mon Mar 24 23:27:34 2025
    In article <684d32898ec85b91bcb9dcdb97d8065a@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 24 Mar 2025 20:28:40 +0000, Dan Cross wrote:
    In article <96987d473656f296ec63c30626a46730@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 24 Mar 2025 18:02:25 +0000, Stefan Monnier wrote:

    MitchAlsup1 [2025-03-24 17:36:28] wrote:
    On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:
    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    [...]
    I have almost that in the documentation for my ATOMIC stuff::
    The following code moves a doubly linked element from one place
    in a concurrent data structure to another without any interested
    3rd party being able to see that the element is now within the
    CDS at any time::

    Would your "ATOMIC stuff" really handle Dan's example, tho?
    A key element in Dan's example if the "ordered singly-linked list",
    i.e. you need to walk the list an unbounded number of steps until you
    can find the insertion point, so I don't think you can do the whole
    thing inside your ATOMIC stuff.

    Technically, yes, you can put a search loop in an ATOMIC event.
    Whether you want to or not is a different story. My example
    illustrated the ATOMIC manipulation of the event without a search.

    In addition, if the search is long, the code developer should either
    do the search before the event, or grab a Mutex to prevent
    interference during the search and rest of the critical region.

    The problem, as specified, excluded this idea that the list was
    long.

    The point was simply to show a scenario where your atomic
    operation proposal breaks down, and where you might want to
    disable interrupt delivery.

    My ATOMIC stuff allows for the construction, in SW, of pretty
    much any ATOMIC primitive anyone would like. It is, in general,
    not suited to guard critical regions, but to grab the lock which
    does.

    Ok, that's good to understand.

    Most uses of My ATOMIC stuff are to grab/change/update data
    that then guards various critical regions. More like a multi-
    valued version of LDL-STC.

    I don't see how constructing the primitive that way is terribly
    useful compared to just issuing multiple load-locked/store-cond
    instructions; in particular, it doesn't seem like the sort of
    thing that composes well: every sequence that uses it must be
    coded to use it, explicitly, instead of just nesting calls to
    "lock".

    The point wasn't to ask whether
    the specific problem could be recast in a different way; it
    probably can. The question was to ask whether your proposal
    could support that specific problem. I don't see how it can.

    To the question posed, you should just grab the lock/mutex.

    That's fine, then. But this is another example of a place where
    one probably wants to disable interrupts while in the critical
    section, to avoid being preempted while holding the lock.

    IOW, you'd have to somehow move the list-walk outside of the atomic
    section.

    Yes,

    E.g. if you know elements are only ever removed from the tail of the
    list, you can presumably walk the list outside of the ATOMIC stuff and
    then do the "remove from the queue and insert into the list" atomically
    since it's a small&finite number of steps (with a test to loop back to
    the list-walk if someone else inserted or removed anything at that point
    in the mean time).
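
    A generic C11 sketch of that pattern, using a single CAS to stand in for
    the small atomic commit (simplified: it assumes no concurrent removals,
    so no ABA handling):

        #include <stdatomic.h>

        struct node { int key; _Atomic(struct node *) next; };

        /* Insert n into an ascending list with a sentinel head node. */
        void ordered_insert(struct node *head, struct node *n)
        {
            for (;;) {
                struct node *prev = head;
                struct node *cur  = atomic_load(&prev->next);
                while (cur && cur->key < n->key) {      /* unbounded search, outside the commit */
                    prev = cur;
                    cur  = atomic_load(&cur->next);
                }
                n->next = cur;
                /* small, bounded commit; loop back to the search if someone interfered */
                if (atomic_compare_exchange_strong(&prev->next, &cur, n))
                    return;
            }
        }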

    Or maintain a tail pointer.

    There's no guarantee that the tail is the correct place to
    insert the element, vis the order of the elements.

    If the element pointer updates and the tail pointer update are inside
    one of my ATOMIC events, then you are sure that the tail pointer is
    always pointing at the last element in the list (if there is one).

    But that's irrelevant. The goal is to insert into a sorted list
    and, again, the last element is not necessarily the correct
    place to insert (or append).

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Tue Mar 25 01:53:00 2025
    On Mon, 24 Mar 2025 23:45:49 +0000, Dan Cross wrote:

    In article <a86295425dfb043a1b44e550a1c05659@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 24 Mar 2025 20:11:46 +0000, Dan Cross wrote:

    In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    -------------------
    This is why I mentioned the terminology thing: threads do not hold
    spinlocks, they hold mutexes.

    See above. Threads can certainly "hold" a spin lock, as they
    can hold any kind of lock. To quote from sec 7.6.1 of [Val96],
    page 202:

    |On a uniprocessor, if a thread tries to acquire a spin lock
    |that is already held, it will loop forever. Multiprocessor
    |algorithms, however, must operate correctly regardless of the
    |number of processors, which means that they should handle the
    |uniprocessor case as well. This requires strict adherence to
    |the rule that threads not relinquish control of the CPU while
    |holding a spin lock.

    My 66000 ATOMIC stuff is designed such that if control has to leave
    the ATOMIC event, all HW knowledge of being in that event disappears
    and IP is modified (prior to control transfer) such that upon return
    SW knows the event failed (or never started), and that no update of
    participating data in the event ever becomes visible.

    HW can do this with address-based events, but not with data-based
    events. LDL-STC are address-based events used to manipulate data-
    based events. Mine just allow more LDs and STs in the event than 1.

    So, if you want the property whereby the lock disappears on any
    control transfer out of the event {exception, interrupt, SVC, SVR, ...};
    then you want to use my ATOMIC stuff; otherwise, you can use the
    normal ATOMIC primitives everyone and his brother provide.

    I definitely do Not want that property. Having a lock
    "disappear" on delivery of an interrupt seems like it would be a
    great way to introduce silent corruption.

    It means that if a thread owning a lock is interrupted, that
    thread gives the lock back--so there is no priority inversion
    and no lock-starvation. Associated with the give-back, the
    IP of the interrupted thread no longer points inside the
    critical section.

    Nothing gets lost.

    {{Perhaps disappear was a poorly chosen word.}}

    If your atomic primitives are not sufficient to cover an entire
    critical section, as I believe we have established, and are
    mostly intended to be used for building concurrency primitives
    like locks, then I expect that software would take the lock, see
    that it succeeded, and enter the critical section.

    If at that
    point the lock were silently discarded on receipt of an
    interrupt,

    At this point your thread is not IN the critical section !!

    The closest you can be is at the first instruction beginning
    an ATOMIC event (the never happened case) or you can also
    be at the point where you KNOW that some other thread interfered
    with your actions, and that they saw a completely consistent
    state--now it is your turn (again) observing a completely
    consistent but new state.

    then potentially some other thread could subsequently
    take the lock,

    Since you are not in the critical section, you don't care.
    No critical data has been modified, there is no inconsistent
    state to be observed; and YOUR Instruction Pointer is not
    pointing into instructions within the ATOMIC event !!

    enter the critical section, and observe a data
    structure in an inconsistent state (leading to a crash in the
    best case).

    There is no inconsistent state--in rather the same way a core backs
    up a branch mispredict, the core backs up the ATOMIC event to appear
    as if it never even started !!

    This is precisely the sort of thing we build mutual
    exclusion primitives to wrap around critical sections to avoid.

    And also why those primitives give rise to priority inversion
    and lock-starvation.

    If, on the other hand, the atomic primitives are just meant for
    the spin loop, then I don't really see how that's all that
    useful compared to LL/SC.

    Any number of respected people have asked for a pipelined LL-SC;
    where one can LL multiple datums and later SC multiple datums.
    This is what ESM can provide.
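
    As a sketch of what such a multi-location event might look like to
    software (esm_begin/esm_commit are hypothetical stand-ins, not My 66000's
    actual mnemonics or semantics):

        struct node { struct node *next; };

        extern void esm_begin(void);          /* hypothetical: start the event */
        extern int  esm_commit(void);         /* hypothetical: nonzero if it all became visible */

        /* Pop the queue head and push it onto a list, as one all-or-nothing event. */
        int move_head(struct node **queue, struct node **list)
        {
            struct node *n;
            esm_begin();
            n = *queue;                       /* participating load  */
            if (n) {
                *queue  = n->next;            /* participating store */
                n->next = *list;              /* participating load + store */
                *list   = n;                  /* participating store */
            }
            return esm_commit() && n != NULL; /* fails if interfered with or interrupted */
        }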

    And you still need some way to ensure
    that the critical section is not interrupted.

    Just a way to detect if it happened; and if it happens;
    means to back up to a point where it does not matter if
    interference happened "right there"--that is: outside of
    a critical section.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Mar 25 01:26:00 2025
    It means that if a thread owning a lock is interrupted, that
    thread gives the lock back--so there is no priority inversion
    and no lock-starvation. Associated with the give back, and the
    IP of the interrupted thread no longer points inside the
    critical section.

    I think the discussion is a bit confusing because we have 2 different
    notions of "critical section" at play, here:

    Mitch's ESM/ATOMIC provides a kind of middle-point between LL/SC and transactional memory, which handles only "very small" atomic sequences
    (IIUC there's no limit to the number of instructions, but there's
    a limit of 8 data cache lines touched).

    A "critical section" can refer to such an atomic sequence. But that can
    be used only for critical sections of limited size, so as you point out
    it can't really be composed. For this reason, as Mitch said, it's meant
    mostly as a tool to implement synchronization primitives (lock,
    semaphore, read/write lock, younameit), tho it seems sufficiently
    flexible that it should also be usable for some lock-free operations.

    A "critical section" can also refer to the arbitrary code protected by
    a lock, where the lock primitives would be presumably themselves
    implemented using ESM and would then each be made of their own "critical section".

    The small ATOMIC-style critical sections don't hold locks. Instead they
    use an optimistic-synchronization approach, which is why they don't
    suffer from priority inversion.

    The lock-based critical sections you can build on top of the ATOMIC-style critical sections, OTOH, are just your usual lock-based critical
    sections: the fact that they're built out of ESM primitives rather than
    out of test&set or LL/SC doesn't really make any difference: they can
    indeed suffer from priority inversion.
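
    In code form, the two levels look roughly like this (plain C11 CAS stands
    in for the ESM event; the names are illustrative only):

        #include <stdatomic.h>

        /* Level 1: the synchronization primitive itself.  Its tiny critical
           section is the hardware atomic event (approximated here by a CAS). */
        typedef struct { atomic_int owner; } mutex_t;      /* 0 = free */

        void mutex_lock(mutex_t *m, int self)
        {
            int expected = 0;
            while (!atomic_compare_exchange_weak(&m->owner, &expected, self))
                expected = 0;                               /* spin (or park the thread here) */
        }

        void mutex_unlock(mutex_t *m) { atomic_store(&m->owner, 0); }

        /* Level 2: the lock-based critical section built on top; arbitrary
           length, and it *can* suffer priority inversion regardless of how
           level 1 is built. */
        void update_shared(mutex_t *m, int self)
        {
            mutex_lock(m, self);
            /* ... arbitrary amount of work on shared state ... */
            mutex_unlock(m);
        }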


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Mar 18 18:51:50 2025
    On Tue, 18 Mar 2025 1:34:40 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:

    Interrupt latency is dependent on executing instructions,
    Lowest waste of power

    But note: In most cases, it already took the interrupt ~150 nanoseconds
    to arrive at the interrupt service port. 1 trip from device to DRAM
    (possibly serviced by L3), 1 trip from DRAM back to device, 1 trip from
    device to interrupt service port; and 4 DRAM (or L3) accesses to log
    interrupt into table.

    How do you handle timer interrupts? Those are generally not
    message based, but rather internal signals within the core that
    raise an interrupt (perhaps coordinating with the interrupt controller
    for priority, security state, and enable bit). These are generally
    wired (and the enables/state/priority are flops) for low latency.

    Wired interrupts are converted into MSI-X interrupts under control
    of INTk->MSI-X control registers in HostBridge. Then these
    smell just like any other interrupt and take similar paths.

    There will generally be at least one general purpose timer per
    core plus any watchdog timers.

    There are 8 performance counting "things" per major block on die.
    major blocks include {core[*], L2[*], interconnect, L3[*], DRAM[*],
    config[*] and MMI/O[*]}.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Tue Mar 18 14:51:57 2025
    Robert Finch wrote:
    On 2025-03-17 2:33 p.m., EricP wrote:

    Another problem is what does the core do with the in flight instructions.


    Method 2 pipelines the switch by injecting the interrupt request at
    Fetch.
    Decode converts the request to a special uOp that travels down the IQ
    to Retire and allows all the older work to complete.
    This is more complex as it requires a two phase hand-off from the
    Interrupt Control Unit (ICU) to the core as a branch mispredict in the
    in flight instructions might cause a tentative interrupt acceptance to
    later
    be withdrawn.

    Oh my, I forgot.
    Method 2 will be used as it is desired to be able to insert either an instruction at the fetch stage or a vector address.

    The ICU believes the core is in a state to accept a higher priority
    interrupt. It sends a request to core, which checks its current state and
    sends back an immediate INT_ACK if it _might_ accept and stalls Fetch, or
    a NAK.
    When the special uOp reaches Retire, it sends a signal to Fetch which
    then sends an INT_ACCEPT signal to ICU to complete the handoff.
    If a branch mispredict occurs that causes interrupts to be disabled,
    then Fetch sends an INT_REJECT to ICU, and unstalls its fetching.
    (Yes that is not optimal - make it work first, make it work well second.)

    This also raises a question about what the ICU is doing during this
    long latency handoff. One wouldn't want ICU to sit idle so it might
    have to manage the handoff of multiple interrupts to multiple cores
    at the same time, each as its own little state machine.

    One should see that this decision on how the core handles the
    handoff has a large impact on the design complexity of the ICU.

    Personally, I would use method 1 first to get something working
    then if this is OoO think about getting fancy with method 2.


    Would it not be rare for this to occur? If so, I think the pipeline
    could just be flushed from the int disabling instruction on. Then the interrupted address recorded as the int disabling instruction. <- this requires identification of an int disabling instruction though. It might
    just be a store to a specific I/O address, therefore a special store
    opcode to make it easy to detect?

    It likely is rare but the possibility that the existing instructions,
    a branch mispredict or an exception throw, could retroactively change
    the state of the interrupt enable flag to disabled creates this
    Window of Uncertainty (WoU) that is the length of the pipeline.

    Unless one can find a way to make WoU go away the ICU needs to deal
    with the possibility that there could be hundreds of clocks between
    when ICU asks a core if it can accept an interrupt and when the core does.

    I had a few ideas walking in the park yesterday. I had been assuming that
    when an exception is delivered it would also automatically disable interrupts. That may not be a good idea as it means that any instruction
    that can throw an exception could cause this retroactive disabling.

    If I remove that exception side effect then the only things that can
    manipulate the core's master Interrupt Enable flag are the enable INTEN
    and disable INTDI instructions, and the actual delivery of an interrupt.
    And those three things can be tracked with a counter in Decode as they
    move down the pipeline to Retire. If the counter >0 then there is a
    flag change in flight and we don't accept a new interrupt. When count == 0
    and Interrupt Enable flag == 1 then core accepts interrupts, which are
    injected at Decode and cause a light weight purge of the Fetch buffer.
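
    A toy C model of that bookkeeping (behavioural sketch only, not RTL):

        /* inflight counts INTEN/INTDI/interrupt-delivery uops between Decode and Retire */
        struct core_model {
            int inflight;     /* incremented at Decode, decremented at Retire */
            int ie;           /* architected Interrupt Enable flag, updated at Retire */
        };

        static int may_accept_interrupt(const struct core_model *c)
        {
            /* accept only when no IE-changing op is in flight and interrupts are enabled */
            return c->inflight == 0 && c->ie == 1;
        }

        static void decode_ie_op(struct core_model *c) { c->inflight++; }

        static void retire_ie_op(struct core_model *c, int new_ie)
        {
            c->ie = new_ie;
            c->inflight--;
        }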

    Another pipeline issue is Privileged Control Register (PCR) reads and writes. PCR's don't behave like general registers in that they are not renamed,
    and in general operations cannot be replayed or performed out of order.
    They behave more like atomic stores that must be performed at Retire
    and must not be bypassed by any memory and other PCR ops. (This is why
    many of the x86/x64 MSR ops are serialized with a pipeline drain.)

    This affects what it can do with PCR inside the WoU if it causes a state
    change that cannot be undone, for example, reading the interrupt data FIFO.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Tue Mar 18 19:19:41 2025
    On Mon, 17 Mar 2025 21:49:18 +0000, MitchAlsup1 wrote:

    On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:

    Michael S wrote:
    On Mon, 17 Mar 2025 13:38:12 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:
    ------------------

    The problem Robert is talking about arises when there are many
    interrupt source and many target CPUs.
    The required routing/prioritization/acknowledgment logic (at least a
    naive logic I am having in mind) would be either non-scalable or
    relatively complicated. Process of selection for the second case will
    take multiple cycles (I am thinking about ring).

    Another problem is what does the core do with the in flight
    instructions.

    Method 1 is simplest, it injects the interrupt request at Retire
    as that's where the state of everything is synchronized.
    The consequence is that, like exceptions, the in flight instructions all
    get purged, and we save the committed RIP, RSP and interrupt control
    word.
    While that might be acceptable for a 5 stage in-order pipeline,
    it could be pretty expensive for an OoO 200+ instruction queue
    potentially tossing hundreds of cycles of near finished work.

    Lowest interrupt Latency
    Highest waste of power (i.e., work)

    Method 2 pipelines the switch by injecting the interrupt request at
    Fetch.
    Decode converts the request to a special uOp that travels down the IQ
    to Retire and allows all the older work to complete.
    This is more complex as it requires a two phase hand-off from the
    Interrupt Control Unit (ICU) to the core as a branch mispredict in the
    in flight instructions might cause a tentative interrupt acceptance to
    later be withdrawn.

    Interrupt latency is dependent on executing instructions,
    Lowest waste of power
    Highest chance of mucking it all up

    But note: In most cases, it already took the interrupt ~150 nanoseconds
    to arrive at the interrupt service port. 1 trip from device to DRAM
    (possibly serviced by L3), 1 trip from DRAM back to device, 1 tip from
    device to interrupt service port; and 4 DRAM (or L3) accesses to log interrupt into table.

    Also, in most cases, the 200-odd instructions in the window will finish
    in 100-cycles or as little as 20ns--but if the FDIV unit is saturated, interrupt latency could be as high as 640 cycles and as long as 640ns.

    There is the other issue adding complexity to method 2,
    The instructions in the window may raise exceptions.

    The general rule is that the instruction raising the first exception
    has its address in IP, and then the retire logic has to flush subseq-
    uent instructions until it arrives at the first instruction of the
    Interrupt dispatcher.

    The exception will be handled when control returns from interruption.


    The ICU believes the core is in a state to accept a higher priority
    interrupt. It sends a request to core, which checks its current state
    and
    sends back an immediate INT_ACK if it _might_ accept and stalls Fetch, or a
    NAK.

    In My 66000, ICU knows nothing about the priority level (or state)
    of any core in the system. Instead, when a new higher priority
    interrupt is raised, the ISP broadcasts a 64-bit mask indicating
    which priority levels in the interrupt table have pending inter-
    rupts with an MMI/O message to the address of the interrupt table.

    All cores monitoring that interrupt table capture the broadcast,
    and each core decides to negotiate for an (not that) interrupt
    by requesting the highest priority interrupt from the table.

    When the request returns, and it is still at a higher priority
    than the core is running, core performs interrupt control transfer.
    If the interrupt is below the core's priority it is returned to
    ISP as if NAKed.

    Prior to interrupt control transfer, core remains running what-
    ever it was running--and all the interrupt stuff is done by state
    machines at the edge of the core and the L3/DRAM controller.
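
    A rough C model of the core-side negotiator just described; the accessor
    functions and the message format are placeholders, not the real interface:

        #include <stdint.h>

        extern uint64_t request_interrupt(uint64_t table, int level); /* placeholder: ask ISP for top entry */
        extern void     nak_interrupt(uint64_t table, uint64_t msg);  /* placeholder: hand it back */
        extern void     take_interrupt(uint64_t msg);                 /* placeholder: control transfer */
        extern int      current_priority(void);                       /* placeholder: core's running priority */

        void on_pending_broadcast(uint64_t table, uint64_t pending)
        {
            int p = current_priority();
            /* only levels strictly above the running priority are of interest */
            uint64_t eligible = (p >= 63) ? 0 : pending & (~0ull << (p + 1));
            if (!eligible)
                return;
            int level = 63 - __builtin_clzll(eligible);   /* highest pending level */
            uint64_t msg = request_interrupt(table, level);
            if (level > current_priority())               /* re-check: priority may have risen meanwhile */
                take_interrupt(msg);
            else
                nak_interrupt(table, msg);                /* too low by now; return it to the ISP */
        }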

    When the special uOp reaches Retire, it sends a signal to Fetch which
    then sends an INT_ACCEPT signal to ICU to complete the handoff.
    If a branch mispredict occurs that causes interrupts to be disabled,
    then Fetch sends an INT_REJECT to ICU, and unstalls its fetching.
    (Yes that is not optimal - make it work first, make it work well
    second.)

    This also raises a question about what the ICU is doing during this
    long latency handoff. One wouldn't want ICU to sit idle so it might
    have to manage the handoff of multiple interrupts to multiple cores
    at the same time, each as its own little state machine.

    One must assume that ISP is capable of taking a new interrupt
    from a device every 5-ish cycles and interrupt handoff is in the
    range of 50 cycles, and that each interrupt could be to a different
    interrupt table.

    My 66000 ISP treats successive requests to any one table as strongly
    ordered, and requests to different tables as completely unordered.

    One should see that this decision on how the core handles the
    handoff has a large impact on the design complexity of the ICU.

    I did not "see" that in My 66000's interrupt architecture. The ISP
    complexity is fixed, and the core's interrupt negotiator is a small
    state machine (~10-states).

    ISP essentially performs 4-5 64-bit memory accesses, and possibly
    1 MMI/O 64-bit broadcast on arrival of MSI-X interrupt. Then if
    a core negotiates, it performs 3 more memory accesses per negotiator.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Tue Mar 18 22:49:26 2025
    On Tue, 18 Mar 2025 18:51:57 +0000, EricP wrote:

    Robert Finch wrote:
    On 2025-03-17 2:33 p.m., EricP wrote:
    ----------------
    This also raises a question about what the ICU is doing during this
    long latency handoff. One wouldn't want ICU to sit idle so it might
    have to manage the handoff of multiple interrupts to multiple cores
    at the same time, each as its own little state machine.

    One should see that this decision on how the core handles the
    handoff has a large impact on the design complexity of the ICU.

    Personally, I would use method 1 first to get something working
    then if this is OoO think about getting fancy with method 2.


    Would it not be rare for this to occur? If so, I think the pipeline
    could just be flushed from the int disabling instruction on. Then the
    interrupted address recorded as the int disabling instruction. <- this
    requires identification of an int disabling instruction though. It might
    just be a store to a specific I/O address, therefore a special store
    opcode to make it easy to detect?

    It likely is rare but the possibility that the existing instructions,

    Which is rare:
    executing 200 instructions without an exception ?
    OR
    executing 200 instructions without a mispredict ?

    The former is fairly rare, the latter is not so rare.

    a branch mispredict or an exception throw, could retroactively change
    the state of the interrupt enable flag to disabled creates this
    Window of Uncertainty (WoU) that is the length of the pipeline.

    All the more reasons that the very vast majority of interrupts and
    exceptions do not have to disable more of their kind or their
    opposite.

    In My 66000 interrupt architecture, the very first instruction in the
    interrupt dispatcher can take a higher priority interrupt-----the
    page fault handler in Guest OS can take a page fault in HyperVisor
    MMU tables, without any special precautions or code sequences.

    Unless one can find a way to make WoU go away the ICU needs to deal
    with the possibility that there could be hundreds of clocks between
    when ICU asks a core if it can accept an interrupt and when the core
    does.

    The window will always exist--so, the best one can do is eliminate
    the core's sensitivity to the WoU.

    I had a few ideas walking in the park yesterday. I had been assuming
    that
    when an exception is delivered that it would also automatically disable interrupts. That may not be a good idea as it means that any instruction
    that can throw an exception could cause this retroactive disabling.

    My 66000 simply defines that this cannot happen::
    Exceptions do not alter the behavior of interrupts;
    Interrupts do not alter the behavior of exceptions.

    If I remove that exception side effect then the only things that can manipulate the core's master Interrupt Enable flag are the enable INTEN
    and disable INTDI instructions, and the actual delivery of an interrupt.

    Let's postulate we have a core running at low priority and its
    instruction
    stream raises an exception. That exception delivers control at that same
    low priority to the exception handler.

    Do you want any of this to slow down a high priority interrupt ??
    even by 20-50 cycles ??

    I submit the answer is:: NO !

    And those three things can be tracked with a counter in Decode as they
    move down the pipeline to Retire. If the counter >0 then there is a
    flag change in flight and we don't accept a new interrupt. When count ==
    0
    and Interrupt Enable flag == 1 then core accepts interrupts, which are injected at Decode and cause a light weight purge of the Fetch buffer.

    Once you start adding mechanisms to track this and that and to deal
    with different this and that--you should receive an indication that
    you are not on the correct path.

    {{Hint: make the problem go away for all time}}

    Another pipeline issue is Privileged Control Register (PCR) reads and
    writes.
    PCR's don't behave like general registers in that they are not renamed,
    and in general operations cannot be replayed or performed out of order.

    While I have a way to read and write control registers with varying
    privilege, a thread at privilege[k] can only read and write his own
    CRs and those of the core on which it is currently running.

    If you want to read/write control register data for lower privilege
    threads, you can access the pointer in core.CR and then access
    thread-state directly in memory. By the time control returns to the
    low privilege level, all CRs will have been made resident in core.

    That is the per thread Program-Status-Line is treated like a write
    back cache.

    They behave more like atomic stores that must be performed at Retire
    and must not be bypassed by any memory and other PCR ops. (This is why
    many of the x86/x64 MSR ops are serialized with a pipeline drain.)

    By converting them to real memory accesses, the problem solves itself.

    This affects what it can do with PCR inside the WoU if it causes a state change that cannot be undone, for example, reading the interrupt data
    FIFO.

    Since the device is accessed through MMI/O space, and since My 66000
    demands sequential consistency for MMI/O accesses, the LD of the
    interrupt data FIFO cannot leave the confines of the core until
    all pending memory references from the interrupted thread have
    "been performed".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Tue Mar 18 23:00:56 2025
    On Tue, 18 Mar 2025 21:59:51 +0000, Robert Finch wrote:

    On 2025-03-18 2:51 p.m., EricP wrote:
    ----------------
    Would it not be rare for this to occur? If so, I think the pipeline
    could just be flushed from the int disabling instruction on. Then the
    interrupted address recorded as the int disabling instruction. <- this
    requires identification of an int disabling instruction though. It
    might just be a store to a specific I/O address, therefore a special
    store opcode to make it easy to detect?

    It likely is rare but the possibility that the existing instructions,
    a branch mispredict or an exception throw, could retroactively change
    the state of the interrupt enable flag to disabled creates this
    Window of Uncertainty (WoU) that is the length of the pipeline.

    Unless one can find a way to make WoU go away the ICU needs to deal
    with the possibility that there could be hundreds of clocks between
    when ICU asks a core if it can accept an interrupt and when the core
    does.

    I had a few ideas walking in the park yesterday. I had been assuming
    that
    when an exception is delivered that it would also automatically disable
    interrupts. That may not be a good idea as it means that any instruction
    that can throw an exception could cause this retroactive disabling.

    If I remove that exception side effect then the only things that can
    manipulate the core's master Interrupt Enable flag are the enable INTEN
    and disable INTDI instructions, and the actual delivery of an interrupt.
    And those three things can be tracked with a counter in Decode as they
    move down the pipeline to Retire. If the counter >0 then there is a
    flag change in flight and we don't accept a new interrupt. When count ==
    0
    and Interrupt Enable flag == 1 then core accepts interrupts, which are
    injected at Decode and cause a light weight purge of the Fetch buffer.

    Seems like it could work. There are difficulties using counters with the
    Q+ ISA as multiple interrupt enable bits could be manipulated at the
    same time. One interrupt may be enabled and another one disabled by the
    same instruction.

    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    When running at any privilege, do you not want to accept an interrupt
    at higher priority with as little delay as possible ??

    Does Interrupt Disable help in this goal ??

    Is there a SANE way to avoid disabling of interrupts ??

    So, what does it take to take (sic) interrupts without disabling
    interrupts ?? It takes the interrupt control transfer sequence
    to leave the core in an interruptible state ! {{Same for exceptions}}

    My 66000 interrupt negotiator will not negotiate for interrupts
    at the same or lower priority the core is currently running.
    This is done by using the 6-bit priority as a less-than mask
    generator across the current-raised 64-bit control register.
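
    For illustration, that mask generation is just a shift (C, assuming a
    6-bit priority p and a 64-bit current-raised register):

        #include <stdint.h>

        /* bit k is set for every priority level k strictly greater than p */
        static inline uint64_t above_mask(unsigned p)          /* p in 0..63 */
        {
            return (p >= 63) ? 0 : (~0ull << (p + 1));
        }

        /* negotiate only if something is raised strictly above the running priority */
        static inline int should_negotiate(uint64_t raised, unsigned p)
        {
            return (raised & above_mask(p)) != 0;
        }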

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Wed Mar 19 01:22:02 2025
    On Wed, 19 Mar 2025 0:39:04 +0000, Robert Finch wrote:

    On 2025-03-18 6:49 p.m., MitchAlsup1 wrote:
    <snip>

    I am not following this. I think there needs to be some way to disable
    interrupts that should not be occurring. I think a diagram would help.

    +-------+--------+-----+--------+
    | fetch | decode | ... | retire |
    +-------+--------+-----+--------+
    | IRQ^  |  ....  | ... | INTDI  |
    +-------+--------+-----+--------+

    These two are happening at the same time.
    The INTDI (interrupt disable) should effectively
    cause the IRQ to be ignored.
    Instructions that are fetched after the IRQ
    are relying on interrupts disabled.

    a) there is a list of at least 7 places where interrupts can be disabled
    between the device[j] and core[k]
    b) what I am addressing is disablement at the core.
    c) and yes, I do have a way of disabling interrupts, but it is generally
    reserved for nasty things like a system panic.
    d) the typical "I'm done with req[k]" needs no interrupt disable.
    e) certain architectures can choose to disable interrupts for a short
    while
    f) others can architect a system that does not need disablement.

    Now, back to your ASCII-art::

    If there is a interrupt disable instruction (or side effect) in the
    pipeline, as seen at DECODE, the disable has already happened, and
    the pipeline is just waiting for the event to become manifest. This
    is that window of uncertainty.

    Suppose there is an interrupt disable instruction in the pipeline that
    has not been performed yet

    As seen at DECODE it has already happened;
    as seen at RETIRE it remains pending.

    Different ends of the pipeline are allowed/required to know things
    differently.

    (they are not done until the retire stage in
    Q+). So, interrupts are enabled when detected at an earlier pipeline
    stage (fetch). It turns out that the interrupt should not have been
    fetched from the interrupt queue as interrupts are about to be disabled
    by an earlier instruction.

    This is that tangled web some weave.

    I think this matters because of the purpose
    of disabling the interrupts. There could be other instructions after the
    int disable including yet to be fetched instructions relying on the fact
    that interrupts are disabled.

    If one bothers to disable the core from seeing interrupts in My 66000,
    the disable happens in program order, and is made to appear as if in
    program order.

    Likewise, if one enables core interrupts, the instruction following the
    enable can take an interrupt (before it executes).

    But you seem to be missing the point--it is possible (and I would argue beneficial) to architect your interrupts such that interrupt disable is
    not necessary >98% of the time. You do it when required, you don't when
    you can get away without it.

    Completion interrupts ("I'm done with blah[k]") should not need DIs.

    If the interrupt did not affect other aspects of the instruction being interrupted (it maybe just sets an interrupt bit in the pipeline
    register), I think it might be possible to handle this by having an
    interrupt enable flag in the pipeline registers.

    If you head in that direction, you need to transfer control to the ISR
    at the front of the pipeline, so that a mispredicted branch backs-up
    to the 1st instruction in ISR, and then throw the window away with the
    normal logic, leaving IP pointing at the branch or branch target.

    That way when the
    interrupt is disabled, all the interrupt enable flags in the pipeline registers could be cleared. When the disabled interrupt reaches retire,
    it would have to be logged (queued) so that it can be executed when interrupts are re-enabled.

    It is for all these reasons that I counsel you to devise a system where the
    typical interrupt needs to disable nothing.

    It gets worse with branches involved.

    Which is why you want greater orthogonality wrt interrupts and IE-bit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to Chris M. Thomasson on Tue Mar 25 11:14:35 2025
    In article <vrt2sb$27c13$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 3/24/2025 1:14 PM, Dan Cross wrote:
    In article <vrsaro$1faj3$9@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 3/24/2025 11:49 AM, Dan Cross wrote:
    [snip]
    I'm sure one could do a lockless pop using a cmp exchange, but I
    wanted a scenario where one would want to hold onto both
    locks throughout the critical section for some reason.
    I just don't see how this works in the proposed scenario.

    You want to hold lock A while lock B is being used? That can be tricky.

    Really? This happens all the time. Of course it can be tricky,
    as concurrent programming always is, but it's incredibly normal.

    Locking order comes into play here. Uggg... I have seen nightmare code
    that was using recursion as well... It deadlocked at a certain depth,
    only under the right conditions. The locking order was not properly
    respected...

    Well, of course it does. But it's still something that people
    do all the time.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to Chris M. Thomasson on Tue Mar 25 11:17:58 2025
    In article <vrt323$27c13$3@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 3/24/2025 7:02 PM, Chris M. Thomasson wrote:
    On 3/24/2025 1:14 PM, Dan Cross wrote:
    In article <vrsaro$1faj3$9@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 3/24/2025 11:49 AM, Dan Cross wrote:
    [snip]
    I'm sure one could do a lockless pop using a cmp exchange, but I
    wanted a scenario where one would want to hold onto both
    locks throughout the critical section for some reason.
    I just don't see how this works in the proposed scenario.

    You want to hold lock A while lock B is being used? That can be tricky.
    Really?  This happens all the time.  Of course it can be tricky,
    as concurrent programming always is, but it's incredibly normal.

    Locking order comes into play here. Uggg... I have seen nightmare code
    that was using recursion as well... It deadlocked at a certain depth,
    only under the right conditions. The locking order was not properly
    respected...

    A thread would take mtxA, then mtxB, while another would take mtxB, then
    mtxA. Deadlock city is going to be visited rather soon...

    Yeah, so don't do that. :-)

    I can whack myself in the head with a ballpeen hammer, too. It
    is better if I do not.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to Chris M. Thomasson on Tue Mar 25 11:20:39 2025
    In article <vrt37r$27c13$4@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 3/24/2025 1:28 PM, Dan Cross wrote:
    In article <96987d473656f296ec63c30626a46730@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 24 Mar 2025 18:02:25 +0000, Stefan Monnier wrote:

    MitchAlsup1 [2025-03-24 17:36:28] wrote:
    On Mon, 24 Mar 2025 12:37:13 +0000, Dan Cross wrote:
    Consider something as simple as popping an item off of the front
    of a queue and inserting it into an ordered singly-linked list:
    [...]
    I have almost that in the documentation for my ATOMIC stuff::
    The following code moves a doubly linked element from one place
    in a concurrent data structure to another without any interested
    3rd party being able to see that the element is now within the
    CDS at any time::

    Would your "ATOMIC stuff" really handle Dan's example, tho?
    A key element in Dan's example is the "ordered singly-linked list",
    i.e. you need to walk the list an unbounded number of steps until you
    can find the insertion point, so I don't think you can do the whole
    thing inside your ATOMIC stuff.

    Technically, yes, you can put a search loop in an ATOMIC event.
    Whether you want to or not is a different story. My example
    illustrated the ATOMIC manipulation of the event without a search.

    In addition, if the search is long, the code developer should either
    do the search before the event, or grab a Mutex to prevent
    interference during the search and rest of the critical region.

    The problem, as specified, excluded this idea that the list was
    long.

    The point was simply to show a scenario where your atomic
    operation proposal breaks down, and where you might want to
    disable interrupt delivery. The point wasn't to ask whether
    the specific problem could be recast in a different way; it
    probably can. The question was to ask whether your proposal
    could support that specific problem. I don't see how it can.

    IOW, you'd have to somehow move the list-walk outside of the atomic
    section.

    Yes,

    E.g. if you know elements are only ever removed from the tail of the
    list, you can presumably walk the list outside of the ATOMIC stuff and
    then do the "remove from the queue and insert into the list" atomically
    since it's a small&finite number of steps (with a test to loop back to
    the list-walk if someone else inserted or removed anything at that point
    in the mean time).

    Or maintain a tail pointer.

    There's no guarantee that the tail is the correct place to
    insert the element, vis the order of the elements.

    https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html

    If you go back and review the thread, you'd see that the problem
    is to remove an element from a queue, and insert it into an
    ordered list. This part of the discussion is concerning the
    latter operation.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Tue Mar 25 12:23:33 2025
    In article <e0080df32c6f88ebbd4c8828f98f76a9@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 24 Mar 2025 23:45:49 +0000, Dan Cross wrote:
    In article <a86295425dfb043a1b44e550a1c05659@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    [snip]
    My 66000 ATOMIC stuff is designed such that if control has to leave
    the ATOMIC event, all HW knowledge of being in that event disappears
    and IP is modified (prior to control transfer) such that upon return
    SW knows the event failed (or never started), and that no update of
    participating data in the event ever becomes visible.

    HW can do this with address-based events, but not with data-based
    events. LDL-STC are address-based events used to manipulate data-
    based events. Mine just allow more LDs and STs in the event than 1.

    So, if you want the property whereby the lock disappears on any
    control transfer out of the event {exception, interrupt, SVC, SVR, ...};
    then you want to use my ATOMIC stuff; otherwise, you can use the
    normal ATOMIC primitives everyone and his brother provide.

    I definitely do Not want that property. Having a lock
    "disappear" on delivery of an interrupt seems like it would be a
    great way to introduce silent corruption.

    It means that if a thread owning a lock is interrupted, that
    thread gives the lock back--so there is no priority inversion
    and no lock-starvation. Associated with the give back, and the
    IP of the interrupted thread no longer points inside the
    critical section.

    Nothing gets lost.

    Perhaps we're not clear on what a "lock" is. Here, I'm
    referring to a mutual exclusion primitive that I use to protect
    entry into a critical section. You seem to be talking about
    something else.

    {{Perhaps disappear was a poorly chosen word.}}

    If your atomic primitives are not sufficient to cover an entire
    critical section, as I believe we have established, and are
    mostly intended to be used for building concurrency primitives
    like locks, then I expect that software would take the lock, see
    that it succeeded, and enter the critical section.

    If at that
    point the lock were silently discarded on receipt of an
    interrupt,

    At this point your thread is not IN the critical section !!

    Why not? I successfully acquired a lock, no? What else do you
    propose I must do to enter the critical section?

    The closest you can be is at the first instruction beginning
    an ATOMIC event (the never happened case) or you can also
    be at the point where you KNOW that some other thread interfered
    with your actions, and that they saw a completely consistent
    state--now it is your turn (again) observing a completely
    consistent but new state.

    Here's a bit of my earlier message that didn't get quoted in
    your reply:

    |If, on the other hand, the atomic primitives are just meant for
    |the spin loop, then I don't really see how that's all that
    |useful compared to LL/SC. And you still need some way to ensure
    |that the critical section is not interrupted.

    It seems like you're saying that the "ATOMIC event" is just for
    the spin loop.

    then potentially some other thread could subsequently
    take the lock,

    Since you are not in the critical section, you don't care.
    No critical data has been modified, there is no inconsistent
    state to be observed; and YOUR Instruction Pointer is not
    pointing into instructions within the ATOMIC event !!

    This seems to be predicated on something that you understand
    about this mechanism that you've proposed, but haven't quite
    explained (or perhaps that I haven't understood).

    So again, I ask my question, how do I get into the critical
    section? Given the text quoted above from my last reply, what
    is the utility of these atomic events vis that goal?

    It sounds like you're saying that I can use them to implement a
    traditional spin lock. But then how are they better than a
    "normal" atomic CAS? You mention respecting priority and in
    particular avoiding inversion, but I don't think that's actually
    true: consider some critical section, where I cannot yield the
    CPU to another thread, even one running at higher priority,
    since I'm manipulating some data structure and it is important
    that that data structure not be in an intermediate state when
    I get switched away from; this strongly implies that I use these
    atomic primitives to build the sort of lock where I set some
    _other_ datum, not the "lock" manipulated in the atomic event,
    to indicate that I'm in the critical section: but if that's the
    case, then what prevents some other, higher priority thread,
    from attempting to enter the critical section _when I'm in it_,
    but not when I'm racing on the lock?

    E.g., this sequence of events:

    Thread A (lo pri)                    Thread B (hi pri)
    --------                             --------
    Atomic lock
    Observe success
    Acquire mutex
    Unlock
                                         Atomic Lock
                                         Observe success
                                         Observe that mutex is already set
                                         Unlock
                                         Now what?

    enter the critical section, and observe a data
    structure in an inconsistent state (leading to a crash in the
    best case).

    There is no inconsistent state--in rather the same way a core backs
    up a branch mispredict, the core backs up the ATOMIC event to appear
    as if it never even started !!

    We've already established that this cannot cover all use cases.
    See my example of popping off of a queue and inserting into an
    ordered list.

    This is precisely the sort of thing we build mutual
    exclusion primitives to wrap around critical sections to avoid.

    And also why those primitives give rise to priority inversion
    and lock-starvation.

    But this mechanism is insufficient to actually address those
    problems in anything but the most limited ways. See the example
    above.

    Moreover, those are problems that are well understood and are
    successfully mitigated in real world systems. I do not see any
    evidence that they are sufficiently problematic supporting the
    introduction of a brand new mechanism with poorly understood
    behavior and what seems like very complex semantics.

    If, on the other hand, the atomic primitives are just meant for
    the spin loop, then I don't really see how that's all that
    useful compared to LL/SC.

    Any number of respected people have asked for a pipelined LL-SC;
    where one can LL multiple datums and later SC multiple datums.
    This is what ESM can provide.

    Who are these folks and have any of them chimed in on this
    proposal? It seems like you've gone far beyond "pipelined
    LL-SC" and introduced notions of threads and priorities that
    interact with that in complex and, frankly, undesirable ways.

    And you still need some way to ensure
    that the critical section is not interrupted.

    Just a way to detect if it happened; and if it happens;
    means to back up to a point where it does not matter if
    interference happened "right there"--that is: outside of
    a critical section.

    This is based on the fallacious assumption that that is possible
    in the general case. I think I have shown that it is not.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Dan Cross on Tue Mar 25 13:43:32 2025
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <684d32898ec85b91bcb9dcdb97d8065a@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    My ATOMIC stuff allows for the construction, in SW, of pretty
    much any ATOMIC primitive anyone would like. It is, in general,
    not suited to guard critical regions, but to grab the lock which
    does.

    Ok, that's good to understand.

    Most uses of My ATOMIC stuff is to grab/change/update data
    that then guards various critical regions. More like a multi-
    valued version of LDL-STC.

    I don't see how constructing the primitive that way is terribly
    useful compared to just issuing multiple load-locked/store-cond
    instructions; in particular, it doesn't seem like the sort of
    thing that composes well: every sequence that uses it must be
    coded to use it, explicitly, instead of just nesting calls to
    "lock".

    LL/SC has been shown to be less efficient for contended
    locks than atomic operations when the core count exceeds
    single digits.

    It's difficult to argue against the standard set of atomic
    primitives (the PCIe set: compare and swap, load and add,
    swap) for scalability; ARM has extended the list to
    include exclusive or, set-bit, clear-bit, min, max, et al.

    Analysis comparing with LL/SC:

    https://cpufun.substack.com/p/atomics-in-aarch64
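
    A one-liner of the sort of operation that article measures; with LSE
    (e.g. -march=armv8.1-a) a compiler can emit a single ldaddal for this,
    while the pre-LSE code is an ldaxr/stlxr retry loop:

        #include <stdatomic.h>

        long bump(_Atomic long *counter)
        {
            /* single far-atomic on LSE hardware; LL/SC retry loop otherwise */
            return atomic_fetch_add_explicit(counter, 1, memory_order_acq_rel);
        }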

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Mar 25 13:45:39 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 24 Mar 2025 23:45:49 +0000, Dan Cross wrote:

    So, if you want the property whereby the lock disappears on any
    control transfer out of the event {exception, interrupt, SVC, SVR, ...}; >>>then you want to use my ATOMIC stuff; otherwise, you can use the
    normal ATOMIC primitives everyone and his brother provide.

    I definitely do Not want that property. Having a lock
    "disappear" on delivery of an interrupt seems like it would be a
    great way to introduce silent corruption.

    It means that if a thread owning a lock is interrupted, that
    thread gives the lock back--so there is no priority inversion
    and no lock-starvation. Associated with the give back, and the
    IP of the interrupted thread no longer points inside the
    critical section.

    Nothing gets lost.

    This isn't a hardware transaction. If the thread has updated data
    and you take away the lock, the system is now inconsistent
    and nasal daemons will ensue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Mar 25 15:01:22 2025
    On Tue, 25 Mar 2025 13:45:39 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    Nothing gets lost.

    This isn't a hardware transaction. If the thread has updated data
    and you take away the lock, the system is now inconsistent
    and nasal daemons will ensue.

    In HW there is a little thing known as branch misprediction repair,
    where all the work done is made to appear as if those instructions
    were never executed to begin with. The size of the branch mispredict
    window is on the order of 300 instructions now.

    With Spectre and Meltdown even cache updates must be delayed until
    after the causing instruction retires.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to Scott Lurndal on Tue Mar 25 15:08:18 2025
    In article <8qyEP.1257951$FVcd.630606@fx10.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <684d32898ec85b91bcb9dcdb97d8065a@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    My ATOMIC stuff allows for the construction, in SW, of pretty
    much any ATOMIC primitive anyone would like. It is, in general,
    not suited to guard critical regions, but to grab the lock which
    does.

    Ok, that's good to understand.

    Most uses of My ATOMIC stuff is to grab/change/update data
    that then guards various critical regions. More like a multi-
    valued version of LDL-STC.

    I don't see how constructing the primitive that way is terribly
    useful compared to just issuing multiple load-locked/store-cond
    instructions; in particular, it doesn't seem like the sort of
    thing that composes well: every sequence that uses it must be
    coded to use it, explicitly, instead of just nesting calls to
    "lock".

    LL/SC has been shown to be less efficient for contended
    locks than atomic operations when the core count exceeds
    single digits.

    It's difficult to argue against the standard set of atomic
    primitives (the PCIe set: compare and swap, load and add,
    swap) for scalability; ARM has extended the list to
    include exclusive or, set-bit, clear-bit, min, max, et al.

    Analysis comparing with LL/SC:

    https://cpufun.substack.com/p/atomics-in-aarch64

    That analysis is interesting; thanks for pointing it out. I'm
    not sure that it is directly applicable to locks, though, since
    it's primarily concerned with atomic increment. Semaphores,
    perhaps; something like `ldaddal` would work well in the
    implementation of `xadd`, as in this paper by Mullender and Cox: https://swtch.com/semaphore.pdf
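
    In the same spirit, a ticket lock is the classic fetch-add construction
    (illustrative only, not the code from that paper):

        #include <stdatomic.h>

        /* The xadd-style primitive hands out tickets fairly, so waiters do not
           all hammer the same CAS. */
        typedef struct {
            atomic_uint next;     /* next ticket to hand out */
            atomic_uint serving;  /* ticket currently being served */
        } ticket_lock_t;

        void ticket_lock(ticket_lock_t *l)
        {
            unsigned my = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
            while (atomic_load_explicit(&l->serving, memory_order_acquire) != my)
                ;  /* spin */
        }

        void ticket_unlock(ticket_lock_t *l)
        {
            atomic_fetch_add_explicit(&l->serving, 1, memory_order_release);
        }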

    Highly contended spin locks based on CAS loops tend not to scale
    well as core counts increase, hence more elaborate lock types
    like MCS locks, CLH locks, and so on. Boyd-Wickizer et al go
    into this in detail in https://people.csail.mit.edu/nickolai/papers/boyd-wickizer-locks.pdf
    I rather doubt that single-instruction atomics versus LL-SC are
    a going to make a big difference here. But I also wonder at the
    internal implementation of these instructions in e.g. that
    Fujitsu chip that allows throughput to stay more or less
    constant as the number of cores increases; I'd think that cache
    coherency traffic as threads contend for a single cache line
    would create a bottleneck leading to (at least) a linear
    slow down. This suggests to me that perhaps the performance
    characteristics are less in the specific instructions used (one
    versus LL-SC), but rather how those instructions are
    implemented with respect to the coherency protocol.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Tue Mar 25 15:10:19 2025
    In article <b201b9212700f9359a831e03b1d60cef@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 25 Mar 2025 13:45:39 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    Nothing gets lost.

    This isn't a hardware transaction. If the thread has updated data
    and you take away the lock, the system is now inconsistent
    and nasal daemons will ensue.

    In HW there is a little thing known as branch misprediction repair,
    where all the work done is made to appear as if those instructions
    were never executed to begin with. The size of the branch mispredict
    window is on the order of 300 instructions now.

    With Spectre and Meltdown even cache updates must be delayed until
    after the causing instruction retires.

    But again, that's irrelevant. The "critical sections" in
    question here are software abstractions created by system
    software. What you're describing is at the microarchitectural
    level; two totally different worlds. I think the fundamental
    mismatch here is that you're trying to apply this solution you
    have created for something at the uarch level to a problem at
    the systems software level. It's just not a good fit.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Dan Cross on Tue Mar 25 10:54:29 2025
    Dan Cross wrote:
    In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
    Sometimes you really don't want to be interrupted.
    And sometimes you don't want to be interrupted unless the
    "house is on fire"; I cannot see a time when "the house is
    on fire" that you would not want to take the interrupt.

    Is there one ?!?
    Consider a thread that takes a spinlock; suppose some
    high-priority interrupt comes in while the thread is holding
    that lock. In response to the interrupt, software decides to
    suspend the thread and switch some other thread; that thread
    wants to lock the spin lock that the now-descheduled thread is
    holding: a classic deadlock scenario.
    Terminology: mutexes coordinate mutual exclusion between threads,
    spinlocks coordinate mutual exclusion between cpu cores.
    Windows "critical sections" are mutexes with a fast path.

    A spin lock is simply any lock where you spin trying to acquire
    the lock, as opposed to a blocking synchronization protocol.
    Here I'm using the terminology of Herlihy and Shavit [Her08].

    I'm using terminology in common use on multiprocessor systems like
    VMS since 1980's and later WinNT.

    Each OS data structure has different reentrancy requirements.
    - Ones accessed by threads are guarded by mutexes.
    Example on Windows would be the page tables as paging is done by threads.
    - Ones accessed by OS software interrupt level are guarded by spinlocks.
    Example on Windows is the Scheduler tables.
    - Ones accessed by HW interrupts are guarded by interrupt spinlocks.
    Example on Windows is device interrupt service routines.

    Traditional Unix "sleep" and "wakeup" are an example of a
    blocking protocol, where a thread may "sleep" on some "channel",
    yielding the locking thread while the lock cannot be acquired,
    presumably scheduling something else until the thread is later
    marked runnable by virtue of something else calling "wakeup" on
    the sleep channel.

    But note that I am deliberately simplifying in order to
    construct a particular scenario in which a light-weight
    synchronization primitive is being used for mutual exclusion
    between concurrent entities, hence mentioning spinlocks
    specifically.

    Suggesting that spin locks are only applicable to
    multiprocessing scenarios is flatly incorrect.

    I didn't.
    I am saying the OS by design guards *ITS* different data structures
    with different mechanisms, and those mechanisms must only be used in
    a very controlled manner. And if you want to do something that requires
    access to one of *ITS* data structures then you must obey *ITS* rules.

    In Windows and VMS, and from what I understand of *nix, spinlocks guard
    most of the core OS data structures, acquired at the software interrupt
    level. If you are not at that level then the spinlock will not provide
    the necessary synchronization for access to that data.

    Software may use
    the technique to spin on a "busy" bit on some IP in a device for
    example, even on a uniprocessor system.

    A valid response here might be, "don't context switch from the
    interrupt handler; use a DPC instead". That may be valid, but
    it puts a constraint on the software designer that may be onerous
    in practice: suppose the interrupt happens in the scheduler,
    while examining a run queue or something. A DPC object must be
    available, etc.
    That is exactly what DPC/SoftIrq are designed to do - allow the bulk of
    the kernel, like the scheduler, non-paged heap management, IO pre and
    post processing, to be interrupted without causing reentrancy.
    The higher level device interrupt enqueues a request to the lower
    software interrupt level, which is processed when it will not cause
    reentrancy.

    That constraint is by design and any good OS will crash if violated.

    Yes, I'm aware of the technique. That wasn't the point.

    Further, software must now consider the complexity of
    potentially interruptable critical sections. From the
    standpoint of reasoning about already-complex concurrency issues
    it's simpler to be able to assert that (almost) all interrupt
    delivery can be cheaply disabled entirely, save for very
    special, specific, scenarios like NMIs. Potentially switching
    away from a thread holding a spinlock sort of defeats the
    purpose of a spinlock in the first place, which is a mutex
    primitive designed to avoid the overhead of switching.
    This is why I mentioned the terminology thing: threads do not hold
    spinlocks, they hold mutexes.

    See above. Threads can certainly "hold" a spin lock, as they
    can hold any kind of lock. To quote from sec 7.6.1 of [Vah96],
    page 202:

    When a particular OS defined spinlock is used to guard a particular OS
    defined data structure, exactly WHEN the OS allows this decides how
    the OS avoids all the nasty interrupt pitfalls you identified.

    |On a uniprocessor, if a thread tries to acquire a spin lock
    |that is already held, it will loop forever. Multiprocessor
    |algorithms, however, must operate correctly regardless of the
    |number of processors, which means that they should handle the
    |uniprocessor case as well. This requires strict adherence to
    |the rule that threads not relinquish control of the CPU while
    |holding a spin lock.

    Yes, which is what I was saying.
    I am describing HOW that is accomplished inside an OS.

    The way you prevent a thread from relinquishing control while holding a spinlock is to first switch hats from thread level to software interrupt
    level, which blocks the scheduler because it also runs at SWIL,
    then acquire the spinlock.

    Raising the interrupt priority level from passive (aka thread) level
    to SWI level (called dispatch level on Windows) blocks the thread switch.
    This ensures non-reentrant algorithms are serialized.
    The core may then acquire spinlocks guarding data structures.
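
    A hedged sketch of that pattern (raise_ipl, lower_ipl, IPL_DISPATCH,
    spin_acquire and spin_release are hypothetical stand-ins, not a real
    API; on Windows the analogous call raises IRQL to DISPATCH_LEVEL and
    takes a KSPIN_LOCK):

    typedef int ipl_t;
    typedef struct { volatile int locked; } spinlock_t;

    #define IPL_PASSIVE  0      /* "thread hat" */
    #define IPL_DISPATCH 2      /* "core hat": no thread switch on this CPU */

    ipl_t raise_ipl(ipl_t level);       /* returns the previous level */
    void  lower_ipl(ipl_t level);
    void  spin_acquire(spinlock_t *);
    void  spin_release(spinlock_t *);

    static spinlock_t sched_lock;

    void
    scheduler_touch(void)
    {
        ipl_t old = raise_ipl(IPL_DISPATCH); /* block the scheduler locally */
        spin_acquire(&sched_lock);           /* exclude the other CPUs */

        /* ... examine or modify scheduler queues ... */

        spin_release(&sched_lock);
        lower_ipl(old);                      /* back to the "thread hat" */
    }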

    Threads and mutexes are a fiction created
    by the OS scheduler, and switching threads while waiting for a mutex
    is exactly what they are supposed to do.

    Spinlocks are used by cores to coordinate access to shared memory by
    things like an OS scheduler looking at a list of threads waiting for a mutex.

    Spinlocks, of the form I was describing, are an optimization for
    cases where the critical section guarded by the lock is short,
    and the overhead of invoking a scheduler and context switching
    to some other thread is higher than just spinning until the lock
    becomes available again. The whole point is to avoid blocking
    and switching.

    E.g., as [Vaj11] puts it (page 98):

    |Spin locks have the advantage of preventing pre-emption by the
    |OS (due to blocking while waiting for the lock)

    That is, the advantage is because the thread does not block and
    yield.

    And I am describing HOW the OS accomplishes this.
    Disabling the scheduler is the same as raising the priority to SWI level.
    Once the scheduler is disabled THEN you may acquire a spinlock
    that guards a shared data structure accessed at SWI level.

    But you may NOT access a shared data structure accessed at interrupt level.
    For that you'd need an interrupt spinlock acquired at the device ISR
    priority interrupt level. And note that the ISR interrupt priority level
    is assigned by the OS itself, usually when the device is mounted.

    I like to think of it as all having to do with hats.
    The cpu is wearing one of three hats: a thread hat when it pretends to be
    a time shared execution machine; a core hat when it acts as a single
    execution machine running non-reentrant things like a scheduler which
    creates the fiction of threads and processes (multiple virtual spaces);
    and an interrupt hat when executing nestable reentrant interrupts.

    The interrupt mechanism coordinates exactly when and how we switch hats.

    I suppose you are lumping system calls into the overall bucket
    of "interrupts" in this model, regardless of whether one might
    use a dedicated supervisor call instruction (e.g., something
    like x86 SYSENTER or SYSCALL or ARM SVC or RISC-V's ECALL)?

    Syscall switches a thread from user mode to supervisor mode.

    The hat analogy feels strained to me, and too colloquial to be
    useful.

    I offered it because it answers your prior questions about how
    OS avoid the kinds of deadlocks and race conditions you described.

    If something you want to do requires a certain hat and you are not
    wearing that hat then you have to get one.

    Inside an HW interrupt service routine and want to run a thread?
    ISR (interrupt hat) posts a Dpc to the software interrupt level (SWI hat)
    Dispatcher, which eventually calls a routine that accesses scheduler
    tables guarded by a spinlock (that requires the SWI hat), which selects
    your thread, so when the OS returns (thread hat) your application code
    executes.
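
    A hedged sketch of that chain, with every name (dpc_t, queue_dpc,
    sched_lock_acquire, make_runnable, ...) a hypothetical stand-in rather
    than any real kernel's API; on Windows the analogues would be a
    KDPC queued with KeInsertQueueDpc and the dispatcher spinlocks:

    typedef struct dpc dpc_t;
    struct thread;

    void queue_dpc(dpc_t *, void (*fn)(void *), void *arg); /* runs at SWI level */
    void sched_lock_acquire(void);  /* caller must already be at SWI level */
    void sched_lock_release(void);
    void make_runnable(struct thread *);

    static dpc_t wake_dpc;
    static void wake_worker(void *ctx);

    /* Device interrupt level ("interrupt hat"): do the minimum, defer the rest. */
    void
    device_isr(void *ctx)
    {
        /* ... acknowledge the device, capture what the DPC will need ... */
        queue_dpc(&wake_dpc, wake_worker, ctx);
    }

    /* Runs later at software-interrupt level ("core hat"). */
    static void
    wake_worker(void *ctx)
    {
        sched_lock_acquire();               /* spinlock guarding scheduler tables */
        make_runnable((struct thread *)ctx);
        sched_lock_release();
        /* on return the dispatcher may switch to that thread ("thread hat") */
    }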

    - Dan C.

    References:

    [Her08] Maurice Herlihy and Nir Shavit. 2008. _The Art of
    Multiprocessor Programming_. Morgan Kaufmann, Burlington, MA.

    [Vah96] Uresh Vahalia. 1996. _Unix Internals: The New
    Frontiers_. Prentice Hall, Upper Saddle River, NJ.

    [Vaj11] Andras Vajda. 2011. _Programming Many-Core Chips_.
    Springer, New York, NY.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Mar 25 11:48:11 2025
    going to make a big difference here. But I also wonder at the
    internal implementation of these instructions in e.g. that
    Fujitsu chip that allows throughput to stay more or less
    constant as the number of cores increases; I'd think that cache
    coherency traffic as threads contend for a single cache line
    would create a bottleneck leading to (at least) a linear
    slow down.

    IIUC the idea is that the atomic-add is sent to the memory and performed
    there, instead of moving the data to the cache. So there's no
    contention for the cache line because the data is not in a cache at all.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Tue Mar 25 16:40:17 2025
    On Tue, 25 Mar 2025 15:10:19 +0000, Dan Cross wrote:

    In article <b201b9212700f9359a831e03b1d60cef@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 25 Mar 2025 13:45:39 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    Nothing gets lost.

    This isn't a hardware transaction. If the thread has updated data
    and you take away the lock, the system is now inconsistent
    and nasal daemons will ensue.

    In HW there is a little thing known as branch misprediction repair,
    where all the work done is made to appear as if those instructions
    were never executed to begin with. The size of the branch mispredict
    window is on the order of 300 instructions now.

    With Spectre and Meltdown even cache updates must be delayed until
    after the causing instruction retires.

    But again, that's irrelevant. The "critical sections" in
    question here are software abstractions created by system
    software.

    Say you have a critical section (of your description) that was
    preceded by an if-statement that was predicted to enter CS, but
    1,000 cycles later the branch is resolved to not enter the CS.
    {{Branch was dependent on an MMI/O location down the PCIe tree
    which is why it took so long to resolve.}}

    HW Backs the core up so that the RF and all memory locations
    contain their original values at the point of the mispredicted
    branch.

    Can thread running on core even tell if you entered CS or not !

    No, your thread cannot.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stefan Monnier on Tue Mar 25 16:29:09 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    going to make a big difference here. But I also wonder at the
    internal implementation of these instructions in e.g. that
    Fujitsu chip that allows throughput to stay more or less
    constant as the number of cores increases; I'd think that cache
    coherency traffic as threads contend for a single cache line
    would create a bottleneck leading to (at least) a linear
    slow down.

    IIUC the idea is that the atomic-add is sent to the memory and performed
    there, instead of moving the data to the cache. So there's no
    contention for the cache line because the data is not in a cache at all.

    Exactly. The LLC may also support the atomics directly.

    Additionally, with PCIe, the atomic can be executed at
    the endpoint function.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Tue Mar 25 16:31:56 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Dan Cross wrote:

    I didn't.
    I am saying the OS by design guards *ITS* different data structures
    with different mechanisms, and those mechanisms must only be used in
    a very controlled manner. And if you want to do something that requires
    access to one of *ITS* data structures then you must obey *ITS* rules.


    https://en.wikipedia.org/wiki/Incompatible_Timesharing_System

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Mar 25 13:28:27 2025
    IIUC the idea is that the atomic-add is sent to the memory and performed
    there, instead of moving the data to the cache. So there's no
    contention for the cache line because the data is not in a cache at all.

    Exactly. The LLC may also support the atomics directly.

    Additionally, with PCIe, the atomic can be executed at
    the endpoint function.

    It should even be possible (no idea if anyone tried to implement it) to
    combine the operations as they bubble up the memory, so your
    N increments on your N cores can turn into N/64 additions of 64 leaving
    each group of 64 cores, making it possible to scale with the number
    of cores.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to ThatWouldBeTelling@thevillage.com on Tue Mar 25 17:44:25 2025
    In article <8uzEP.115550$Xq5f.66883@fx38.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
    Sometimes you really don't want to be interrupted.
    And sometimes you don't want to be interrupted unless the
    "house is on fire"; I cannot see a time when "the house is
    on fire" that you would not want to take the interrupt.

    Is there one ?!?
    Consider a thread that takes a spinlock; suppose some
    high-priority interrupt comes in while the thread is holding
    that lock. In response to the interrupt, software decides to
    suspend the thread and switch some other thread; that thread
    wants to lock the spin lock that the now-descheduled thread is
    holding: a classic deadlock scenario.

    Terminology: mutexes coordinate mutual exclusion between threads,
    spinlocks coordinate mutual exclusion between cpu cores.
    Windows "critical sections" are mutexes with a fast path.

    A spin lock is simply any lock where you spin trying to acquire
    the lock, as opposed to a blocking synchronization protocol.
    Here I'm using the terminology of Herlihy and Shavit [Her08].

    I'm using terminology in common use on multiprocessor systems like
    VMS since 1980's and later WinNT.

    Each OS data structure has different reentrancy requirements.
    - Ones accessed by threads are guarded by mutexes.
    Example on Windows would be the page tables as paging is done by threads.
    - Ones accessed by OS software interrupt level are guarded by spinlocks.
    Example on Windows is the Scheduler tables.
    - Ones accessed by HW interrupts are guarded by interrupt spinlocks.
    Example on Windows is device interrupt service routines.

    I'm not sure what you're saying here, or rather, how it is
    relevant.

    Your earlier statement was that "mutexes coordinate mutual
    exclusion between threads, spinlocks coordinate mutual exclusion
    between cpu cores." But this is not the case.

    Taking VMS as an example, from [Gol94], sec 9.5 (page 255):

    |A thread of execution running on one processor acquires a
    |spinlock to serialize access to the data with threads of
    |execution running on other processors. Before acquiring
    |the spinlock, the thread of executions raises IPL to block
    |accesses by other threads of execution running on the same
    |processor. The IPL value is determined by the spinlock being
    |locked.

    Goldberg and Saravanan seem quite explicit that threads can
    acquire and hold spinlocks on OpenVMS, which seems to be
    something that you were objecting to?

    Turning to Windows, [Rus12] says about spin locks on page 179:

    |Testing and acquiring the lock in one instruction prevents a
    |second thread from grabbing the lock between the time the first
    |thread tests the variable and the time it acquires the lock.

    Note that, again, the language here is thread-centric.

    By the way, regarding terminology, on page 177 [Rus12] also
    says,

    |Sections of code that access a nonshareable resource are called
    |_critical_ sections. To ensure correct code, only one thread
    |at a time can execute in a critical section.

    This is the generic definition of a critical section and note
    that this is the context in which we are using the term
    throughout this thread.

    What you referred to earlier as a "critical section" in Windows
    is a _userspace_ synchronization primitive with a kernel assist
    and mostly unrelated to concerns around interruptibility. From
    the description on page 201 of [Rus12], they seem similar to
    futexes on Linux, though somewhat more limited: Russinovich et
    al states that they cannot be used between processes, while a
    Linux futex can if it's in a region of memory shared between
    cooperating processes.

    Traditional Unix "sleep" and "wakeup" are an example of a
    blocking protocol, where a thread may "sleep" on some "channel",
    yielding the locking thread while the lock cannot be acquired,
    presumably scheduling something else until the thread is later
    marked runnable by virtue of something else calling "wakeup" on
    the sleep channel.

    But note that I am deliberately simplifying in order to
    construct a particular scenario in which a light-weight
    synchronization primitive is being used for mutual exclusion
    between concurrent entities, hence mentioning spinlocks
    specifically.

    Suggesting that spin locks are only applicable to
    multiprocessing scenarios is flatly incorrect.

    I didn't.

    Your exact words were, "spinlocks coordinate mutual exclusion
    between cpu cores." See the text quoted above. This is
    incorrect in that it ignores that threads are, in context, the
    software primitive for the computational entities that are doing
    the acquiring, and while the "between cpu cores" part is the
    usual case, it is not the only case.

    I am saying the OS by design guards *ITS* different data structures
    with different mechanisms, and those mechanisms must only be used in
    a very controlled manner. And if you want to do something that requires
    access to one of *ITS* data structures then you must obey *ITS* rules.

    This is axiomatic.

    But I fail to see how it's relevant to the topic at hand. In
    context we're discussing scenarios where a system may want to
    disable interrupts entirely, and whether a particular aspect of
    a hardware design is sufficient to implement critical sections
    generally.

    In Windows and VMS, and from what I understand of *nix, spinlocks guard
    most of the core OS data structures, acquired at the software interrupt
    level. If you are not at that level then the spinlock will not provide
    the necessary synchronization for access to that data.

    Precisely.

    Software may use
    the technique to spin on a "busy" bit on some IP in a device for
    example, even on a uniprocessor system.

    A valid response here might be, "don't context switch from the
    interrupt handler; use a DPC instead". That may be valid, but
    it puts a constraint on the software designer that may be onerous
    in practice: suppose the interrupt happens in the scheduler,
    while examining a run queue or something. A DPC object must be
    available, etc.
    That is exactly what DPC/SoftIrq are designed to do - allow the bulk of
    the kernel, like the scheduler, non-paged heap management, IO pre and
    post processing, to be interrupted without causing reentrancy.
    The higher level device interrupt enqueues a request to the lower
    software interrupt level, which is processed when it will not cause
    reentrancy.

    That constraint is by design and any good OS will crash if violated.

    Yes, I'm aware of the technique. That wasn't the point.

    Further, software must now consider the complexity of
    potentially interruptable critical sections. From the
    standpoint of reasoning about already-complex concurrency issues
    it's simpler to be able to assert that (almost) all interrupt
    delivery can be cheaply disabled entirely, save for very
    special, specific, scenarios like NMIs. Potentially switching
    away from a thread holding a spinlock sort of defeats the
    purpose of a spinlock in the first place, which is a mutex
    primitive designed to avoid the overhead of switching.

    (It occurs to me that perhaps this is why you are going on about
    how VMS and windows uses spin locks, and how they interact with
    interrupt priority levels.)

    This is why I mentioned the terminology thing: threads do not hold
    spinlocks, they hold mutexes.

    See above. Threads can certainly "hold" a spin lock, as they
    can hold any kind of lock. To quote from sec 7.6.1 of [Vah96],
    page 202:

    When a particular OS defined spinlock is used to guard a particular OS
    defined data structure, exactly WHEN the OS allows this decides how
    the OS avoids all the nasty interrupt pitfalls you identified.

    How is this relevant to what you wrote earlier and my response,
    quoted just above? My statement was that threads can hold spin
    locks. This is obviously true.

    |On a uniprocessor, if a thread tries to acquire a spin lock
    |that is already held, it will loop forever. Multiprocessor
    |algorithms, however, must operate correctly regardless of the
    |number of processors, which means that they should handle the
    |uniprocessor case as well. This requires strict adherence to
    |the rule that threads not relinquish control of the CPU while
    |holding a spin lock.

    Yes, which is what I was saying.
    I am describing HOW that is accomplished inside an OS.

    Why? First, it's well-known, and second, that's not relevant to
    the discussion at hand.

    But it also doesn't seem like that's what you're saying; you
    asserted that a thread (a software thread in this context) does
    not "hold" a spin lock; that's incorrect.

    The way you prevent a thread from relinquishing control while holding a
    spinlock is to first switch hats from thread level to software interrupt
    level, which blocks the scheduler because it also runs at SWIL,
    then acquire the spinlock.

    Raising the interrupt priority level from passive (aka thread) level
    to SWI level (called dispatch level on Windows) blocks the thread switch.
    This ensures non-reentrant algorithms are serialized.
    The core may then acquire spinlocks guarding data structures.

    Just so. Earlier you said that threads cannot hold spin locks;
    in the text quoted just above, you said that they do.

    As for this specific scenario of specific interrupt priority
    levels, I would go further and say that is just a detail. In
    the simplest systems, ie, the kind we might reach for as an
    expository example when discussing these kinds of issues, one
    just disables interrupts completely when holding a spinlock.

    Threads and mutexes are a fiction created
    by the OS scheduler, and switching threads while waiting for a mutex
    is exactly what they are supposed to do.

    Spinlocks are used by cores to coordinate access to shared memory by
    things like an OS scheduler looking at a list of threads waiting for a mutex.
    Spinlocks, of the form I was describing, are an optimization for
    cases where the critical section guarded by the lock is short,
    and the overhead of invoking a scheduler and context switching
    to some other thread is higher than just spinning until the lock
    becomes available again. The whole point is to avoid blocking
    and switching.

    E.g., as [Vaj11] puts it (page 98):

    |Spin locks have the advantage of preventing pre-emption by the
    |OS (due to blocking while waiting for the lock)

    That is, the advantage is because the thread does not block and
    yield.

    And I am describing HOW the OS accomplishes this.

    Well, what you are describing is how _a_ specific family of
    operating systems that you are familiar with accomplishes this,
    and that's all well and good, but it's not all that relevant to
    discussing the pitfalls of this "ATOMIC event" proposal or
    addressing the OP's questions about scenarios in which interrupt
    disable is used.

    Disabling the scheduler is the same as raising the priority to SWI level.
    Once the scheduler is disabled THEN you may acquire a spinlock
    that guards a shared data structure accessed at SWI level.

    But you may NOT access a shared data structure accessed at interrupt level.
    For that you'd need an interrupt spinlock acquired at the device ISR
    priority interrupt level. And note that the ISR interrupt priority level
    is assigned by the OS itself, usually when the device is mounted.

    You are presuming a very specific system architecture and going
    into great detail about its implementation. That's fine, and
    perhaps useful I suppose, but again, I don't think it's very
    relevant to the context of this discussion.

    I like to think of it as all having to do with hats.
    The cpu is wearing one of three hats: a thread hat when it pretends to be
    a time shared execution machine; a core hat when it acts as a single
    execution machine running non-reentrant things like a scheduler which
    creates the fiction of threads and processes (multiple virtual spaces);
    and an interrupt hat when executing nestable reentrant interrupts.

    The interrupt mechanism coordinates exactly when and how we switch hats.

    I suppose you are lumping system calls into the overall bucket
    of "interrupts" in this model, regardless of whether one might
    use a dedicated supervisor call instruction (e.g., something
    like x86 SYSENTER or SYSCALL or ARM SVC or RISC-V's ECALL)?

    Syscall switches a thread from user mode to supervisor mode.

    Your statement was that "the interrupt mechanism coordinates
    exactly when and how we switch hats". But that's incomplete; if
    a system call blocks, for example, the typical behavior is to
    yield to the scheduler to pick something else to do. Clearly,
    that is another way that coordinates this business of changing
    one's hat.

    The hat analogy feels strained to me, and too colloquial to be
    useful.

    I offered it because it answers your prior questions about how
    OS avoid the kinds of deadlocks and race conditions you described.

    I'm afraid you answered a question that I did not ask, then.

    I am discussing the specifics of this architectural proposal,
    and specifically two aspects of it: one is addressing the
    proposer's questions about scenarios where one might want to
    disable interrupts entirely.

    Another, which has emerged somewhat more recently in the thread,
    is this mechanism which seems to be a semi-transactional
    elaboration of LL-SC primitives.

    Any questions I did ask were Socratic in nature, and in the
    context of these architectural discussions. I did not ask any
    questions about basic design or implementation of
    synchronization primitives.

    - Dan C.

    References:

    [Gol94] Ruth E. Goldberg and Saro Saravanan. 1994. _OpenVMS AXP
    Internals and Data Structures_ (Vers. 1.5). Digital Press,
    Newton, MA.

    [Rus12] Mark Russinovich, David A. Solomon and Alex Ionescu.
    2012. _Windows Internals: Part 1_. Microsoft Press, Redmond, WA.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to monnier@iro.umontreal.ca on Tue Mar 25 17:47:11 2025
    In article <jwviknxp0zj.fsf-monnier+comp.arch@gnu.org>,
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    going to make a big difference here. But I also wonder at the
    internal implementation of these instructions in e.g. that
    Fujitsu chip that allows throughput to stay more or less
    constant as the number of cores increases; I'd think that cache
    coherency traffic as threads contend for a single cache line
    would create a bottleneck leading to (at least) a linear
    slow down.

    IIUC the idea is that the atomic-add is sent to the memory and performed
    there, instead of moving the data to the cache. So there's no
    contention for the cache line because the data is not in a cache at all.

    Interesting; that's very clever.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Tue Mar 25 13:53:42 2025
    Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    going to make a big difference here. But I also wonder at the
    internal implementation of these instructions in e.g. that
    Fujitsu chip that allows throughput to stay more or less
    constant as the number of cores increases; I'd think that cache
    coherency traffic as threads contend for a single cache line
    would create a bottleneck leading to (at least) a linear
    slow down.
    IIUC the idea is that the atomic-add is sent to the memory and performed
    there, instead of moving the data to the cache. So there's no
    contention for the cache line because the data is not in a cache at all.

    Exactly. The LLC may also support the atomics directly.

    Additionally, with PCIe, the atomic can be executed at
    the endpoint function.

    Those features are possible but there is no indication that is
    what that web page is measuring.

    I agree with Dan that looks more like a test of the cache efficiency or
    the cache line pinning mechanism than a test of LL/SC vs AtomicXxxx.

    First, it doesn't say what systems these are on.
    It says the cores is Cavium Thunder X2 and Fujitsu A64FX
    but does not describe the systems.

    For each platform it needed to establish a base line performance for
    moving lines between caches, the same ++ loops but without atomics.

    Also Arm is weak ordered. At least one version of those test loops should
    have fences otherwise this doesn't tell you anything about how it might
    behave in real use.

    And as a general guideline, I tend to distrust concurrency benchmarks
    which sit in a tight loop hammering a single memory location because
    they greatly favor whichever core is currently holding the cache line.
    By artificially inflating the contention they really tell you little
    about how it will behave in real world use where there are time gaps
    between core accesses allowing lines to move.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Tue Mar 25 17:54:42 2025
    In article <0343529d63f68c76b5e0d227ef6e39dd@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 25 Mar 2025 15:10:19 +0000, Dan Cross wrote:
    In article <b201b9212700f9359a831e03b1d60cef@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 25 Mar 2025 13:45:39 +0000, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    Nothing gets lost.

    This isn't a hardware transaction. If the thread has updated data
    and you take away the lock, the system is now inconsistent
    and nasal daemons will ensue.

    In HW there is a little thing known as branch misprediction repair,
    where all the work done is made to appear as if those instructions
    were never executed to begin with. The size of the branch mispredict
    window is on the order of 300 instructions now.

    With Spectre and Meltdown even cache updates must be delayed until
    after the causing instruction retires.

    But again, that's irrelevant. The "critical sections" in
    question here are software abstractions created by system
    software.

    Say you have a critical section (of your description) that was
    preceded by an if-statement that was predicted to enter CS, but
    1,000 cycles later the branch is resolved to not enter the CS.
    {{Branch was dependent on an MMI/O location down the PCIe tree
    which is why it took so long to resolve.}}

    HW Backs the core up so that the RF and all memory locations
    contain their original values at the point of the mispredicted
    branch.

    Can thread running on core even tell if you entered CS or not !

    No, your thread cannot.

    Again, how is that relevant? The branch predictor can only
    handle speculative execution of so-many instructions. Attempts
    to perform more instructions than that will stall the pipeline.

    A critical section as defined by software may contain more than
    that number of instructions. A critical section, as defined by
    software, may touch more cache lines than your atomic event
    proposal can handle. Thus, for your architecture to be useful,
    you must provide some mechanism for handling those
    eventualities.

    Your stuff sounds fine for getting into the critsec; but it
    doesn't help you once you're in it, and it seems like it may
    hurt you if some higher-priority thing comes along and yanks
    your lock out from under you.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Tue Mar 25 20:33:16 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    going to make a big difference here. But I also wonder at the
    internal implementation of these instructions in e.g. that
    Fujitsu chip that allows throughput to stay more or less
    constant as the number of cores increases; I'd think that cache
    coherency traffic as threads contend for a single cache line
    would create a bottleneck leading to (at least) a linear
    slow down.
    IIUC the idea is that the atomic-add is sent to the memory and performed
    there, instead of moving the data to the cache. So there's no
    contention for the cache line because the data is not in a cache at all.

    Exactly. The LLC may also support the atomics directly.

    Additionally, with PCIe, the atomic can be executed at
    the endpoint function.

    Those features are possible but there is no indication that is
    what that web page is measuring.

    I agree with Dan that looks more like a test of the cache efficiency or
    the cache line pinning mechanism than a test of LL/SC vs AtomicXxxx.

    First, it doesn't say what systems these are on.
    It says the cores is Cavium Thunder X2 and Fujitsu A64FX
    but does not describe the systems.

    There were actually two Thunder X2 chips - the original was
    the Cavium in-house design, the second was based on the Vulcan
    design that Cavium bought from Broadcom. The Cavium Thunderx2
    chip was rebranded OcteonTx and aimed at the networking market,
    while the Vulcan was named ThunderX2 and sold into the server space.


    For each platform it needed to establish a base line performance for
    moving lines between caches, the same ++ loops but without atomics.

    Also Arm is weak ordered. At least one version of those test loops should
    have fences otherwise this doesn't tell you anything about how it might
    behave in real use.

    The memory model is formally defined in the ARM Architecture manual.
    There are a number of methods available to deal with ordering and
    consistency, including a memory barrier instruction (DMB) which
    can be applied globally or within sharability domains.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Tue Mar 25 20:40:43 2025
    On Tue, 25 Mar 2025 17:54:42 +0000, Dan Cross wrote:

    In article <0343529d63f68c76b5e0d227ef6e39dd@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    --------------------------

    Say you have a critical section (of your description) that was
    preceded by an if-statement that was predicted to enter CS, but
    1,000 cycles later the branch is resolved to not enter the CS.
    {{Branch was dependent on an MMI/O location down the PCIe tree
    which is why it took so long to resolve.}}

    HW Backs the core up so that the RF and all memory locations
    contain their original values at the point of the mispredicted
    branch.

    Can thread running on core even tell if you entered CS or not !

    No, your thread cannot.

    Again, how is that relevant? The branch predictor can only
    handle speculative execution of so-many instructions.

    Around 300 at present.
    Know of any critical sections that long?

    Attempts
    to perform more instructions than that will stall the pipeline.

    Obviously.

    A critical section as defined by software may contain more than
    that number of instructions.

    Sure, its possible.

    But if DECODE cannot push the entire critical section into the
    execution window, then BY DEFINITION, it MUST be able to make
    the entire critical section appear as if you never started on
    it !!!

    IP gets backed up as if you never entered CS
    RF gets backed up as if you never entered CS
    LOCK variable never got written to SNOOPable cache where others can see
    it
    Cache memory gets put back in the "you never got here" states too.

    Thus, it is like it never happened:: both to the core and to all
    interested 3rd parties.

    Nobody saw the Lock variable change value.
    And
    Nobody saw any STs from the CS change anything in memory.

    Thus, it is as if CS never happened, it was all speculative, and
    got put-right when CS was not allowed to finish.

    BUT-----------------------------
    Code running in CS does believe that the LOCK variable was updated
    AND that STs from CS have been performed by the exit of CS
    ------------------------------------------------------------

    You see, SW sees the von Neumann model: 1 instruction is Fetched
    and executed in its entirety before proceeding to the next
    instruction. You see individual instructions--as if single stepping.

    HW, on the other hand sees at least 3 points in time
    a) instruction has not arrived to be executed
    b) instruction is executing
    c) instruction has completed executing

    And the time span from a) to c) can be hundreds of cycles
    while there are 300 (or more) instructions contending for
    execution resources.

    In addition, certain instructions may have more than one b) unit
    of timing--for example a LD-OP-ST has a LD execution, an OP
    execution, and a ST execution.

    But look at the execution of a ST in the shadow of a branch.
    We want ST to be "performed" as soon as possible, so that a
    subsequent LD can obtain the ST-data even before we are allowed
    to write ST-data into the cache. So, your notion of what can be
    backed up (or not) appears not to be able to see all the funny
    business HW goes through to make the core smell fast.

    After the LD receives ST-data, the branch may resolve as a
    bad prediction, and the ST is not allowed to modify cache,
    also the LD was not allowed to see data that was never to be
    written. And HW clean up making it all SMELL like it never
    happened.

    -----------------------------------------------------------

    Interested 3rd parties are another problem to be solved.

    In the midst of a CS a SNOOP comes along and touches a memory
    location you are using in CS.

    a) You can make the CS appear never to have happened
    b) you can delay the SNOOP response
    c) You can wait for the branch to resolve to choose
    ...between a) and b)
    d) same as c) but coherence protocol has a NaK in it.

    (c) is only an option for more advance cache coherence protocols.

    (d) allows a CS in the manifestation part (STs) to have priority
    (sic.) over interfering SNOOP requests--a soft guarantee of
    forward progress.

    A critical section, as defined by
    software, may touch more cache lines than your atomic event
    proposal can handle.

    Sure, but then you would only be using my ATOMICs for locking/
    unlocking not for the work going on.

    Thus, for your architecture to be useful,
    you must provide some mechanism for handling those
    eventualities.

    As stated about 15 responses ago, just use them as ATOMIC primitive
    generators.

    Your stuff sounds fine for getting into the critsec; but it
    doesn't help you once you're in it, and it seems like it may
    hurt you if some higher-priority thing comes along and yanks
    your lock out from under you.

    I conceded that point at least 2 days ago.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to Chris M. Thomasson on Tue Mar 25 22:21:08 2025
    In article <vrv483$3ekk$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 3/25/2025 4:20 AM, Dan Cross wrote:
    In article <vrt37r$27c13$4@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 3/24/2025 1:28 PM, Dan Cross wrote:
    There's no guarantee that the tail is the correct place to
    insert the element, vis the order of the elements.

    https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html

    If you go back and review the thread, you'd see that the problem
    is to remove an element from a queue, and insert it into an
    ordered list. This part of the discussion is concerning the
    latter operation.

    There are lock-free ways to insert into an ordered list, indeed.

    How is that relevant?

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Mar 25 18:49:29 2025
    Dan Cross [2025-03-25 22:19:07] wrote:
    From what I gather, you've said to just eschew the atomic event
    stuff and build a traditional lock using those primitives. But
    you have also said that a traditional lock can be silently
    unlocked if the thread holding it (which, apparently, is an
    abstraction known to hardware) is preempted. It will be like
    the lock had never been acquired.

    Here's the misunderstanding: the lock that can be silently taken away
    is the lock implemented in the CPU to run an ATOMIC sequence, not the
    lock implemented as a data-structure on top of it: once you've run
    (atomically) your `lock_acquire`, of course that lock is yours. The CPU doesn't even know that this data-structure is a lock and has no
    way to steal it.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Tue Mar 25 22:19:07 2025
    In article <e4ecfb53ef4f33f10b6d2548d5930159@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 25 Mar 2025 17:54:42 +0000, Dan Cross wrote:

    In article <0343529d63f68c76b5e0d227ef6e39dd@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    --------------------------

    Say you have a critical section (of your description) that was
    preceded by an if-statement that was predicted to enter CS, but
    1,000 cycles later the branch is resolved to not enter the CS.
    {{Branch was dependent on an MMI/O location down the PCIe tree
    which is why it took so long to resolve.}}

    HW Backs the core up so that the RF and all memory locations
    contain their original values at the point of the mispredicted
    branch.

    Can thread running on core even tell if you entered CS or not !

    No, your thread cannot.

    Again, how is that relevant? The branch predictor can only
    handle speculative execution of so-many instructions.

    Around 300 at present.
    Know of any critical sections that long?

    Oh, absolutely.

    But that's besides the point. The point is that software _can_
    exceed that limit, even if it's a bad idea to do so; it may not
    even be immediately obvious to the programmer, who is most
    likely working in a high-level language, not assembler.

    How will your design accommodate that eventuality?

    Attempts
    to perform more instructions than that will stall the pipeline.

    Obviously.

    A critical section as defined by software may contain more than
    that number of instructions.

    Sure, its possible.

    But if DECODE cannot push the entire critical section into the
    execution window, then BY DEFINITION, it MUST be able to make
    the entire critical section appear as if you never started on
    it !!!

    So ... what, you just don't execute it? Always skip over it?
    Software needs to ensure serial access to some resource. That's
    what critical sections are for. Your architecture needs to
    support that.

    From what I gather, you've said to just eschew the atomic event
    stuff and build a traditional lock using those primitives. But
    you have also said that a traditional lock can be silently
    unlocked if the thread holding it (which, apparently, is an
    abstraction known to hardware) is preempted. It will be like
    the lock had never been acquired.

    But here's the impedance mismatch: it appears that that happens
    _after_ software has already made a decision based on something
    that is rolled back later. If that's mistaken, then say so, but
    if not, then the scheme is fatally flawed: there is literally no
    way you can use it reliably to build anything that doesn't fit
    into your atomic window.

    IP gets backed up as if you never entered CS
    RF gets backed up as if you never entered CS
    LOCK variable never got written to SNOOPable cache where others can see
    it
    Cache memory gets put back in the "you never got here" states too.

    [...a lot of the same stuff, repeated over and over snipped]

    A critical section, as defined by
    software, may touch more cache lines than your atomic event
    proposal can handle.

    Sure, but then you would only be using my ATOMICs for locking/
    unlocking not for the work going on.

    Indeed; that's what I've been saying.

    But you also seem to imply that if a higher priority interrupt
    comes along and preempts the code that's in the critical section
    it will silently unlock the lock that is used to serialize entry
    into the critical section. Is that correct? If so, do you see
    how that makes the whole facility useless for building robust
    mutual exclusion primitives? A lock that is silently released
    because something else comes along and wants a turn instead is
    no lock at all.

    Thus, for your architecture to be useful,
    you must provide some mechanism for handling those
    eventualities.

    As stated about 15 responses ago, just use them as ATOMIC primitive
    generators.

    Yes. Was that the same response where you said the lock would
    be silently released if a higher-priority interrupt came along?

    Your stuff sounds fine for getting into the critsec; but it
    doesn't help you once you're in it, and it seems like it may
    hurt you if some higher-priority thing comes along and yanks
    your lock out from under you.

    I conceded that point at least 2 days ago.

    Ok, now we're back to the question then: what purpose does it
    serve? It doesn't seem like it can be usefully applied to
    anything other than building traditional mutexes based on spin
    locks, and even then I'm not sure if that's correct (see above).

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to Chris M. Thomasson on Tue Mar 25 23:29:31 2025
    In article <vrvdth$d24n$1@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 3/25/2025 3:21 PM, Dan Cross wrote:
    In article <vrv483$3ekk$2@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 3/25/2025 4:20 AM, Dan Cross wrote:
    In article <vrt37r$27c13$4@dont-email.me>,
    Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    On 3/24/2025 1:28 PM, Dan Cross wrote:
    There's no guarantee that the tail is the correct place to
    insert the element, vis the order of the elements.

    https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html

    If you go back and review the thread, you'd see that the problem
    is to remove an element from a queue, and insert it into an
    ordered list. This part of the discussion is concerning the
    latter operation.

    There are lock-free ways to insert into an ordered list, indeed.

    How is that relevant?

    inserting into an ordered list is not relevant?

    The fact that you can use a lock-free algorithm to do so seems
    utterly beside the point.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to monnier@iro.umontreal.ca on Tue Mar 25 23:33:43 2025
    In article <jwvjz8coho7.fsf-monnier+comp.arch@gnu.org>,
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    Dan Cross [2025-03-25 22:19:07] wrote:
    From what I gather, you've said to just eschew the atomic event
    stuff and build a traditional lock using those primitives. But
    you have also said that a traditional lock can be silently
    unlocked if the thread holding it (which, apparently, is an
    abstraction known to hardware) is preempted. It will be like
    the lock had never been acquired.

    Here's the misunderstanding: the lock that can be silently taken away
    is the lock implemented in the CPU to run an ATOMIC sequence, not the
    lock implemented as a data-structure on top of it: once you've run
    (atomically) your `lock_acquire`, of course that lock is yours. The CPU
    doesn't even know that this data-structure is a lock and has no
    way to steal it.

    I hope you are right, and I'd like Mitch to confirm.

    Even so, I don't see the point of the proposed atomic framework.
    It seems to have no significant advantage over a more
    traditional scheme; CAS, LL-SC, whatever seem more general and
    less intertwined with primitives that the hardware has no real
    business injecting itself into (like threads).

    - Dan C.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Mar 26 01:00:46 2025
    On Tue, 25 Mar 2025 22:49:29 +0000, Stefan Monnier wrote:

    Dan Cross [2025-03-25 22:19:07] wrote:
    From what I gather, you've said to just eschew the atomic event
    stuff and build a traditional lock using those primitives. But
    you have also said that a traditional lock can be silently
    unlocked if the thread holding it (which, apparently, is an
    abstraction known to hardware) is preempted. It will be like
    the lock had never been acquired.

    Here's the misunderstanding: the lock that can be silently taken away
    is the lock implemented in the CPU to run an ATOMIC sequence, not the
    lock implemented as a data-structure on top of it: once you've run (atomically) your `lock_acquire`, of course that lock is yours. The CPU doesn't even know that this data-structure is a lock and has no
    way to steal it.

    You are getting at the "why". In Compare-Swap atomic sequences with
    your typical ISA, the HW does not see the initial LD (of the memory participating in the compare-swap instruction) as locked. In the LL-
    SC version, HW does have that tidbit of information.

    With that information, the event is secured both by data and by
    address, preventing the ABA problem {At least in ESM}.
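
    For concreteness (an illustrative sketch only, not code from anyone in
    this thread): the classic ABA hazard with a plain compare-and-swap,
    here a lock-free stack pop in C11. If another thread pops A, pops B,
    and pushes A back between the load and the CAS, the CAS still succeeds
    because the value matches, even though `old->next` is stale; an LL/SC
    or ESM style event would instead fail because the location was written,
    independent of its value.

    #include <stdatomic.h>
    #include <stddef.h>

    struct node { struct node *next; };

    static _Atomic(struct node *) top;

    struct node *
    pop(void)
    {
        struct node *old = atomic_load(&top);
        while (old != NULL &&
               !atomic_compare_exchange_weak(&top, &old, old->next))
            ;   /* retry; a failed CAS refreshes `old` */
        return old; /* in the ABA case, top now points at a stale node */
    }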



    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Wed Mar 26 00:56:27 2025
    On Tue, 25 Mar 2025 22:19:07 +0000, Dan Cross wrote:

    In article <e4ecfb53ef4f33f10b6d2548d5930159@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    Your stuff sounds fine for getting into the critsec; but it
    doesn't help you once you're in it, and it seems like it may
    hurt you if some higher-priority thing comes along and yanks
    your lock out from under you.

    I conceded that point at least 2 days ago.

    Ok, now we're back to the question then: what purpose does it
    serve?

    You (well Chris) can write critical sections where a low priority
    holder can give way to the higher priority requestor such that
    the low priority holder reverts back to before it entered CS,
    and does not see the lock being stolen.

    For the sizes of CSs this covers--it prevent priority inversion.

    It doesn't seem like it can be usefully applied to
    anything other than building traditional mutexes based on spin
    locks, and even then I'm not sure if that's correct (see above).

    And here I thought you had an imagination .....

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Mar 25 22:58:08 2025
    Even so, I don't see the point of the proposed atomic framework.
    It seems to have no significant advantage over a more
    traditional scheme; CAS, LL-SC, whatever seem more general and
    less intertwined with primitives that the hardware has no real
    business injecting itself into (like threads).

    AFAICT it's somewhat similar to ARM's LL/SC (which can also cover some
    number of cache lines) except that it takes care of looping for you: if
    the final store fails because of interference, My 66000 will "simply" abort
    the whole operation (reusing the speculation hardware) and start over,
    instead of having to write an explicit loop.
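
    As a point of comparison (a sketch only, approximating LL/SC with a C11
    compare-exchange since C exposes no direct LL/SC): this is the
    "explicit loop" being referred to, where software owns the retry rather
    than the hardware aborting and restarting the event.

    #include <stdatomic.h>
    #include <stdbool.h>

    bool
    set_flag_if_clear(atomic_uint *flags, unsigned int bit)
    {
        unsigned int old = atomic_load(flags);
        unsigned int new;

        do {
            if (old & bit)
                return false;       /* already set; give up */
            new = old | bit;
            /* a failed exchange refreshes `old`; loop and retry */
        } while (!atomic_compare_exchange_weak(flags, &old, new));
        return true;
    }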

    Without actual hardware and software to try it's hard to assess how well
    it's going to work, but the design sounds good: it should lead to more
    compact code than LL/SC, be more general/friendly to lock-free coding
    than T&S/C&S, and also compared to LL/SC it should behave better under
    high contention because it is careful to guarantee overall forward
    progress (in case of conflict between two cores, at least one of them
    can finish successfully).

    IIRC there's also some kind of mechanism to retry more cleverly, i.e. so
    that when N cores are in conflicts, not only one of them succeeds, but
    the N-1 remaining cores have a fighting chance of avoiding bumping into
    each other next time around. I don't know the details, tho, and IIUC
    this is not automatic: some relevant info is provided but it's up to the
    code to do something useful with it.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Wed Mar 26 04:08:57 2025
    In article <b1c74762d01f71cc1b8ac838dcf6d4fa@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 25 Mar 2025 22:19:07 +0000, Dan Cross wrote:

    In article <e4ecfb53ef4f33f10b6d2548d5930159@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    Your stuff sounds fine for getting into the critsec; but it
    doesn't help you once you're in it, and it seems like it may
    hurt you if some higher-priority thing comes along and yanks
    your lock out from under you.

    I conceded that point at least 2 days ago.

    Ok, now we're back to the question then: what purpose does it
    serve?

    You (well Chris) can write critical sections where a low priority
    holder can give way to the higher priority requestor such that
    the low priority holder reverts back to before it entered CS,
    and does not see the lock being stolen.

    You still have not defined what you mean by, "the lock being
    stolen."

    Tell you what. Earlier, you posted some code for a critical
    section that inserts a node into a linked list. That code is
    written in terms of some primitives (presumably intrinsics,
    though you call them macros); `esmLOCKload`, `esmLOCKprefetch`, `esmINTERFERENCE` and `esmLOCKstore`. Here's that code, slightly
    reformatted:

    // Removes the element `from` from a doubly linked list, and
    // inserts it into a (possibly different) doubly linked list,
    // after the element pointed to by `to`.
    // Assumes <stdbool.h>
    bool
    MoveElement(Element *from, Element *to)
    {
        assert(from != NULL);
        assert(to != NULL);

        Element *from_next = esmLOCKload(from->next);
        Element *from_prev = esmLOCKload(from->prev);
        Element *to_next = esmLOCKload(to->next);
        esmLOCKprefetch(from_next);
        esmLOCKprefetch(from_prev);
        esmLOCKprefetch(to_next);
        if (!esmINTERFERENCE()) {
            // Unlink the "from" node from its list.
            from_prev->next = from_next;    // XXX: `from_prev` may be nil?
            from_next->prev = from_prev;    // XXX: `from_next` may be nil?
            // Insert the node after "to".
            to->next = from;
            to_next->prev = from;           // XXX: `to_next` may be nil?
            from->prev = to;
            esmLOCKstore(from->next, to_next);
            return true;
        }
        return false;
    }

    Leaving aside the unchecked boundary conditions I highlighted
    above, there are other questions that this raises. You
    mentioned that the ESM stuff can use up to 8 cache lines in a
    single atomic "event"; let's count how many this code may
    encounter:

    1. `from` may point to a cache line
    2. `to` may point to a different cache line.
    3. Does your architecture use a stack? What sort of alignment
    criteria do you impose? At first blush, it seems to me that
    the pointers `from_next`, `from_prev`, and `to_next` could be
    on the stack and if so, will be on at least one cache line,
    and possibly 2, if the call frame for `MoveElement` spans
    more than one. Of course, it's impossible to say when this
    is compiled, but maybe you require that the stack is always
    aligned to a cache line boundary or something. Or maybe
    these are all in registers and it's fine.
    4. Depending on the size and alignment criteria of the
    `Element` structure and the placement of the `next` and
    `prev` elements in the struct (unknowable from this example),
    `to->next` may be on another cache line.
    5. Similarly, `from->next`
    6. And `from->prev`
    7. And `from_prev->next`
    8. And `from_next->prev`
    9. And `to_next->prev`.

    So potentially this code _could_ touch 9 or 10 cache lines, more
    than you said are supported for a single atomic event. What
    happens in that case? Does the code just always return false?
    The programmer best make sure the optimizer is keeping those
    temporary pointers off the stack; good luck with a debug build.

    Anyway, let's assume that's not the case, and we don't hit the
    pathological case with cache lines, and we're not dealing with
    edge cases at the front or end of the lists (so various next or
    prev pointers are not NULL). Why don't I try to explain what I
    think this code is doing, and you tell me whether I'm right or
    wrong.

    If I understand this correctly, the `esmLOCKload`s are simply
    loads, but they make sure that the compiler emits instructions
    that tie into your atomic support framework. Earlier you wrote, "esmINTERFERENCE is a query on the state of the event and whether
    a 2nd party has performed a memory action that kills the event."
    Ok, I take this to mean _at the time of the store_, which
    appears later in the program. If it were simply a query at the
    time it was written in the code, then it seems like it opens a
    TOCTOU bug (what if something interferes after the check, but
    before the locked store at the end?). Or perhaps it signals
    to the architecture that this is the point to which the
    processor should rollback, if anything subsequently fails; sort
    of a `setjmp` kind of thing that conceptually returns twice.

    Assuming nothing has poisoned the event, the various pointer
    stores inside the conditional are conditionally performed; the
    `esmLOCKstore` to `from->next` commits the event, at which
    point all side effects are exposed and durable.

    Is that correct? Let's suppose that it is.

    But later you wrote,

    So, if you want the property whereby the lock disappears on any
    control transfer out of the event {exception, interrupt, SVC, SVR, ...};
    then you want to use my ATOMIC stuff; otherwise, you can use the
    normal ATOMIC primitives everyone and his brother provide.

    Well, what precisely do you mean here when you say, "if you want
    the property whereby the lock disappears on any control transfer
    out of the event"?

    What I _hope_ you mean is that if you transfer "out of the
    event" before the `esmLOCKstore` then anything that's been done
    since the "event" started is rolled back, but NOT if control
    transfer happens _after_ the `esmLOCKstore`. If THAT were the
    case, then this entire mechanism is useless.

    Perhaps you would clarify?

    But for the time being, let's assume that `esmLOCKstore` is
    conceptually a commit that ends the atomic event, and the store
    is durable and observable after that. Let's see if we can
    construct a simple spinlock from these primitives.

    typedef struct Spinlock Spinlock;
    struct Spinlock {
        uint32_t lock;
    };

    uint32_t
    xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
    {
        uint32_t v = esmLOCKload(&lk->lock);
        if (v == expected && !esmINTERFERENCE()) {
            esmLOCKstore(&lk->lock, val);
        }
        return v;
    }

    void
    spin_acquire(volatile Spinlock *lk)
    {
        splhi();
        while (xcas(lk, 0, 1) != 0)
            relax();    // Spin loop intrinsic.
    }

    void
    spin_release(volatile Spinlock *lk)
    {
        xcas(lk, 1, 0);
        // Maybe we should `spllo` here, maybe not. Either choice
        // reflects lots of assumptions about whether or not we might
        // take and hold multiple locks concurrently, etc.
        spllo();
    }

    Does this seem correct? It is not immediately clear to me
    whether `esmINTERFERENCE` is actually needed here, but
    avoiding the store seemed fine. One could imagine shorting
    `xcas` to simply:

    uint32_t
    xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
    {
        uint32_t v = esmLOCKload(&lk->lock);
        if (v == expected) esmLOCKstore(&lk->lock, val);
        return v;
    }

    or perhaps:

    uint32_t
    xswap(volatile Spinlock *lk, uint32_t val)
    {
        uint32_t v = esmLOCKload(&lk->lock);
        esmLOCKstore(&lk->lock, val);
        return v;
    }

    But that all depends on the specifics of the atomics.

    For the sizes of CSs this covers--it prevents priority inversion.

    Except you've admitted that this is not a general solution, as
    it cannot "cover" arbitrary software-defined critical sections:
    the window isn't big enough and we've seen how even a fairly
    small problem may exceed the number of cache lines supported by
    one of these events.

    So I have to implement solutions to priority inversion problems
    outside of your scheme anyway. Clearly, whatever mechanism is
    chosen to do so is, by definition, also general enough to cover
    the use cases for which your atomic scheme is best suited.

    Plus, as you stated, this whole thing is wrapped up in notions
    of threads and priorities that appear to try and subsume some
    functionality from the OS, whether it's a good fit for a
    particular OS's expectations or not; as a simple example, the
    priorities of threads can change over time, etc, and it's not
    clear how to keep that state in sync between the OS and
    hardware, and so on.

    So now I, as a systems software person, have to ask design
    question around how I want to use these architectural features,
    if at all, weighing various pros and cons. This is where I'm
    struggling with your proposal: given that I'm going to have to
    implement more general support for the things that the
    architecture gives me anyway, what do I gain by also supporting
    the specialized mechanisms? Particularly if, in order to use
    them, I'm also forced into a "shape" for my system that's
    dictated by the hardware, but that may not be the best for my
    software? Is whatever the hardware gives me worth the tradeoff?

    Right now, I'm not seeing that it is. "It solves priority
    inversion for the size of problems that it's applicable to" is
    not enough to justify the additional cost of adoption. I have
    to solve that problem anyway, whether I use this or not.

    It doesn't seem like it can be usefully applied to
    anything other than building traditional mutexes based on spin
    locks, and even then I'm not sure if that's correct (see above).

    And here I thought you had an imagination .....

    You admitted you're not an OS person and asked other people to
    poke holes in your proposal. I'm sorry if you find the holes
    I'm poking upsetting.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to monnier@iro.umontreal.ca on Wed Mar 26 04:40:43 2025
    In article <jwv34f0o75h.fsf-monnier+comp.arch@gnu.org>,
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    Even so, I don't see the point of the proposed atomic framework.
    It seems to have no significant advantage over a more
    traditional scheme; CAS, LL-SC, whatever seem more general and
    less intertwined with primitives that the hardware has no real
    business injecting itself into (like threads).

    AFAICT it's somewhat similar to ARM's LL/SC (which can also cover some
    number of cache lines) except that it takes care of looping for you: if
    the final store fails because of interference, My 66000 will "simply" abort
    the whole operation (reusing the speculation hardware) and start over,
    instead of having to write an explicit loop.

    Without actual hardware and software to try it's hard to assess how well
    it's going to work, but the design sounds good: it should lead to more
    compact code than LL/SC, be more general/friendly to lock-free coding
    than T&S/C&S, and also compared to LL/SC it should behave better under
    high contention because it is careful to guarantee overall forward
    progress (in case of conflict between two cores, at least one of them
    can finish successfully).

    Ah, thanks, Stefan. That makes more sense.

    Perhaps some day there will be hardware and systems software and
    we can see how it actually works out in practice. I remain
    skeptical given the example code that was posted, and it appears
    that the claims are still just that: claims.

    Regardless, if anyone has the sort of track record to
    demonstrate that they can pull it off on the hardware side,
    Alsup certainly does, so maybe my skepticism is overly
    pessimistic and unwarranted.

    - Dan C.

    IIRC there's also some kind of mechanism to retry more cleverly, i.e. so
    that when N cores are in conflicts, not only one of them succeeds, but
    the N-1 remaining cores have a fighting chance of avoiding bumping into
    each other next time around. I don't know the details, tho, and IIUC
    this is not automatic: some relevant info is provided but it's up to the
    code to do something useful with it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Mar 19 13:54:01 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 18 Mar 2025 18:51:57 +0000, EricP wrote:


    This affects what it can do with PCR inside the WoU if it causes a state
    change that cannot be undone, for example, reading the interrupt data
    FIFO.

    Since the device is accessed through MMI/O space, and since My 66000
    demands sequential consistency for MMI/O accesses, the LD of the
    interrupt data FIFO cannot leave the confines of the core until
    all pending memory references from the interrupted thread have
    "been performed".

    Be careful when relying on the ordering of outbound MMI/O accesses. They're not
    ordered with respect to DMA accesses initiated from the device.

    With PCI Express, software can also enable RO (Relaxed Ordering) which
    changes the normal fairly strict ordering rules, particularly with respect
    to non-posted transactions vis a vis posted transactions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Mar 19 13:49:54 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 18 Mar 2025 21:59:51 +0000, Robert Finch wrote:

    On 2025-03-18 2:51 p.m., EricP wrote:
    ----------------

    Seems like it could work. There are difficulties using counters with the
    Q+ ISA as multiple interrupt enable bits could be manipulated at the
    same time. One interrupt may be enabled and another one disabled by the
    same instruction.

    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    Because sometimes you need to. For example, when making a configuration
    change to the device, you'll want to disable interrupts that may be
    generated as a result of the configuration change (e.g. when flushing
    a FIFO on the device, you may not want the 'FIFO Empty' interrupt
    to be asserted).

    Granted, there might be an enable gate on the device itself
    that can be used for that, but generally speaking there are other cases
    where it will make sense to disable delivery at the interrupt controller
    or core (the latter, particularly, during certain critical sections).


    When running at any privilege, do you not want to accept an interrupt
    at higher priority with as little delay as possible ??

    Not while the core is executing an OS/hypervisor/device driver critical
    section sensitive to interrupts.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Wed Mar 19 14:03:56 2025
    In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    So that you can have some mechanism for building critical
    sections that span multiple instructions, where just having
    access to atomics and spin locks isn't enough.

    When running at any privilege, do you not want to accept an interrupt
    at higher priority with as little delay as possible ??

    Usually the assumption is that the critical section is short; a
    handful of instructions, perhaps, so that the cost of blocking a
    high priority interrupt is low.
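
    As a minimal sketch of that style (using the splhi()/spllo()
    primitives that appear in the spinlock example elsewhere in this
    thread; Thread and ready_head are made-up names):

        void
        enqueue_ready(Thread *t)
        {
            splhi();                  /* block interrupts briefly */
            t->next = ready_head;     /* a handful of instructions */
            ready_head = t;           /* touching shared per-CPU state */
            spllo();                  /* window stays short */
        }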

    Does Interrupt Disable help in this goal ??

    This is predicated on an incomplete view of the situation and
    the intersection between hardware and software. Serving high
    priority interrupts quickly is not the only consideration.

    Is there a SANE way to avoid disabling of interrupts ??

    It depends on your definition of "sane" in this context. One
    may imagine redesigning all of a system's locking primitives,
    perhaps, to avoid the need to block interrupts in most cases.
    But that pushes a lot of complexity onto software for dubious
    benefit, and even then, it's not clear to me that they can be
    blocked in _all_ cases.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Mar 19 11:20:20 2025
    MitchAlsup1 [2025-03-17 21:49:18] wrote:
    On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:
    Method 1 is simplest, it injects the interrupt request at Retire
    as that's where the state of everything is synchronized.
    [...]
    Lowest interrupt Latency
    Highest waste of power (i.e., work)

    Method 2 pipelines the switch by injecting the interrupt request at
    Fetch.
    [...]
    Interrupt latency is dependent on executing instructions,
    Lowest waste of power

    What do existing CPUs do?
    Method 2 looks insanely messy, so I'd assume method 1 is the standard
    approach, right?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Mar 19 15:54:26 2025
    On Wed, 19 Mar 2025 13:49:54 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 18 Mar 2025 21:59:51 +0000, Robert Finch wrote:

    On 2025-03-18 2:51 p.m., EricP wrote:
    ----------------

    Seems like it could work. There are difficulties using counters with the
    Q+ ISA as multiple interrupt enable bits could be manipulated at the
    same time. One interrupt may be enabled and another one disabled by the
    same instruction.

    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    Because sometimes you need to. For example, when making a configuration change to the device, you'll want to disable interrupts that may be
    generated as a result of the configuration change (e.g. when flushing
    a FIFO on the device, you may not want the 'FIFO Empty' interrupt
    to be asserted).

    The above is more quickly performed by raising the priority level above
    those of interrupts a device may send.

    {{This is part of my whole point--one can use priority as if it were enable/disable, and once you get priority setup as you perform inter-
    rupt control transfer, you no longer need DI or EI.}}
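
    As a sketch of what that looks like to a driver (set_core_priority()
    here is a made-up helper standing in for "raise the core's priority
    above anything this device can send", not a documented instruction):

        void
        reconfigure_device(Device *dev)
        {
            int old = set_core_priority(PRIO_ABOVE_DEVICE_IRQS);
            flush_device_fifo(dev);   /* 'FIFO Empty' cannot preempt us */
            set_core_priority(old);   /* queued interrupts delivered now */
        }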

    Granted, there might be an enable gate on the device itself

    As illustrated before, there are at least 7 gates between device and
    a core.

    that can be used for that, but generally speaking there are other cases
    where it will make sense to disable delivery at the interrupt controller
    or core (the later, particularly, during certain critical sections).


    When running at any privilege, do you not want to accept an interrupt
    at higher priority with as little delay as possible ??

    Not while the core is executing an OS/hypervisor/device driver critical section sensitive to interrupts.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Mar 19 15:56:53 2025
    On Wed, 19 Mar 2025 13:54:01 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 18 Mar 2025 18:51:57 +0000, EricP wrote:


    This affects what it can do with PCR inside the WoU if it causes a state
    change that cannot be undone, for example, reading the interrupt data
    FIFO.

    Since the device is accessed through MMI/O space, and since My 66000
    demands sequential consistency for MMI/O accesses, the LD of the
    interrupt data FIFO cannot leave the confines of the core until
    all pending memory references from the interrupted thread have
    "been performed".

    Be careful when relying on the ordering of outbound MMI/O accesses.
    They're not
    ordered with respect to DMA accesses initiated from the device.

    When device DMA gets translated into MMI/O space my HostBridge will
    treat it as sequentially consistent.

    When device DMA gets translated into DRAM space my HostBridge will
    treat it as causally ordered.

    With PCI Express, software can also enable RO (Relaxed Ordering) which
    changes the normal fairly strict ordering rules, particularly with respect
    to non-posted transactions vis a vis posted transactions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Wed Mar 19 15:59:08 2025
    On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:

    In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    So that you can have some mechanism for building critical
    sections that span multiple instructions, where just having
    access to atomics and spin locks isn't enough.

    You can do this with priority, too.

    When running at any privilege, do you not want to accept an interrupt
    at higher priority with as little delay as possible ??

    Usually the assumption is that the critical section is short; a
    handful of instructions, perhaps, so that the cost of blocking a
    high priority interrupt is low.

    Does Interrupt Disable help in this goal ??

    This is predicated on an incomplete view of the situation and
    the intersection between hardware and software. Serving high
    priority interrupts quickly is not the only consideration.

    Is there a SANE way to avoid disabling of interrupts ??

    It depends on your definition of "sane" in this context. One
    may imagine redesigning all of a system's locking primitives,
    perhaps, to avoid the need to block interrupts in most cases.
    But that pushes a lot of complexity onto software for dubious
    benefit, and even then, it's not clear to me that they can be
    blocked in _all_ cases.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Mar 19 16:14:45 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 19 Mar 2025 13:49:54 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 18 Mar 2025 21:59:51 +0000, Robert Finch wrote:

    On 2025-03-18 2:51 p.m., EricP wrote:
    ----------------

    Seems like it could work. There are difficulties using counters with the
    Q+ ISA as multiple interrupt enable bits could be manipulated at the
    same time. One interrupt may be enabled and another one disabled by the
    same instruction.

    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    Because sometimes you need to. For example, when making a configuration
    change to the device, you'll want to disable interrupts that may be
    generated as a result of the configuration change (e.g. when flushing
    a FIFO on the device, you may not want the 'FIFO Empty' interrupt
    to be asserted).

    The above is more quickly performed by raising the priority level above
    those of interrupts a device may send.

    Most device interrupts have the same priority. So your solution blocks
    all interrupts effectively, even though only the interrupt for the
    particular device owned by the device driver needs to be blocked.
    Enabling really needs to be done on a per-device-instance basis.

    There aren't enough priorities to have a separate priority for each
    device instance (e.g. 1000 virtual functions owned by a single PF,
    each with 10 individual interrupts).

    If every device was required to have different priorities, you have
    a nightmare scenario just trying to figure out which device interrupt
    should be where in the priority hierarchy to avoid potential deadlocks.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Mar 19 16:00:43 2025
    On Wed, 19 Mar 2025 15:20:20 +0000, Stefan Monnier wrote:

    MitchAlsup1 [2025-03-17 21:49:18] wrote:
    On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:
    Method 1 is simplest, it injects the interrupt request at Retire
    as that's where the state of everything is synchronized.
    [...]
    Lowest interrupt Latency
    Highest waste of power (i.e., work)

    Method 2 pipelines the switch by injecting the interrupt request at
    Fetch.
    [...]
    Interrupt latency is dependent on executing instructions,
    Lowest waste of power

    What do existing CPUs do?

    Lean in the direction of method 2 with a fallback position of method 1
    if anything goes wrong.

    Method 2 looks insanely messy, so I'd assume method 1 is the standard approach, right?

    It is "not that bad" on the large scale of things.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Wed Mar 19 13:07:05 2025
    Robert Finch wrote:

    I am going for a simple approach. Since the CSR instruction is used to manipulate the interrupt enable bits in the status register, if there is
    any CSR instruction in the pipeline, then interrupts will be masked off
    until the CSR clears. One issue with this type of approach is evil
    software could issue a lot of CSRs preventing interrupt servicing from happening. I could fine tune the decode to sort out CSRs just affecting
    the interrupt flags. It always seems to be more logic.

    KISS is usually best - just single step the major actions.

    I added single step to my uArch with a NotifyDecode flag in the uOp.
    To single step Decode emits a special nop uOp with a NotifyDecode bit
    and sets a Stop flip-flop that stalls Decode. The Fetch buffer fills up
    and Fetch stalls itself. When the special nop reaches Retire we know all
    older instructions are done. Retire sees the NotifyDecode bit and pulses
    a wire that clears the Decode Stop FF.

    Decode then emits the single step instruction uOp again with the
    NotifyDecode bit and sets its Stop FF. When that uOp reaches Retire
    it pulses the wire and clears the Decode Stop flag, which reads the
    Fetch buffer and restarts the front end.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Mar 19 19:03:58 2025
    On Wed, 19 Mar 2025 16:14:45 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 19 Mar 2025 13:49:54 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 18 Mar 2025 21:59:51 +0000, Robert Finch wrote:

    On 2025-03-18 2:51 p.m., EricP wrote:
    ----------------

    Seems like it could work. There are difficulties using counters with the
    Q+ ISA as multiple interrupt enable bits could be manipulated at the
    same time. One interrupt may be enabled and another one disabled by the
    same instruction.

    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    Because sometimes you need to. For example, when making a configuration
    change to the device, you'll want to disable interrupts that may be
    generated as a result of the configuration change (e.g. when flushing
    a FIFO on the device, you may not want the 'FIFO Empty' interrupt
    to be asserted).

    The above is more quickly performed by raising the priority level above
    those of interrupts a device may send.

    Most device interrupts have the same priority. So your solution blocks
    all interrupts effectively,

    Only to the core that took the interrupt, other cores remain to take
    those other exceptions--and only for the duration of the interrupt
    processing. So if you have 100 cores, and one is processing SATA[k]
    on core[j], there are 99 other cores that can take any pending
    interrupt from that or other interrupt tables.

    even though only the interrupt for the particular device owned by the
    device driver needs to be blocked.* Enabling really needs to be done on
    a per-device-instance basis.

    Are you now arguing that interrupt en/disable should be processed at the device-instance ?? that is far from the core !!

    Certainly if the device-instance has not been assigned to a core/VM
    then it should not be sending interrupts (to anyone). This can be done
    at any of the first 4 layers of interrupt enablement. Heck--it is
    unlikely
    to have an interrupt table assigned, and would get terminated by the
    I/O MMU (if not before).

    But during the assignment to a core/VM the device has the routing
    (I/O MMU) to send interrupts that can be taken by a core running
    that GuestOS/VM.

    There aren't enough priorities to have a separate priority for each
    device instance (e.g. 1000 virtual functions owned by a single PF,
    each with 10 individual interrupts).

    Priority only does the second order of the sort, allowing certain
    interrupts to bypass in front of other interrupts. Interrupt table selection/assignment does a first level of sort. Then who is looking
    at which table performs the third layer of the sort/routing.

    If every device was required to have different priorities, you have
    a nightmare scenario just trying to figure out which device interrupt
    should be where in the priority hierarchy to avoid potential deadlocks.

    Multiple device-instances will use different interrupt tables as each GuestOS/VM has at least its own interrupt table. In which case, the
    priority in one table is orthogonal to the priority in another table.

    (*) coming back to::
    even though only the interrupt for the particular device owned by the device driver needs to be blocked.*

    The interrupts are queued and until the priority of the core drops below
    that of the pending interrupts, the driver is not stimulated into
    activity
    again (recursively or reentrantly).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Wed Mar 19 19:17:31 2025
    On Wed, 19 Mar 2025 17:27:46 +0000, Robert Finch wrote:

    On 2025-03-19 12:00 p.m., MitchAlsup1 wrote:
    On Wed, 19 Mar 2025 15:20:20 +0000, Stefan Monnier wrote:

    MitchAlsup1 [2025-03-17 21:49:18] wrote:
    On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:
    Method 1 is simplest, it injects the interrupt request at Retire
    as that's where the state of everything is synchronized.
    [...]
    Lowest interrupt Latency
    Highest waste of power (i.e., work)

    Method 2 pipelines the switch by injecting the interrupt request at
    Fetch.
    [...]
    Interrupt latency is dependent on executing instructions,
    Lowest waste of power

    What do existing CPUs do?

    Lean in the direction of method 2 with a fallback position of method 1
    if anything goes wrong.

    Method 2 looks insanely messy, so I'd assume method 1 is the standard
    approach, right?

    It is "not that bad" on the large scale of things.

    Undaunted he tries to implement the method 2 approach using an IRQ
    victim queue for IRQs that should not be processed at retire. It is not
    too bad to code, but I dunno if it works yet. Just a switch at fetch
    time to select from the victim queue first, otherwise pop the normal IRQ queue. The queue is small, assuming that IRQs are not disabled for great lengths of time.

    Interrupt table MSI-X message undergoes negotiation.

    One possible result of interrupt negotiation is that core can reject
    said
    interrupt, and message is returned to the interrupt table.

    During this time, core is not doing anything but running its previously assigned thread. The negotiation takes place in state machines more
    optimally positioned for throughput and latency than the function units
    within cores. So, core has not done anything speculative wrt IRQing
    except that which has a well developed protocol.

    Now, once core claims the interrupt, it performs as if it were the
    typical von Neumann "perform 1 instruction then move on" Mantra
    enforced by boatloads of reservations stations and predictors and
    prediction repair mechanisms. Accessing memory with side effects
    is not one of those mechanisms and must be ordered properly wrt
    {interrupts, exceptions, mispredictions, and program order}.

    But on the scale of a FDIV unit that sends out a 0.51 ULP result
    on cycle 12 and can send out the 0.500 ULP result on cycle 17,
    and fixing the intermediate damage--handing off to an interrupt
    handler is "not that bad".


            Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Wed Mar 19 23:28:27 2025
    On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:

    In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    So that you can have some mechanism for building critical
    sections that span multiple instructions, where just having
    access to atomics and spin locks isn't enough.

    For what kinds of activities are these critical sections ??

    For example, I can schedule a DPC/softIRQ without using a critical
    section (and so can even user threads without privilege, given
    the MMU allows). So, we need to dive through the "rest of the
    available mechanics" -- keyword available !

    There are mechanisms available that readers may not be aware of
    or quite grasp how a feature can be used.

    So it comes back to:: For what kinds of activities are these
    critical sections ?? We know about DPC/softIRQs, we know
    about atomic events around core scheduling, and inter-{thread,
    process, core} communication. Others ?

    When running at any privilege, do you not want to accept an interrupt
    at higher priority with as little delay as possible ??

    Usually the assumption is that the critical section is short; a
    handful of instructions, perhaps, so that the cost of blocking a
    high priority interrupt is low.

    I think you are talking about critical sections that are not allowed
    to fail or be interfered with:: Yes ?

    Does Interrupt Disable help in this goal ??

    This is predicated on an incomplete view of the situation and
    the intersection between hardware and software. Serving high
    priority interrupts quickly is not the only consideration.

    Is there a SANE way to avoid disabling of interrupts ??

    It depends on your definition of "sane" in this context. One
    may imagine redesigning all of a system's locking primitives,
    perhaps, to avoid the need to block interrupts in most cases.
    But that pushes a lot of complexity onto software for dubious
    benefit, and even then, it's not clear to me that they can be
    blocked in _all_ cases.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Mar 26 09:55:36 2025
    Leaving aside the unchecked boundary conditions I highlighted
    above, there are other questions that this raises. You
    mentioned that the ESM stuff can use up to 8 cache lines in a
    single atomic "event"; let's count how many this code may
    encounter:

    1. `from` may point to a cache line
    2. `to` may point to a different cache line.
    3. Does your architecture use a stack? What sort of alignment
    criteria do you impose? At first blush, it seems to me that
    the pointers `from_next`, `from_prev`, and `to_next` could be
    on the stack and if so, will be on at least one cache line,
    and possibly 2, if the call frame for `MoveElement` spans
    more than one. Of course, it's impossible to say when this
    is compiled, but maybe you require that the stack is always
    aligned to a cache line boundary or something. Or maybe
    these are all in registers and it's fine.

    AFAIK the stack accesses will use normal loads that won't be tracked for conflicts, so they don't count towards the 8 cache-lines limit.

    4. Depending on the size and alignment criteria of the
    `Element` structure and the placement of the `next` and
    `prev` elements in the struct (unknowable from this example),
    `to->next` may be on another cache line.

    The coder may need to be careful with such slot placement and
    alignment, indeed.

    So potentially this code _could_ touch 9 or 10 cache lines, more
    than you said are supported for a single atomic event. What
    happens in that case? Does the code just always return false?

    I believe the code will indeed always return false in that case.

    If I understand this correctly, the `esmLOCKload`s are simply
    loads, but they make sure that the compiler emits instructions
    that tie into your atomic support framework. Earlier you wrote, "esmINTERFERENCE is a query on the state of the event and whether
    a 2nd party has performed a memory action that kills the event."
    Ok, I take this to mean _at the time of the store_, which
    appears later in the program.

    No, I believe it is at the time when `esmINTERFERENCE` is called.

    If it were simply a query at the time it was written in the code, then
    it seems like it opens a TOCTOU bug (what if something interfers after
    the check, but before the the locked store at the end?).

    IIUC the way this works is that in case there's a conflict later, the
    other core will get a NACK (i.e. we'll win and they'll lose, even if
    they have a higher priority) so we're guaranteed that we'll finish our transaction successfully. Of course, that can be guaranteed only if we
    already have "locked" the cache lines we'll need, which is what the `esmLOCKprefetch` calls are about.

    IOW the `esmINTERFERENCE` call is the time where the CPU decides whether
    to "commit", and then the code between `esmINTERFERENCE` and
    `esmLOCKstore` is the code which actually stores the changes we
    decided to commit.

    IIUC, the `esmINTERFERENCE` call is optional; if it is omitted, the
    operation should retry as many times as needed to succeed.

    Assuming nothing has poisoned the event, the various pointer
    stores inside the conditional are conditionally performed; the
    `esmLOCKstore` to `from->next` commits the event, at which
    point all side effects are exposed and durable.

    The `esmLOCKstore` marks the end of the transaction. IIUC it can also
    mark the "commit" but only if `esmINTERFERENCE` isn't used.

    So now I, as a systems software person, have to ask design
    question around how I want to use these architectural features,
    if at all, weighing various pros and cons. This is where I'm
    struggling with your proposal: given that I'm going to have to
    implement more general support for the things that the
    architecture gives me anyway, what do I gain by also supporting
    the specialized mechanisms?

    I think the expectation is that you'd *use* those specialized mechanisms
    but you wouldn't *support* them.

    Right now, I'm not seeing that it is. "It solves priority
    inversion for the size of problems that it's applicable to" is
    not enough to justify the additional cost of adoption. I have
    to solve that problem anyway, whether I use this or not.

    The ESM stuff is what you'll use at the assembly-level instead of T&S,
    C&S, LL/SC because that's what the CPU supports/provides. But if you
    prefer to code against T&S, C&S, LL/SC, AFAIK it'll work as well,
    because you can "simulate" them easily with ESM.
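
    For instance, a test-and-set in the same style as the xcas/xswap and
    Spinlock sketches above (again taking the esmLOCK* intrinsics on faith):

        uint32_t
        test_and_set(volatile Spinlock *lk)
        {
            uint32_t old = esmLOCKload(&lk->lock);
            esmLOCKstore(&lk->lock, 1);   /* store ends/commits the event */
            return old;                   /* 0 means we took the lock */
        }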


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to monnier@iro.umontreal.ca on Wed Mar 26 14:57:12 2025
    In article <jwv4izfncov.fsf-monnier+comp.arch@gnu.org>,
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    Leaving aside the unchecked boundary conditions I highlighted
    above, there are other questions that this raises. You
    mentioned that the ESM stuff can use up to 8 cache lines in a
    single atomic "event"; let's count how many this code may
    encounter:

    1. `from` may point to a cache line
    2. `to` may point to a different cache line.
    3. Does your architecture use a stack? What sort of alignment
    criteria do you impose? At first blush, it seems to me that
    the pointers `from_next`, `from_prev`, and `to_next` could be
    on the stack and if so, will be on at least one cache line,
    and possibly 2, if the call frame for `MoveElement` spans
    more than one. Of course, it's impossible to say when this
    is compiled, but maybe you require that the stack is always
    aligned to a cache line boundary or something. Or maybe
    these are all in registers and it's fine.

    AFAIK the stack accesses will use normal loads that won't be tracked for
    conflicts, so they don't count towards the 8 cache-lines limit.

    Ok, good to go.

    4. Depending on the size and alignment criteria of the
    `Element` structure and the placement of the `next` and
    `prev` elements in the struct (unknowable from this example),
    `to->next` may be on another cache line.

    The coder may need to be careful with such slot placement and
    alignment, indeed.

    Yup. This could easily happen. Consider,

    #include <stdalign.h>
    #include <stdint.h>
    #include <stdio.h>

    struct elem {
        struct elem *next;
        struct elem *prev;
        uint32_t data;
    };

    struct elem arr[40];

    int
    main()
    {
        printf("sizeof elem: %zu\n", sizeof(struct elem));
        printf("alignof elem: %zu\n", alignof(struct elem));
        printf("sizeof(arr): %zu\n", sizeof(arr));
        return 0;
    }

    On, say, x86_64 with the SVR4 ABI, this has alignment 8 and
    sizeof 24; presumably cache line size is a power of two, so
    some of the elements of that array are going to spill over a
    cache line.
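
    One (hedged) way to avoid the straddle, assuming a 64-byte cache line,
    is a revised layout that forces each element onto its own line,
    trading memory for a predictable participation count:

        #include <stdalign.h>
        #include <stdint.h>

        struct elem {
            alignas(64) struct elem *next;  /* member alignment forces the  */
            struct elem *prev;              /* struct to 64-byte alignment, */
            uint32_t data;                  /* so sizeof rounds up to 64    */
        };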

    So potentially this code _could_ touch 9 or 10 cache lines, more
    than you said are supported for a single atomic event. What
    happens in that case? Does the code just always return false?

    I believe the code will indeed always return false in that case.

    That doesn't seem great. Surely there must be some way to
    detect that the code did not meet the criteria for an atomic
    event? What is the software recourse for recovery in this case?

    If I understand this correctly, the `esmLOCKload`s are simply
    loads, but they make sure that the compiler emits instructions
    that tie into your atomic support framework. Earlier you wrote,
    "esmINTERFERENCE is a query on the state of the event and whether
    a 2nd party has performed a memory action that kills the event."
    Ok, I take this to mean _at the time of the store_, which
    appears later in the program.

    No, I believe it is at the time when `esmINTERFERENCE` is called.

    If it were simply a query at the time it was written in the code, then
    it seems like it opens a TOCTOU bug (what if something interfers after
    the check, but before the the locked store at the end?).

    IIUC the way this works is that in case there's a conflict later, the
    other core will get a NACK (i.e. we'll win and they'll lose, even if
    they have a higher priority) so we're guaranteed that we'll finish our
    transaction successfully. Of course, that can be guaranteed only if we
    already have "locked" the cache lines we'll need, which is what the
    `esmLOCKprefetch` calls are about.

    IOW the `esmINTERFERENCE` call is the time where the CPU decides whether
    to "commit", and then the code between `esmINTERFERENCE` and
    `esmLOCKstore` is the code which actually stores the changes we
    decided to commit.

    Curious. So what happens if a higher priority interrupt comes
    in after the `esmINTERFERENCE` call, but before `esmLOCKstore`?

    Earlier, Mitch wrote:

    |My architecture has a mechanism to perform ATOMIC stuff over multiple
    |instruction time frames that has the property that a higher priority
    |thread which interferes with a lower priority thread, the HPT wins
    |and the LPT fails its ATOMIC event. It is for exactly this reason
    |that I drag priority through the memory hierarchy--so that real-time
    |things remain closer to real-time (no or limited priority inversions).

    This implies to me that the higher-prio interrupt would take
    precedence, preempting the lower-priority critical section, and
    the hardware would reuse the branch-prediction repair facility
    to undo the effects of the atomic event before the store. But
    what you wrote suggests that instead, perhaps, the higher
    priority interrupt would be deferred until the store, in which
    case, what happens if the code does something goofy like,

    if (!esmINTERFERENCE())
        for (;;) { /* Priority inversion */ }

    I suppose this hinges on what it means for a thread to "fail its
    ATOMIC event".

    The thing that makes the most sense to me, given what you said
    about the hardware automatically retrying, is that the point of
    failure would be at the store. If that fails, the entire thing
    is rewound and restarted from the first `esmLOCKload` since the
    last `esmLOCKstore`; if `esmINTERFERENCE` returns true, it's
    because `esmLOCKstore` has already been attempted and failed.

    That would address the, "bonehead spins forever with interrupts
    deferred" thing.

    IIUC, the `esmINTERFERENCE` call is not needed, in which case the
    operation should retry as many times as needed to succeed.

    Ok. So,

    void
    spinlock_acquire(volatile Spinlock *lk)
    {
        while (esmLOCKload(&lk->lock) != 0)
            relax();
        esmLOCKstore(&lk->lock, 1);
    }

    Should be sufficient?
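
    And presumably a matching release is just a store of zero (whether a
    plain store suffices, or an esmLOCK form is preferred, is part of the
    same question):

        void
        spinlock_release(volatile Spinlock *lk)
        {
            lk->lock = 0;
        }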

    Assuming nothing has poisoned the event, the various pointer
    stores inside the conditional are conditionally performed; the
    `esmLOCKstore` to `from->next` commits the event, at which
    point all side effects are exposed and durable.

    The `esmLOCKstore` marks the end of the transaction. IIUC it can also
    mark the "commit" but only if `esmINTERFERENCE` isn't used.

    So now I, as a systems software person, have to ask design
    question around how I want to use these architectural features,
    if at all, weighing various pros and cons. This is where I'm
    struggling with your proposal: given that I'm going to have to
    implement more general support for the things that the
    architecture gives me anyway, what do I gain by also supporting
    the specialized mechanisms?

    I think the expectation is that you'd *use* those specialized mechanisms
    but you wouldn't *support* them.

    Support in the sense that if I have software that uses them, I
    need to support that software.

    Right now, I'm not seeing that it is. "It solves priority
    inversion for the size of problems that it's applicable to" is
    not enough to justify the additional cost of adoption. I have
    to solve that problem anyway, whether I use this or not.

    The ESM stuff is what you'll use at the assembly-level instead of T&S,
    C&S, LL/SC because that's what the CPU supports/provides. But if you
    prefer to code against T&S, C&S, LL/SC, AFAIK it'll work as well,
    because you can "simulate" them easily with ESM.

    The properties you mentioned earlier vis automatic retries and
    code density sound attractive. The rest surrounding it, such as
    the hardware's notion of a thread, priority, etc, less so. On
    balance I remain skeptical that the tradeoff is worth it.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Mar 26 15:41:02 2025
    uint32_t
    xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
    {
        uint32_t v = esmLOCKload(&lk->lock);
        if (v == expected) esmLOCKstore(&lk->lock, val);
        return v;
    }

    Yes, unless you want the alternate exit strategy, this works fine.

    How does ESM know that the "event" is complete when the code doesn't go
    through the final `esmLOCKstore`?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Mar 26 19:42:09 2025
    On Wed, 26 Mar 2025 2:58:08 +0000, Stefan Monnier wrote:

    Even so, I don't see the point of the proposed atomic framework.
    It seems to have no significant advantage over a more
    traditional scheme; CAS, LL-SC, whatever seem more general and
    less intertwined with primitives that the hardware has no real
    business injecting itself into (like threads).

    AFAICT it's somewhat similar to ARM's LL/SC (which can also cover some
    number of cache lines) except that it takes care of looping for you: if
    the final store fails because of interference, My 66000 will "simply" abort
    the whole operation (reusing the speculation hardware) and start over,
    instead of having to write an explicit loop.

    Without actual hardware and software to try it's hard to assess how well
    it's going to work, but the design sounds good: it should lead to more compact code than LL/SC, be more general/friendly to lock-free coding
    than T&S/C&S, and also compared to LL/SC it should behave better under
    high contention because it is careful to guarantee overall forward
    progress (in case of conflict between two cores, at least one of them
    can finish successfully).

    IIRC there's also some kind of mechanism to retry more cleverly, i.e. so
    that when N cores are in conflicts, not only one of them succeeds, but
    the N-1 remaining cores have a fighting chance of avoiding bumping into
    each other next time around. I don't know the details, tho, and IIUC
    this is not automatic: some relevant info is provided but it's up to the
    code to do something useful with it.

    ESM operates in 3 modes {optimistic, Greedy, Methodological}

    In the optimistic mode, core tries to execute the instructions as
    if there were no ATOMIC activities and if it succeeds, then these
    ATOMIC instructions took no more time than if they were not ATOMIC
    at all.

    Greedy:: if there is a failure in optimistic mode, the core switches
    to Greedy Mode, applies sequential consistency to memory accesses,
    and is allowed to NaK lower priority SNOOPs if it has decoded all
    the way to the esmLOCKstore(). A second SNOOP to an already SNOOPed
    cache line fails the event and switches mode to methodological.

    Methodological: When Greedy fails, core switches to Methodological.
    The addresses of the participating lines are bundled up into a
    message to the system arbitor, who uses a TLB-like structure to
    GRANT exclusive access to up to 8 cache lines, returning a message
    to core. That message contains a count of cores ahead of this one
    in contention. SW can use count to proactively decrease interference.
    At the end of the event (both success and fail) the same message
    is sent to the arbitor to remove these lines from its grant table.
    Grants are en masse and deGrants are en masse.

    After any success, ESM reverts to Optimistic. SW does not see these
    modes in operation.

    After any context-switch ESM reverts to Optimistic.

    When SW arrives at a fail point, SW can read the WHY register (16 bits):
    positive numbers are a count of events in front of this one, negative
    numbers are an enumeration of spurious failures, zero is success.
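
    (A sketch of how software might use that: esmWHY() and backoff() below
    are made-up names standing in for "read the WHY register" and "do
    something useful with the count"; they are not part of the description
    above.)

        void
        add_to_counter(volatile uint64_t *ctr, uint64_t n)
        {
            for (;;) {
                uint64_t v = esmLOCKload(*ctr);
                if (!esmINTERFERENCE()) {
                    esmLOCKstore(*ctr, v + n);  /* success: WHY reads 0 */
                    return;
                }
                int16_t why = esmWHY();
                if (why > 0)
                    backoff(why);   /* that many events are ahead of us */
                /* why < 0: spurious failure; retry immediately */
            }
        }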


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Mar 26 15:50:14 2025
    Remember only the lines participating in the event count. For
    example, say one had sprintf() inside the event. None of the
    lines touched by sprintf() are participating in the event,
    and are not counted against the 8 cache lines available.

    Hmm... so if I need to add a "participating line" to which I need to
    store but that I don't need to read, I still need to do a "useless" `esmLOCKload` before the (normal) store, just to mark the line as
    participating in the event?
    I guess it's a rather rare need.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Wed Mar 26 19:21:38 2025
    On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:

    In article <b1c74762d01f71cc1b8ac838dcf6d4fa@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    -----------------------

    You (well Chris) can write critical sections where a low priority
    holder can give way to the higher priority requestor such that
    the low priority holder reverts back to before it entered CS,
    and does not see the lock being stolen.

    You still have not defined what you mean by, "the lock being
    stolen."

    Tell you what. Earlier, you posted some code for a critical
    section that inserts a node into a linked list. That code is
    written in terms of some primitives (presumably intrinsics,
    though you call them macros); `esmLOCKload`, `esmLOCKprefetch`, `esmINTERFERENCE` and `esmLOCKstore`. Here's that code, slightly reformatted:

    // Removes the element `from` from a doubly linked list, and
    // inserts it into a (possibly different) doubly linked list,
    // after the element pointed to by `to`.
    // Assumes <stdbool.h>
    bool
    MoveElement(Element *from, Element *to)
    {
        assert(from != NULL);
        assert(to != NULL);

        Element *from_next = esmLOCKload(from->next);
        Element *from_prev = esmLOCKload(from->prev);
        Element *to_next = esmLOCKload(to->next);
        esmLOCKprefetch(from_next);
        esmLOCKprefetch(from_prev);
        esmLOCKprefetch(to_next);
        if (!esmINTERFERENCE()) {
            // Unlink the "from" node from its list.
            from_prev->next = from_next;    // XXX: `from_prev` may be nil?
            from_next->prev = from_prev;    // XXX: `from_next` may be nil?
            // Insert the node after "to".
            to->next = from;
            to_next->prev = from;           // XXX: `to_next` may be nil?
            from->prev = to;
            esmLOCKstore(from->next, to_next);
            return true;
        }
        return false;
    }

    Leaving aside the unchecked boundary conditions I highlighted
    above, there are other questions that this raises. You
    mentioned that the ESM stuff can use up to 8 cache lines in a
    single atomic "event"; let's count how many this code may
    encounter:

    1. `from` may point to a cache line
    2. `to` may point to a different cache line.
    3. Does your architecture use a stack?

    No

    What sort of alignment criteria do you impose?

    On the participating data none. On the cache line--cache line.

    At first blush, it seems to me that
    the pointers `from_next`, `from_prev`, and `to_next` could be
    on the stack and if so, will be on at least one cache line,
    and possibly 2, if the call frame for `MoveElement` spans
    more than one.

    The illustrated code is using 6 participating cache lines.
    Where local variables are kept (stack, registers) does not
    count against the tally of participating cache lines.

    Of course, it's impossible to say when this
    is compiled, but maybe you require that the stack is always
    aligned to a cache line boundary or something.

    Data stack is DoubleWord aligned at all times.

    Or maybe
    these are all in registers and it's fine.

    Local variables are not participating in the event--they are just
    around to guide the event towards conclusion.

    4. Depending on the size and alignment criteria of the
    `Element` structure and the placement of the `next` and
    `prev` elements in the struct (unknowable from this example),
    `to->next` may be on another cache line.
    5. Similarly, `from->next`
    6. And `from->prev`
    7. And `from_prev->next`
    8. And `from_next->prev`
    9. And `to_next->prev`.

    So potentially this code _could_ touch 9 or 10 cache lines, more
    than you said are supported for a single atomic event. What
    happens in that case? Does the code just always return false?
    The programmer best make sure the optimizer is keeping those
    temporary pointers off the stack; good luck with a debug build.

    Anyway, let's assume that's not the case, and we don't hit the
    pathological case with cache lines, and we're not dealing with
    edge cases at the front or end of the lists (so various next or
    prev pointers are not NULL). Why don't I try to explain what I
    think this code is doing, and you tell me whether I'm right or
    wrong.

    If I understand this correctly, the `esmLOCKload`s are simply
    loads, but they make sure that the compiler emits instructions
    that tie into your atomic support framework.

    So far so good.

Earlier you wrote, "esmINTERFERENCE is a query on the state of the event and whether
    a 2nd party has performed a memory action that kills the event."
    Ok, I take this to mean _at the time of the store_, which
    appears later in the program. If it were simply a query at the
    time it was written in the code, then it seems like it opens a
TOCTOU bug (what if something interferes after the check, but
before the locked store at the end?). Or perhaps it signals
    to the architecture that this is the point to which the
    processor should rollback, if anything subsequently fails; sort
    of a `setjmp` kind of thing that conceptually returns twice.

    The "if( esmINTERFERENCE() )" not only queries the state of the
    participating cache lines, it sets up a guarding-shadow over
    the STs to follow such that if killing-interference is detected
    none of the STs are performed AND control is transferred to the
    branch label (outside of the event). So not only does it query
the state at its execution, it monitors that over the rest of
    the event.

    BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
    is that if we have reached the manifestation phase of the event
    (that is ready to do all the stores) then we should be able to
    drive the event forward by not sending the data in response to
the SNOOP. We send either a NaK or the data, based on priority.

    If the SNOOPing core receives a NaK and it is not attempting an
    event the request is simply reissued. If it is attempting, then
    its event fails. So a core benignly accessing data that another
    core is attempting an event upon, is delayed by an interconnect
    round trip--but if it is attempting, it knows its event will
    fail and it will either retry from before the event started,
or it will goto the interference-label where other facilities
are at its disposal.

    Assuming nothing has poisoned the event, the various pointer
    stores inside the conditional are conditionally performed; the
    `esmLOCKstore` to `from->next` commits the event, at which
    point all side effects are exposed and durable.

    From a SW perspective, that is accurate enough.

    From a HW perspective, the participating STs are performed and
    held in a buffer, and when it is known the event has succeeded,
    a bit is flipped and these writes to memory all appear to have
    been performed in the same cycle. One cycle earlier and no
    interested 2nd party has seen any of them, the subsequent cycle
    any interested 2nd party will see them all.

    I should also note: ESM works even when there are no caches in
    the system. ...


    Is that correct? Let's suppose that it is.

    But later you wrote,

    So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
    normal ATOMIC primitives everyone and his brother provide.

    Well, what precisely do you mean here when you say, "if you want
    the property whereby the lock disappears on any control transfer
    out of the event"?

    If you want that property--you use the tools at hand.
If you don't, just use them as primitive generators.

    What I _hope_ you mean is that if you transfer "out of the
    event" before the `esmLOCKstore` then anything that's been done
    since the "event" started is rolled back, but NOT if control
    transfer happens _after_ the `esmLOCKstore`. If THAT were the
    case, then this entire mechanism is useless.

    One can use a esmLOCKstore( *SP, 0 ) to abandon an event. HW
    detects that *SP is not participating in the event, and uses
    the lock bit in the instruction to know it is time to leave
    the event in the failed manner. No ST is performed to the
    non-participating cache line. {{I guess I could wrap the
    abandonment into a macro ...}}
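
    A guess at what such a macro might look like, purely as a sketch of
    the idea above; the name esmABANDON is invented here and is not part
    of ESM:

    #include <stdint.h>

    // Hypothetical wrapper: a LOCK-marked store to a line that is not
    // participating in the event -- here the stack line addressed
    // through `sp` -- tells HW to end the event along the failure path;
    // no store is performed to that line.
    #define esmABANDON(sp)   esmLOCKstore( *(uint64_t *)(sp), 0 )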

    Perhaps you would clarify?

    But for the time being, let's assume that `esmLOCKstore` is
    conceptually a commit that ends the atomic event, and the store
    is durable and observable after that. Let's see if we can
    construct a simple spinlock from these primitives.

    Notice that only the last ST has the esmLOCKstore() applied.
    HW knows that STs to participating lines have the ATOMIC
    property. The final ST uses the LOCK bit to denote the end
    of the event.


    typedef struct Spinlock Spinlock;
    struct Spinlock {
        uint32_t lock;
    };

    uint32_t
    xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
    {
        uint32_t v = esmLOCKload(&lk->lock);
        if (v == expected && !esmINTERFERENCE()) {
            esmLOCKstore(&lk->lock, val);
        }
        return v;
    }

    If the interference check is desired at all:: (see below)
    I think you should move interference check in front of the
    expected check, but that is a first impression--have to think
    about it more.

    void
    spin_acquire(volatile Spinlock *lk)
    {
        splhi();
        while (xcas(lk, 0, 1) != 0)
            relax();  // Spin loop intrinsic.
    }

    void
    spin_release(volatile Spinlock *lk)
    {
        xcas(lk, 1, 0);
        // Maybe we should `spllo` here, maybe not. Either choice
        // reflects lots of assumptions about whether or not we might
        // take and hold multiple locks concurrently, etc.
        spllo();
    }

    Does this seem correct? It is not immediately clear to me
    whether `esmINTERFERENCE` is actually needed here, but
    avoiding the store seemed fine.

    There are MANY cases where the interference query is not needed.
    Without the interference query, deadly interference backs the
    core up to the first instruction of the event. With the interference
    check, if deadly interference happens before the query, control
    reverts to before the event, and afterwards, control is transferred
    to the interference-label.
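
    To make that concrete, here is a minimal sketch under the assumption
    that the esmLOCK* macros behave as described above; `Node` and `push`
    are invented names, not My 66000 primitives. Without the query,
    deadly interference just replays from the first esmLOCKload; with it,
    the failure path below is explicit:

    #include <stdbool.h>

    typedef struct Node Node;
    struct Node { Node *next; };

    bool
    push(Node **head, Node *n)
    {
        Node *old = esmLOCKload(*head);   // head's line participates
        esmLOCKprefetch(n);               // n's line participates too
        if (!esmINTERFERENCE()) {
            n->next = old;                // held until the event commits
            esmLOCKstore(*head, n);       // LOCK bit marks the end of the event
            return true;
        }
        return false;                     // interference-label: caller decides
    }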

One could imagine shortening
    `xcas` to simply:

    uint32_t
    xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
    {
        uint32_t v = esmLOCKload(&lk->lock);
        if (v == expected) esmLOCKstore(&lk->lock, val);
        return v;
    }

    Yes, unless you want the alternate exit strategy, this works fine.
    In fact test and set is written::

    bool TestAndSet( char *p )
    {
        bool t = esmLOCKload( *p );
        esmLOCKstore( *p, 1 );
        return t;
    }

    Should interference transpire between load and store control
    reverts back to the LD again, so the return value is the LD
    which did not encounter interference.
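
    (As an aside: a hypothetical spin lock built on that TestAndSet, with
    relax() as the spin-loop intrinsic used earlier, might look like this.)

    void lock( char *l )
    {
        while (TestAndSet( l ))
            relax();     // spin until the flag was observed clear
    }

    void unlock( char *l )
    {
        *l = 0;          // a plain store releases the flag
    }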

    or perhaps:

    uint32_t
    xswap(volatile Spinlock *lk, uint32_t val)
    {
        uint32_t v = esmLOCKload(&lk->lock);
        esmLOCKstore(&lk->lock, val);
        return v;
    }

    But that all depends on the specifics of the atomics.

For the sizes of CSs this covers--it prevents priority inversion.

    Except you've admitted that this is not a general solution, as
    it cannot "cover" arbitrary software-defined critical sections:
    the window isn't big enough and we've seen how even a fairly
    small problem may exceed the number of cache lines supported by
    one of these events.

    Remember only the lines participating in the event count. For
    example, say one had sprintf() inside the event. None of the
lines touched by sprintf() are participating in the event,
    and are not counted against the 8 cache lines available.
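
    A minimal sketch of that point, assuming the esmLOCK* macros behave
    as described; `Counter`, `update_and_log`, and `logbuf` are invented
    for illustration. Only the lines touched through the macros
    participate; the lines sprintf() touches do not:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint64_t value; } Counter;

    char logbuf[64];

    bool
    update_and_log(Counter *c)
    {
        uint64_t v = esmLOCKload(c->value);                 // participating line
        sprintf(logbuf, "old=%llu", (unsigned long long)v); // not participating
        esmLOCKstore(c->value, v + 1);                      // commit the event
        return true;
    }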
    ---------------------
    So now I, as a systems software person, have to ask design
    question around how I want to use these architectural features,
    if at all, weighing various pros and cons. This is where I'm
    struggling with your proposal: given that I'm going to have to
    implement more general support for the things that the
    architecture gives me anyway, what do I gain by also supporting
    the specialized mechanisms?

All I can say about this is that when I presented ASF (ESM's
predecessor) to Microsoft and spent a couple of hours explaining
to them how it works, MS wanted it implemented as rapidly as
AMD could pull it off. Dave Cutler was particularly impressed.
----------------------
    And here I thought you had an imagination .....

    You admitted you're not an OS person and asked other people to
    poke holes in your proposal. I'm sorry if you find the holes
    I'm poking upsetting.

    I enjoy your poking of holes--I really do

    But you seem like the calculus student that keeps asking what
    problem can I solve now that I can differentiate.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Mar 26 21:35:01 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:

    In article <b1c74762d01f71cc1b8ac838dcf6d4fa@www.novabbs.org>,


    Earlier you wrote,
    "esmINTERFERENCE is a query on the state of th event and whether
    a 2nd party has performed a memory action that kills the event."
    Ok, I take this to mean _at the time of the store_, which
    appears later in the program. If it were simply a query at the
    time it was written in the code, then it seems like it opens a
    TOCTOU bug (what if something interfers after the check, but
    before the the locked store at the end?). Or perhaps it signals
    to the architecture that this is the point to which the
    processor should rollback, if anything subsequently fails; sort
    of a `setjmp` kind of thing that conceptually returns twice.

    The "if( esmINTERFERENCE() )" not only queries the state of the
    participating cache lines, it sets up a guarding-shadow over
    the STs to follow such that if killing-interference is detected
    none of the STs are performed AND control is transferred to the
    branch label (outside of the event). So not only does it query
    the state at it execution, it monitors that over the rest of
    the event.

    BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
    is that if we have reached the manifestation phase of the event
    (that is ready to do all the stores) then we should be able to
    drive the event forward by not sending the data in response to
    the SNOOP. We send a Nak (based on priority or the data)

    If you can NAK a snoop, are you then assuming that all DMA is
    non-coherent? That will really piss off the kernel folks.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Mar 26 23:22:35 2025
    On Wed, 26 Mar 2025 21:35:01 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:

    In article <b1c74762d01f71cc1b8ac838dcf6d4fa@www.novabbs.org>,


    Earlier you wrote,
    "esmINTERFERENCE is a query on the state of th event and whether
    a 2nd party has performed a memory action that kills the event."
    Ok, I take this to mean _at the time of the store_, which
    appears later in the program. If it were simply a query at the
    time it was written in the code, then it seems like it opens a
    TOCTOU bug (what if something interfers after the check, but
    before the the locked store at the end?). Or perhaps it signals
    to the architecture that this is the point to which the
    processor should rollback, if anything subsequently fails; sort
    of a `setjmp` kind of thing that conceptually returns twice.

    The "if( esmINTERFERENCE() )" not only queries the state of the >>participating cache lines, it sets up a guarding-shadow over
    the STs to follow such that if killing-interference is detected
    none of the STs are performed AND control is transferred to the
    branch label (outside of the event). So not only does it query
    the state at it execution, it monitors that over the rest of
    the event.

    BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
    is that if we have reached the manifestation phase of the event
    (that is ready to do all the stores) then we should be able to
    drive the event forward by not sending the data in response to
    the SNOOP. We send a Nak (based on priority or the data)

    If you can NAK a snoop, are you then assuming that all DMA is
    non-coherent? That will really piss off the kernel folks.

    Not at all. DMA is as coherent as the I/O MMU's translating
    PTE prescribes.

    But; why would any sane coder be performing an ATOMIC event on
    memory that can be blindly overwritten by DMA ??? Or for that
matter, critical section data being overwritten (at some unknown
    and unpredictable time) by DMA ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Mar 26 23:16:19 2025
    On Wed, 26 Mar 2025 19:50:14 +0000, Stefan Monnier wrote:

    Remember only the lines participating in the event count. For
    example, say one had sprintf() inside the event. None of the
    lines touched by sprint() are participating in the event,
    and are not counted against the 8 cache lines available.

    Hmm... so if I need to add a "participating line" to which I need to
    store but that I don't need to read, I still need to do a "useless" `esmLOCKload` before the (normal) store, just to mark the line as participating in the event?

Use esmLOCKprefetch() for bringing data towards the cache without
wasting a register to hold its value.

    I guess it's a rather rare need.

    Rare enough it was illustrated in my very first example.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Mar 26 23:48:17 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 26 Mar 2025 21:35:01 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:

    In article <b1c74762d01f71cc1b8ac838dcf6d4fa@www.novabbs.org>,


    Earlier you wrote,
    "esmINTERFERENCE is a query on the state of th event and whether
    a 2nd party has performed a memory action that kills the event."
    Ok, I take this to mean _at the time of the store_, which
    appears later in the program. If it were simply a query at the
    time it was written in the code, then it seems like it opens a
    TOCTOU bug (what if something interfers after the check, but
    before the the locked store at the end?). Or perhaps it signals
    to the architecture that this is the point to which the
    processor should rollback, if anything subsequently fails; sort
    of a `setjmp` kind of thing that conceptually returns twice.

    The "if( esmINTERFERENCE() )" not only queries the state of the >>>participating cache lines, it sets up a guarding-shadow over
    the STs to follow such that if killing-interference is detected
    none of the STs are performed AND control is transferred to the
    branch label (outside of the event). So not only does it query
    the state at it execution, it monitors that over the rest of
    the event.

    BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
    is that if we have reached the manifestation phase of the event
    (that is ready to do all the stores) then we should be able to
    drive the event forward by not sending the data in response to
    the SNOOP. We send a Nak (based on priority or the data)

    If you can NAK a snoop, are you then assuming that all DMA is
    non-coherent? That will really piss off the kernel folks.

    Not at all. DMA is as coherent as the I/O MMU's translating
    PTE prescribes.

    But; why would any sane coder be performing an ATOMIC event on
    memory that can be blindly overwritten by DMA ??? Or for that
    mater, critical section data being overwritten (at some unknown
    and unpredictable time) by DMA ??

    Polling on a completion status which the device DMAs
    (often with an allocate hint) is not uncommon. A 32-bit
    word in memory updated by the device - the allocate hint
    might update the LLC directly and invalidate L1/L2 or
    may update the target cache line that the driver is
    polling directly.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Mar 27 00:18:14 2025
    On Wed, 26 Mar 2025 23:48:17 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 26 Mar 2025 21:35:01 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:

    In article <b1c74762d01f71cc1b8ac838dcf6d4fa@www.novabbs.org>,


    Earlier you wrote,
    "esmINTERFERENCE is a query on the state of th event and whether
    a 2nd party has performed a memory action that kills the event."
    Ok, I take this to mean _at the time of the store_, which
    appears later in the program. If it were simply a query at the
    time it was written in the code, then it seems like it opens a
    TOCTOU bug (what if something interfers after the check, but
    before the the locked store at the end?). Or perhaps it signals
    to the architecture that this is the point to which the
    processor should rollback, if anything subsequently fails; sort
    of a `setjmp` kind of thing that conceptually returns twice.

    The "if( esmINTERFERENCE() )" not only queries the state of the >>>>participating cache lines, it sets up a guarding-shadow over
    the STs to follow such that if killing-interference is detected
    none of the STs are performed AND control is transferred to the
    branch label (outside of the event). So not only does it query
    the state at it execution, it monitors that over the rest of
    the event.

    BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
    is that if we have reached the manifestation phase of the event
    (that is ready to do all the stores) then we should be able to
    drive the event forward by not sending the data in response to
    the SNOOP. We send a Nak (based on priority or the data)

    If you can NAK a snoop, are you then assuming that all DMA is
    non-coherent? That will really piss off the kernel folks.

    Not at all. DMA is as coherent as the I/O MMU's translating
    PTE prescribes.

    But; why would any sane coder be performing an ATOMIC event on
    memory that can be blindly overwritten by DMA ??? Or for that
    mater, critical section data being overwritten (at some unknown
    and unpredictable time) by DMA ??

    Polling on a completion status which the device DMAs
    (often with an allocate hint) is not uncommon. A 32-bit
    word in memory updated by the device - the allocate hint
    might update the LLC directly and invalidate L1/L2 or
    may update the target cache line that the driver is
    polling directly.

    Why do this ATOMICally ?? why not just read it ?? Or is this a
    grab (read and modify so someone else doesn't get it.) ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stefan Monnier on Thu Mar 27 10:55:08 2025
    Stefan Monnier wrote:
    a going to make a big difference here. But I also wonder at the
    internal implementation of these instructions in e.g. that
    Fujitsu chip that allows throughput to stay more or less
    constant as the number of cores increases; I'd think that cache
    coherency traffic as threads contend for a single cache line
    would create a bottleneck leading to (at least) a linear
    slow down.

    IIUC the idea is that the atomic-add is sent to the memory and performed there, instead of moving the data to the cache. So there's no
    contention for the cache line because the data is not in a cache at all.

    This one I have never understood the use/need for, except to provide
    global shared counters of some sort:

    In that case it would make perfect sense to allow any core to send (and immediately forget) a message "increment counter X by 1", but at least
    in my own coding, any LOCK XADD has been performed because I now need to
    do something with either the previous or the updated value of the counter.
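
    Sketching the two patterns being contrasted, using C11 atomics purely
    to show the semantics (not the hardware mechanism):

    #include <stdatomic.h>

    atomic_ulong counter;

    // "Send and forget": nobody needs the result, so the add could be
    // pushed out to the memory controller and left there.
    void bump(void)
    {
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    }

    // LOCK XADD style: the caller needs the value, so the result has to
    // come back, bounding the operation by a round trip.
    unsigned long take_ticket(void)
    {
        return atomic_fetch_add(&counter, 1);
    }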

    Having the atomic-add performed by a memory controller, returning the
    result, would make it almost perfectly scalable, but would also limit
    the speed of this operation to the bus transaction speed, even when
    there is zero contention, right?

    I guess if the atomic-add has been implemented at all/most cache levels
    as well, then it would indeed perform better for all workloads, the
    problem is that under contention you still have to flush the variable
    out to the RAM array, right?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Mar 13 16:34:11 2025
    On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:

    Planning on having the Q+ CPU respond directly to MSI interrupts. My
    thought is to integrate an interrupt controller directly into the CPU
    which will have several queues to prioritize interrupts. The controller
    will take care of detecting the highest priority interrupt to process, performing priority inversion occasionally, and detect stuck interrupts.

    My 66000 went all the way to MSI-X interrupts as its base.

    There are mapping methods to map INTx and INTk into MSI-X messages.
    MSI->MSI-X mapping is "just zero extend the message".

    The Interrupt port lives in the memory controller and takes an
    aperture of addresses and maps them to various interrupt tables.

    There is a broadcast mechanism that notifies cores of new higher
    priority level interrupts pending, and a protocol for getting
    an interrupt and notifying the IC that it will be handled by a
    core.

    Along with the 32-bit message, the architecture provides for
    6-bits of priority, 2-bits of SW-stack indexing, and an 8-bit
    ISR handler index number.
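
    Purely as illustration, those fields could be carried in a table
    entry along these lines; the actual My 66000 layout is not given
    here, so the ordering and packing below are assumptions:

    #include <stdint.h>

    typedef struct {
        uint32_t message;        // 32-bit MSI-X data payload
        uint8_t  priority;       // 6 bits used: 64 system priority levels
        uint8_t  sw_stack;       // 2 bits used: which software stack
        uint8_t  isr_index;      // 8-bit ISR handler index
    } irq_entry;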

    Interrupt tables remain active even when the virtual machine
is inactive. When next scheduled, priority ISRs will be serviced
    prior to low priority thread instructions.

    {And some things I can't talk about}

    Setting up an MSI interrupt protocol for Q+.

    The controller will sit on the CPU's main response collect bus.
    Interrupt messages contain the interrupter's bus/device/function + a
    cause code. They may also contain a target core number.

    Planning on distributing the interrupt control to peripheral devices
    instead of having a system level interrupt controller.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Thu Mar 13 18:42:29 2025
    Robert Finch <robfi680@gmail.com> writes:
    On 2025-03-13 12:34 p.m., MitchAlsup1 wrote:
    On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:

    Planning on having the Q+ CPU respond directly to MSI interrupts. My
    thought is to integrate an interrupt controller directly into the CPU
    which will have several queues to prioritize interrupts. The controller
    will take care of detecting the highest priority interrupt to process,
    performing priority inversion occasionally, and detect stuck interrupts.

    My 66000 went all the way to MSI-X interrupts as its base.

    There are mapping methods to map INTx and INTk into MSI-X messages.
    MSI->MSI-X mapping is "just zero extend the message".

    MSI interrupts are the way to go when there are transistors available.
    Just having discovered them not to long ago, I still think they are great.
    I have the luxury of not needing to support legacy interrupt systems
    (yet), all the peripheral cores needing interrupts will have to support
    MSI responses.

    MSI has been deprecated for years. MSI-X is preferred and used by
    most modern devices (and supported by major operating software).


    The Interrupt port lives in the memory controller and takes an
    aperture of addresses and maps them to various interrupt tables.

I have read that MSI devices target particular I/O addresses to send the messages to. Not sure what having a dedicated I/O addresses buys. In my architecture, rather than target an I/O address a specific processing
    core number is specified.

    There are many reasons that MSI-X supports vector specific target addresses.
    1) It is possible to target DRAM (so software can poll for the interrupt)
    2) It is possible to support level-sensitive semantics by using two vectors;
    one to assert the interrupt and the second to deassert it. Each needs
    to target a different interrupt controller register (address).
    3) A system may have multiple interrupt controllers for different
    security states (e.g. monitor/secure vs. hypervisor vs. direct virtual
    to a guest).



    My architecture goes with a 3-bit priority field, and an 8-bit field to
    help resolve the cause of the interrupt. The bus/device/function is also
    part of the interrupt response.

    That's what ARM64 GIC does, each interrupt is accompanied by a 20+ bit
    'device id', which is composed of the segment, bus, dev/func values
    for the sending device. Since each PCIe controller can consume the
    entire bus space potentially, the segment widens the device id space
    to include a large number of devices. The 'device id' is used to
    select the in-memory table used to map the interrupt to the target
    guest and the target guest vector number.

    I think you really want an interrupt controller as an intermediary
    between the device and the CPU interrupt signal, allows a great deal
    of implementation flexibility (e.g. changing the target CPU in the
    interrupt controller tables rather than requiring the device driver
    to change the msix-vector address field).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Mar 13 18:34:32 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:

    Planning on having the Q+ CPU respond directly to MSI interrupts. My
    thought is to integrate an interrupt controller directly into the CPU
    which will have several queues to prioritize interrupts. The controller
    will take care of detecting the highest priority interrupt to process,
    performing priority inversion occasionally, and detect stuck interrupts.

    My 66000 went all the way to MSI-X interrupts as its base.

    There are mapping methods to map INTx and INTk into MSI-X messages. >MSI->MSI-X mapping is "just zero extend the message".

In both cases, a 32-bit data payload (the MSI-X Vector #) is written
    to the MSI-X target address (typically the interrupt controller).

    The difference between them is

    MSI                           MSI-X
    Supports 1 to 32              Supports an unlimited number of
    _contiguous_ data             vectors, each with an unique
    payloads (vector index).      non-contiguous vector index (data
                                  payload) and target address.

    With MSI, only the base interrupt number (data payload) is
    defined and only one interrupt controller target address is
used for all vectors. So vector 0 will send the base number, vector 1
    will send the base number plus one, etc.

    With MSI-X each vector has an assigned data payload (vector
    number) which do not need to be contiguous or monotonically
    increasing. With MSI-X, each interrupt can have a different
    target address (i.e. perhaps a secure interrupt controller
    for some, and a non-secure interrupt controller for others)
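
    For reference, this is exactly what a per-vector MSI-X table entry
    carries in the PCIe-defined table (the field names below are mine;
    the layout is the standard 16-byte entry):

    #include <stdint.h>

    struct msix_table_entry {
        uint32_t msg_addr_lo;   // target address, low 32 bits
        uint32_t msg_addr_hi;   // target address, high 32 bits
        uint32_t msg_data;      // 32-bit data payload (the vector number)
        uint32_t vector_ctrl;   // bit 0: per-vector mask
    };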

    In the ARM case, there are different interrupt types (SGI,
    PPI, SPI, LPI) and each one has a different interrupt controller
    address that the MSI-X vector address should point to. This
    allows the message-based emulation of level sensitive interrupts
    by using two vectors, one to assert the interrupt and a second
    to deassert the interrupt at the interrupt controller.

    So MSI is useless for modern devices and basically deprecated
and cannot be used to support level-sensitive (e.g. legacy PCI)
    interrupts.

    Most modern devices advertise the MSI-X capability instead.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Mar 13 19:28:38 2025
    On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:

    Planning on having the Q+ CPU respond directly to MSI interrupts. My
    thought is to integrate an interrupt controller directly into the CPU
    which will have several queues to prioritize interrupts. The controller
    will take care of detecting the highest priority interrupt to process,
    performing priority inversion occasionally, and detect stuck interrupts.

    My 66000 went all the way to MSI-X interrupts as its base.

    There are mapping methods to map INTx and INTk into MSI-X messages. >>MSI->MSI-X mapping is "just zero extend the message".

    In both cases, the a 32-bit data payload (the MSIX Vector #) is written
    to the MSI-X target address (typically the interrupt controller).

    The difference between them is

    MSI MSI-X
    Supports 1 to 32 Supports an unlimited number of
    _contiguous_ data vectors, each with an unique non-contiguous
    payloads (vector vector index (data payload) and target
    index). address.

    Yes, MSI is a strict and degenerate version of MSI-X

    With MSI, only the base interrupt number (data payload) is
    defined and only one interrupt controller target address is
    use used for all vectors. So vector 0 will send the base number, vector
    1 will send the base number plus one, etc.

    With MSI-X each vector has an assigned data payload (vector
    number) which do not need to be contiguous or monotonically
    increasing. With MSI-X, each interrupt can have a different
    target address (i.e. perhaps a secure interrupt controller
    for some, and a non-secure interrupt controller for others)

    In the ARM case, there are different interrupt types (SGI,
    PPI, SPI, LPI) and each one has a different interrupt controller
    address that the MSI-X vector address should point to.

In My 66000's case, there is an 8-bit field from the Originating-
Software-Stack (which the I/O MMU uses in translation) that prioritizes
    DMA*, interrupt delivery, page fault delivery, and device error
    delivery.

    (*) DRAM access in My 66000 uses the priority of the core at the
    instant of AGEN to allow higher priority cache miss requests to
    percolate through the interconnect faster, and to allow a higher
    priority ATOMIC event to actively interfere with a lower priority
    ATOMIC event, but not the reverse. The interconnect uses it more
    like a virtual channel than a priority.

As to that 8-bit field:: when an interrupt arrives at a higher priority
than the one a core is currently running at, the core captures the interrupt.
    The interrupt message (now 42-bits in size) sets the priority of
    the software-stack (2-bits). SW then extracts the 8-bit field
    and uses it to dispatch to an interrupt-class service handler
    {ISR, Timer, ECC-fail, SVC, SVR, SATA, NVMe, keyboard, mouse, ...}.

    At the instant control arrives, the dispatcher is already re-entrant
    {able to take a higher priority event even of itself} running at its
    priority, under its Root Pointers, and ASIDs, with its pending
    exceptions, and register file, with registers R0..4 filled with
    exception data union{Instruction, Operand 1..3},...}.
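
    A sketch of the software side of that dispatch; the table and names
    here are invented for illustration, not the My 66000 ABI:

    #include <stdint.h>

    typedef void (*isr_fn)(uint32_t message);

    isr_fn handler_class[256];        // indexed by the 8-bit ISR field

    void
    dispatch(uint32_t message, uint8_t isr_index)
    {
        handler_class[isr_index](message);  // e.g. Timer, SVC, SATA, NVMe, ...
    }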

    This
    allows the message-based emulation of level sensitive interrupts
    by using two vectors, one to assert the interrupt and a second
    to deassert the interrupt at the interrupt controller.

    I will have to borrow a trick such as this.

    So MSI is useless for modern devices and basically deprecated
    and cannot be used support level sensitive (e.g. legacy PCI)
    interrupts.

    That was my view once I ran into MSI-X:: MSI-X is a complete and
    more flexible superset of MSI.

    Most modern devices advertise the MSI-X capability instead.

And why not:: it's just a few more flip-flops and almost no more
    sequencing logic.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Mar 13 19:12:34 2025
    On Thu, 13 Mar 2025 17:49:35 +0000, Robert Finch wrote:

    On 2025-03-13 12:34 p.m., MitchAlsup1 wrote:
    On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:

    Planning on having the Q+ CPU respond directly to MSI interrupts. My
    thought is to integrate an interrupt controller directly into the CPU
    which will have several queues to prioritize interrupts. The controller
    will take care of detecting the highest priority interrupt to process,
    performing priority inversion occasionally, and detect stuck interrupts.

    My 66000 went all the way to MSI-X interrupts as its base.

    There are mapping methods to map INTx and INTk into MSI-X messages.
    MSI->MSI-X mapping is "just zero extend the message".

    MSI interrupts are the way to go when there are transistors available.
    Just having discovered them not to long ago, I still think they are
    great.
    I have the luxury of not needing to support legacy interrupt systems
    (yet), all the peripheral cores needing interrupts will have to support
    MSI responses.

    The Interrupt port lives in the memory controller and takes an
    aperture of addresses and maps them to various interrupt tables.

    I have read that MSI devices target particular I/O addresses to send the messages to. Not sure what having a dedicated I/O addresses buys. In my architecture, rather than target an I/O address a specific processing
    core number is specified.

MSI and MSI-X target an MMI/O address aperture (i.e., range); the
    position and the meaning of the address bits and the message bits
    are system dependent.

    It is expected that some address bits will indicate what interrupt
    table to access.

    It is expected that either (or both) bits from the address and message
    will indicate priority of the interrupt.

    In a classical Operating System, each GuestOS gets its own interrupt
    table, and any core running that GuestOS will be watching that inter-
    rupt table. Hypervisors get their own interrupt tables, as do Secure
    monitors. {{To hand waving accuracy}}


    There is a broadcast mechanism that notifies cores of new higher
    priority level interrupts pending, and a protocol for getting
    an interrupt and notifying the IC that it will be handled by a
    core.

    Along with the 32-bit message, the architecture provides for
    6-bits of priority, 2-bits of SW-stack indexing, and an 8-bit
    ISR handler index number.

    My architecture goes with a 3-bit priority field, and an 8-bit field to
    help resolve the cause of the interrupt.

Having an advocate of such a position (EricP BTW) that 8 priority
levels are sufficient made me give great thought to this space.

    I ended up with 64 priority levels, mainly, because I know of no
    general purpose way of comparing priority levels found in JoeOS1
versus those in JoeOS2 or JoeOS3. 8 levels is arguably sufficient
    for JoeOS[k]. But what if one wants some of JoeOS2's priorities
    lower than all JoeOS1's priorities, but the others above them ?!?

So, 64 levels, occupying 1 bit each, fit exactly into the typical HW
storage container (64 bits); this allows 8 JoeOSs to have their
8 levels each fit into one concept we call system-priority,
without any constraint on any JoeOS chosen at random.
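
    A sketch of that idea, with invented names: each JoeOS's 8 local
    levels are mapped at setup onto the 64 system levels, so levels from
    different guests can interleave however the system likes:

    #include <stdint.h>

    uint8_t sys_level[8][8];          // [guest][local level] -> 0..63

    // One bit per system priority level, all guests sharing the same
    // 64-bit container.
    static inline uint64_t
    pending_bit(unsigned guest, unsigned local_level)
    {
        return 1ull << sys_level[guest][local_level];
    }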

    The bus/device/function is also
    part of the interrupt response. A 6-bit target core number field is
    included in the response too. A target core number of all ones is
    treated as a broadcast to all cores. The entire interrupt response
    message is available to the CPU so software can decide how to handle the interrupt. There is some potential for data to be passed back. The
    contents of the data field passed back on the bus is
    PCIe
    unspecified. The
    other info is passed back in the address field of the bus.

    Interrupt tables remain active even when the virtual machine
    it inactive. When next scheduled, priority ISRs will be serviced
    prior to low priority thread instructions.

    {And some things I can't talk about}
    Shucks.

    Setting up an MSI interrupt protocol for Q+.

    The controller will sit on the CPU's main response collect bus.
    Interrupt messages contain the interrupter's bus/device/function + a
    cause code. They may also contain a target core number.

    Planning on distributing the interrupt control to peripheral devices
    instead of having a system level interrupt controller.

    I sketched up an interrupt controller in SystemVerilog and it was only
    about 700 LUTs although it does use seven BRAMs for priority queues.
    The priority queues can hold hundreds of interrupt messages which is
    likely far more than would be needed. So if a queue overflows it would indicate an issue with the interrupt system. The controller also keeps a message history with timestamps. If the same message is present at the
    start and end of the history along with a short difference in the
    timestamp, it is considered a stuck interrupt.

    The controller inverts interrupt priorities for one cycle every 64 clock cycles. The priority inversion is with respect to which queue is read
    from. The idea is to try and prevent a low-priority task that is waiting
    on an interrupt from being permanently disabled.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Mar 13 20:46:30 2025
    On Thu, 13 Mar 2025 19:59:02 +0000, Robert Finch wrote:

    On 2025-03-13 2:42 p.m., Scott Lurndal wrote:
    Robert Finch <robfi680@gmail.com> writes:
    On 2025-03-13 12:34 p.m., MitchAlsup1 wrote:
    On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:

    Planning on having the Q+ CPU respond directly to MSI interrupts. My >>>>> thought is to integrate an interrupt controller directly into the CPU >>>>> which will have several queues to prioritize interrupts. The controller >>>>> will take care of detecting the highest priority interrupt to process, >>>>> performing priority inversion occasionally, and detect stuck interrupts. >>>>
    My 66000 went all the way to MSI-X interrupts as its base.

    There are mapping methods to map INTx and INTk into MSI-X messages.
    MSI->MSI-X mapping is "just zero extend the message".

    MSI interrupts are the way to go when there are transistors available.
    Just having discovered them not to long ago, I still think they are
    great.
    I have the luxury of not needing to support legacy interrupt systems
    (yet), all the peripheral cores needing interrupts will have to support
    MSI responses.

    MSI has been deprecated for years. MSI-X is preferred and used by
    most modern devices (and supported by major operating software).


    The Interrupt port lives in the memory controller and takes an
    aperture of addresses and maps them to various interrupt tables.

    I have read that MSI devices target particular I/O addresses to send the >>> messages to. Not sure what having a dedicated I/O addresses buys. In my
    architecture, rather than target an I/O address a specific processing
    core number is specified.

    There are many reasons that MSI-X supports vector specific target
    addresses.
    1) It is possible to target DRAM (so software can poll for the
    interrupt)

This can be done using a dummy core in the system that echoes interrupt messages back to the DRAM. A current issue with the way interrupts in my system are working is that they always spit out the interrupt on the
    response bus, which means they cannot act like the master.

    To drive messages onto the bus as requests instead of responses every peripheral device would need to be able to be a bus master. It would add
    a lot of logic to the system. Extra cross-bar ports etc.


    2) It is possible to support level-sensitive semantics by using two
    vectors;
    one to assert the interrupt and the second to deassert it. Each
    needs
    to target a different interrupt controller register (address).

    I think this can be handled with the cause code being returned as part
    of the response.

    3) A system may have multiple interrupt controllers for different
    security states (e.g. monitor/secure vs. hypervisor vs. direct
    virtual
    to a guest).

    Is this just to get more target vectors for each interrupt? Could there
    be one interrupt controller with multiple target vectors depending on
    the security state?

Take an AHCI SATA drive that has multiple GuestOSs and several
    HyperVisors using various functions therein.

    While ALL ISRs from this device can use the same code, they must
    not be in a position where any state is shared across system-
    states. It is often easier to not share the code, even though it
    is exactly the same set of bit patterns because where the "other
    state" resides is necessarily different in GuestOS as it is in
    HyperVisor[k] as it is in Secure Monitor[j].

    Q+ CPU always switches to machine mode to process interrupts. It can
    then redirect to other operating modes. A security state code could be
    added to the response so that different vector tables may be used.

    Interrupts in My 66000 can be directed at any interrupt table, at
    any privilege level, at any priority. So, in practice, one could
    allow for a user to handle his own SATA interrupts without ever
    taking an excursion through any GuestOS or Hypervisor code/state;
    nor is it architecturally necessary to disable interrupts or
    exceptions.




    My architecture goes with a 3-bit priority field, and an 8-bit field to
    help resolve the cause of the interrupt. The bus/device/function is also >>> part of the interrupt response.

    That's what ARM64 GIC does, each interrupt is accompanied by a 20+ bit
    'device id', which is composed of the segment, bus, dev/func values
    for the sending device. Since each PCIe controller can consume the
    entire bus space potentially, the segment widens the device id space
    to include a large number of devices. The 'device id' is used to
    select the in-memory table used to map the interrupt to the target
    guest and the target guest vector number.

    Seems like going out to memory tables to fetch the targets of an
    interrupt would be costly, perhaps slow. In the SoC there are already a
    half dozen devices fighting for access to the DRAM.

    Cache it and pay the cost once (or once per context switch).
    It is not like the ISR for the SATA drive for GuestOS[7]
changes very often !! And it IS used often.

    I think you really want an interrupt controller as an intermediary
    between the device and the CPU interrupt signal, allows a great deal
    of implementation flexibility (e.g. changing the target CPU in the

    Wrong abstraction::

    You want an interrupt service device that services interrupts
    (logs them into tables) even when no core (Virtual Machine or
    Virtual Machine Monitor) knows anything about said interrupt
table. That is, an interrupt to a table is serviced even when
    there is no recipient at this instant in time. This prevents
    loss of interrupts the table is supposed to accumulate.

    Only when at least 1 core is processing VM[k] or VMM[j]
    associated with that interrupt table are cores running
    VM[k] or VMM[j] concerned with any random interrupt from
    that table.
    ----------------
    Connecting the missing dots can be done several ways, My
    66000 does::

    My 66000 has a service port (in the memory controller) which
    services the MMI/O requests to the interrupt aperture. I (currently)
    call this the interrupt service port (rather than interrupt
    controller) because it just services messages to the interrupt
    aperture with almost no regard to any consumer. Said consumer
    regard is via a broadcast MMI/O message using the DRAM address
    of the interrupt table with a 64-bit message of raised interrupts.

    Should the Interrupt table CR match broadcast address, core
captures the raised vector...and decides to deal with the interrupt
    or not.
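
    A sketch of that core-side check, with names and types invented for
    illustration (the broadcast carries the table's DRAM address plus a
    64-bit word of raised priority levels):

    #include <stdint.h>
    #include <stdbool.h>

    extern uint64_t interrupt_table_cr;   // &interrupt_table this core watches
    extern unsigned current_priority;     // 0..63, level the core runs at now

    bool
    broadcast_interesting(uint64_t table_addr, uint64_t raised_levels)
    {
        if (table_addr != interrupt_table_cr)
            return false;                 // not our Guest/Hypervisor table
        if (current_priority >= 63)
            return false;                 // already at the top level
        return (raised_levels >> (current_priority + 1)) != 0;  // anything higher?
    }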

    A core can IPI a VM or VMM by performing a MMI/O write using
    the same bit patterns the I/O device would have used. Since
    all interrupts carry privilege (SW stack index) and priority
    along with 8-bit ISR and 32-bit message, SW has a "big Enough"
    interrupt name-space to do anything it likes.

    Finally, there is interrupt negotiation where a service port
    on the core sees the interrupt table broadcast and decides
    to try-to process said interrupt.

    Thus, the Interrupt Service Philosophy
    a) is not close to any core
    b) cores are notified by MMI/O broadcast
    c) cores continue to run code while negotiating for interrupts
    d) HW prefetches thread state and RF before interrupt acceptance
    e) anybody with VAS access can perform IPIs via MMU-PTEs

    Thus, the local Pic is a 6-rail-road-track state machine,
    with a single DW of state (&interrupt_table).

    Absolutely. Be crazy not to. There is an irq controller component
    separate from the CPU. (I decided to move the controller outside of the
    CPU).

    You need 2 concepts:
    a) the deposition of interrupts in the table
b) the delivery of interrupts to consumers (cores)

    (a) must happen even when no core is looking at the table.
    (b) can happen by having SW perform read requests to interrupt
    aperture or by the normal HW broadcast mechanism.

    interrupt controller tables rather than requiring the device driver
    to change the msix-vector address field).

    The MSI-X address is (nominally) set once just after GuestOS device
    discovery. And becomes a virtual constant pointer afterwards.

    Updating I/O is faster than updating DRAM tables.

    Expand:: is this the distinction between flip-flops and a RAM
    buried inside a pre-defined pin-boundary ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Mar 13 21:14:08 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:


    Most modern devices advertise the MSI-X capability instead.

    And why not:: its just a few more flip-flops and almost no more
    sequencing logic.

    I've seen several devices with more than 200 MSI-X vectors;
    thats 96 bits per vector to store a full 64-bit address and
    32-bit data payload.

    Keep in mind that a guest operating system may be writing
    the 64-bit address field in a virtual function assigned to
    it with guest virtual addresses, so the inbound path for
    MSI-X needs to traverse the IOMMU before hitting the
    interrupt controller logic. The guest OS also specifies
    the data payload, which may be an IRQ number on an intel
    system, or a virtual IRQ number translated by the IOMMU
    or Interrupt controller into a physical IRQ number (allowing
    multiple guest OS to use the same IRQs they would use on real
    hardware).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Thu Mar 13 17:00:57 2025
    Updating I/O is faster than updating DRAM tables.

    What do you mean by DRAM tables?
    You mean (IO)MMU-style pagetables, or the tables mapping
    physical addresses to specific DRAM modules, or ...?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Mar 13 21:34:06 2025
    On Thu, 13 Mar 2025 21:14:08 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:


    Most modern devices advertise the MSI-X capability instead.

    And why not:: its just a few more flip-flops and almost no more
    sequencing logic.

    I've seen several devices with more than 200 MSI-X vectors;
    thats 96 bits per vector to store a full 64-bit address and
    32-bit data payload.

At this point, with 200+ entries, flip-flops are not recommended;
instead, these would be placed in a RAM of some sort. Since RAMs
    come in 1KB and 2KB quanta; we have 1 of 2K and 1 of 1K and we
    have 256 said message containers, with 1 cycle access (after
    you get to that corner of some chip).

    Keep in mind that a guest operating system may be writing
    the 64-bit address field in a virtual function assigned to
    it with guest virtual addresses, so the inbound path for
    MSI-X needs to traverse the IOMMU before hitting the
    interrupt controller logic.

    An important concept that is hard to find in the PCIe specifications.

    The guest OS also specifies
    the data payload, which may be an IRQ number on an intel
    system, or a virtual IRQ number translated by the IOMMU
    or Interrupt controller into a physical IRQ number (allowing
    multiple guest OS to use the same IRQs they would use on real
    hardware).

    Just note that GuestOS[k].IRQ[j] uses the same bit patterns
at the device; they get delivered to their own unique
    GuestOS through its interrupt table.

    But even when GuestOS[k] and GuestOS[j] use the same 96-bit MSI-X
    pattern, each GuestOS can interpret its message the way it wants,
    sort its drivers in the way it wants, and operate with blissful
    ignorance of GuestOS[other].

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Thu Mar 13 21:27:37 2025
    Robert Finch <robfi680@gmail.com> writes:
    On 2025-03-13 2:42 p.m., Scott Lurndal wrote:
    Robert Finch <robfi680@gmail.com> writes:
    On 2025-03-13 12:34 p.m., MitchAlsup1 wrote:
    On Thu, 13 Mar 2025 4:50:47 +0000, Robert Finch wrote:

    Planning on having the Q+ CPU respond directly to MSI interrupts. My >>>>> thought is to integrate an interrupt controller directly into the CPU >>>>> which will have several queues to prioritize interrupts. The controller >>>>> will take care of detecting the highest priority interrupt to process, >>>>> performing priority inversion occasionally, and detect stuck interrupts. >>>>
    My 66000 went all the way to MSI-X interrupts as its base.

    There are mapping methods to map INTx and INTk into MSI-X messages.
    MSI->MSI-X mapping is "just zero extend the message".

    MSI interrupts are the way to go when there are transistors available.
    Just having discovered them not to long ago, I still think they are great. >>> I have the luxury of not needing to support legacy interrupt systems
    (yet), all the peripheral cores needing interrupts will have to support
    MSI responses.

    MSI has been deprecated for years. MSI-X is preferred and used by
    most modern devices (and supported by major operating software).


    The Interrupt port lives in the memory controller and takes an
    aperture of addresses and maps them to various interrupt tables.

    I have read that MSI devices target particular I/O addresses to send the >>> messages to. Not sure what having a dedicated I/O addresses buys. In my
    architecture, rather than target an I/O address a specific processing
    core number is specified.

    There are many reasons that MSI-X supports vector specific target addresses. >> 1) It is possible to target DRAM (so software can poll for the interrupt)

    This can be done using a dummy core in the system that echos interrupts >messages back to the DRAM. A current issue with the way interrupts in my >system are working, is they always spit out the interrupt on the
    response bus, which means they cannot act like the master.

    To drive messages onto the bus as requests instead of responses every >peripheral device would need to be able to be a bus master. It would add
    a lot of logic to the system. Extra cross-bar ports etc.


    By its very nature, a device that generates an MSI-X message _must_
    have bus mastering capability (and in PCI/PCIe, the Master Enable
    bit needs to be set in the COMMAND register to allow the message
    to be transmitted to the root port).

    If one uses (and exposes to the OS) MSI-X, then all the semantics
    need to be followed, including the bus master enable bit (and
    the enable and mask bits in the MSI-X capability). Software will
    expect it.



    2) It is possible to support level-sensitive semantics by using two vectors;
    one to assert the interrupt and the second to deassert it. Each needs >> to target a different interrupt controller register (address).

    I think this can be handled with the cause code being returned as part
    of the response.

    3) A system may have multiple interrupt controllers for different
    security states (e.g. monitor/secure vs. hypervisor vs. direct virtual >> to a guest).

    Is this just to get more target vectors for each interrupt? Could there
    be one interrupt controller with multiple target vectors depending on
    the security state?

    Not really. The data payload allows specifying 2^32 unique interrupt
    numbers.

    Consider a system that has two security domains - one for firmware and
    perhaps video decoders that can interface directly with the graphics
    controller and the other to run standard operating software. Both
    need interrupts, but they need to be completely separate; the
    non-secure side and secure side of the interrupt controller are
    mostly independent (other than when looking at interrupt priority;
    half the priority range is reserved for the secure space, the other
    half for the non-secure space).

    This is how ARM's aarch64 works. In the ARM case, a wired,
    per-processor or inter-processor interrupt can be assigned to
    a security group by the secure software, such that secure
    interrupts would be delivered to the secure monitor and non-secure
    would be delivered to the hypervisor or bare-metal OS.

    MSI-X interrupts are always non-secure.


    That's what ARM64 GIC does, each interrupt is accompanied by a 20+ bit
    'device id', which is composed of the segment, bus, dev/func values
    for the sending device. Since each PCIe controller can consume the
    entire bus space potentially, the segment widens the device id space
    to include a large number of devices. The 'device id' is used to
    select the in-memory table used to map the interrupt to the target
    guest and the target guest vector number.

    Seems like going out to memory tables to fetch the targets of an
    interrupt would be costly, perhaps slow. In the SoC there are already a
    half dozen devices fighting for access to the DRAM.

    Cache. The ARM GIC caches the memory tables (and provides the
    appropriate flush and invalidate operations for configuration
    changes).

    When you need to support an interrupt space 24-bits wide, you can't
    do it with wires and flops.



    I think you really want an interrupt controller as an intermediary
    between the device and the CPU interrupt signal, allows a great deal
    of implementation flexibility (e.g. changing the target CPU in the

    Absolutely. Be crazy not to. There is an irq controller component
    separate from the CPU. (I decided to move the controller outside of the
    CPU).

    interrupt controller tables rather than requiring the device driver
    to change the msix-vector address field).

    Updating I/O is faster than updating DRAM tables.

    I need to study more on the use of interrupt tables.

    They don't change often; generally only when a new guest
    is created or an existing guest is destroyed.

    The ARM GIC also stores the interrupt priority in memory for
    the 2^24 interrupt (LPI) space; one byte per supported
    interrupt (upper limit programmable by the OS). A single enable bit
    and 7 priority bits per interrupt, indexed by the interrupt number.
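
    As a minimal software-side sketch of indexing such a property byte
    (the bit layout and the names below are assumptions for
    illustration, not the GIC's actual register format):

        #include <stdbool.h>
        #include <stdint.h>

        typedef struct {
            uint8_t  *prop;      /* base of the in-memory property table */
            uint32_t  num_lpis;  /* upper limit programmed by the OS     */
        } lpi_prop_table;

        /* Assume bit 0 = enable, bits 7:1 = priority. */
        static bool lpi_enabled(const lpi_prop_table *t, uint32_t intid)
        {
            return intid < t->num_lpis && (t->prop[intid] & 0x1u);
        }

        static uint8_t lpi_priority(const lpi_prop_table *t, uint32_t intid)
        {
            return (uint8_t)(t->prop[intid] >> 1);   /* 7-bit priority */
        }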

    These properties are also cached in the GIC, so for a network
    device running at full speed at 200 Gb/s, they won't need to
    go to memory for every interrupt (although a network controller
    at that speed will use interrupt moderation and receive side
    scaling techniques to improve throughput and reduce interrupt
    overhead).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Mar 13 21:39:55 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 19:59:02 +0000, Robert Finch wrote:



    Is this just to get more target vectors for each interrupt? Could there
    be one interrupt controller with multiple target vectors depending on
    the security state?

    Take an AHCI SATA drive that has multiple GuestOSs and several
    HyperVisors using various functions therein.

    Well, use NVMe as an example. AHCI only supports one function
    and cannot be virtualized.

    One can, however, have one AHCI controller per guest, but that
    would be both expensive and quite limiting such that a single
    guest would always need to use the same AHCI controller due to
    the actual disk connectivity.

    NVMe does support lots of guest OS accessing the device and
    apportions resources based on the control function (physical
    function 0, allocates virtual functions to the various guests
    with appropriate configuration to restrict a guest to a
    particular virtual SSD).


    I think you really want an interrupt controller as an intermediary
    between the device and the CPU interrupt signal, allows a great deal
    of implementation flexibility (e.g. changing the target CPU in the

    Wrong abstraction::

    You want an interrupt service device that services interrupts
    (logs them into tables) even when no core (Virtual Machine or
    Virtual Machine Monitor) knows anything about said interrupt
    table. That is, an interrupt to a table is serviced even when
    there is no recipient at this instant in time. This prevents
    loss of interrupts the table is supposed to accumulate.

    That's exactly what an interrupt controller does. It routes
    interrupts to active guests and records pending interrupts
    for non-active guests (and interrupts to the hypervisor to
    schedule the guest).

    For recording pending, there are tables in DRAM for each
    guest describing the characteristics and routing (i.e.
    to which physical CPU(s) the guest is running on) for
    each interrupt. Given a 32 (or cut-down 24) bit
    interrupt space, that's definitely stored in DRAM,
    not flops or on-chip SRAM.


    Only when at least 1 core is processing VM[k] or VMM[j]
    associated with that interrupt table are cores running
    VM[k] or VMM[j] concerned with any random interrupt from
    that table.
    ----------------
    Connecting the missing dots can be done several ways, My
    66000 does::

    My 66000 has a service port (in the memory controller) which
    services the MMI/O requests to the interrupt aperture. I (currently)
    call this the interrupt service port (rather than interrupt
    controller) because it just services messages to the interrupt
    aperture with almost no regard to any consumer. Said consumer
    regard is via a broadcast MMI/O message using the DRAM address
    of the interrupt table with a 64-bit message of raised interrupts.

    And if a device has 1024 interrupt vectors per function? It's
    not uncommon.



    interrupt controller tables rather than requiring the device driver
    to change the msix-vector address field).

    The MSI-X address is (nominally) set once just after GuestOS device discovery. And becomes a virtually constant pointer afterwards.

    Updating I/O is faster than updating DRAM tables.

    Expand:: is this the distinction between flip-flops and a RAM
    buried inside a pre-defined pin-boundary ??

    The tables should be in the main DRAM, accessible to the operating
    system (actually the operating system (or hypervisor) programs
    the base address of the tables into the interrupt controller).

    The cost for on-chip SRAM or worse, flops, to track all this
    state is both area and cost prohibitive.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Mar 13 23:34:22 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 21:14:08 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:


    Most modern devices advertise the MSI-X capability instead.

    And why not:: it's just a few more flip-flops and almost no more sequencing logic.

    I've seen several devices with more than 200 MSI-X vectors;
    that's 96 bits per vector to store a full 64-bit address and
    32-bit data payload.

    At this point, with 200+ entries, flip-flops are not recommended,
    instead these would be placed in a RAM of some sort. Since RAMs
    come in 1KB and 2KB quanta; we have 1 of 2K and 1 of 1K and we
    have 256 said message containers, with 1 cycle access (after
    you get to that corner of some chip).

    That is 200 _per function_. Consider a physical function that
    supports the SRIOV capability and configures 2048 virtual
    functions. So that's 2049 * 200 MSI-X vectors just for one
    device. It's not unusual. These vectors are, of course,
    stored on the device itself. Mostly in RAMs.
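
    For a sense of scale, just doing the arithmetic on the numbers
    quoted above (and ignoring the per-vector pending/mask bits), that
    is roughly 4.7 MiB of vector storage for one such device:

        #include <stdio.h>

        int main(void)
        {
            /* 2049 functions x 200 vectors x 12 bytes each
               (64-bit address + 32-bit data payload). */
            unsigned long bytes = 2049UL * 200UL * 12UL;

            printf("%lu bytes (~%.1f MiB)\n",
                   bytes, bytes / (1024.0 * 1024.0));
            return 0;
        }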


    Keep in mind that a guest operating system may be writing
    the 64-bit address field in a virtual function assigned to
    it with guest virtual addresses, so the inbound path for
    MSI-X needs to traverse the IOMMU before hitting the
    interrupt controller logic.

    An important concept that is hard to find in the PCIe specifications.

    It's more of a system-level thing than a PCI thing, so I
    wouldn't expect to find it in the PCI specification. System
    level standards (like ARM's SBSA (server base system architecture))
    cover these types of topics.


    The guest OS also specifies
    the data payload, which may be an IRQ number on an intel
    system, or a virtual IRQ number translated by the IOMMU
    or Interrupt controller into a physical IRQ number (allowing
    multiple guest OS to use the same IRQs they would use on real
    hardware).

    Just note that GuestOS[k].IRQ[j] uses the same bit patterns
    at the device, they get delivered to their own unique
    GuestOS through its interrupt table.

    But even when GuestOS[k] and GuestOS[j] use the same 96-bit MSI-X
    pattern, each GuestOS can interpret its message the way it wants,
    sort its drivers in the way it wants, and operate with blissful
    ignorance of GuestOS[other].

    Linux, for example, knows that the format of the data payload
    on an MSI-X message is an unformatted 32-bit integer. Limiting
    that or defining a hard format is a non-starter, as the kernel
    software folks will just point and laugh. They really don't
    like accommodating new architectures that play fast and loose
    with the standard behavior. DAMHIKT.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Mar 13 23:52:10 2025
    On Thu, 13 Mar 2025 23:34:22 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 21:14:08 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:


    Most modern devices advertise the MSI-X capability instead.

    And why not:: it's just a few more flip-flops and almost no more sequencing logic.

    I've seen several devices with more than 200 MSI-X vectors;
    that's 96 bits per vector to store a full 64-bit address and
    32-bit data payload.

    At this point, with 200+ entries, flip-flops are not recommended,
    instead these would be placed in a RAM of some sort. Since RAMs
    come in 1KB and 2KB quanta; we have 1 of 2K and 1 of 1K and we
    have 256 said message containers, with 1 cycle access (after
    you get to that corner of some chip).

    That is 200 _per function_. Consider a physical function that
    supports the SRIOV capability and configures 2048 virtual
    functions. So that's 2049 * 200 MSI-X vectors just for one
    device. It's not unusual. These vectors are, of course,
    stored on the device itself. Mostly in RAMs.

    In My 66000, said device can place those value-holding containers
    in actual DRAM should it want to punt the storage. This adds latency
    and decreases on-die storage. At that point, the device has an
    unlimited number of "special things".


    Keep in mind that a guest operating system may be writing
    the 64-bit address field in a virtual function assigned to
    it with guest virtual addresses, so the inbound path for
    MSI-X needs to traverse the IOMMU before hitting the
    interrupt controller logic.

    An important concept that is hard to find in the PCIe specifications.

    It's more of a system-level thing than a PCI thing, so I
    wouldn't expect to find it in the PCI specification. System
    level standards (like ARM's SBSA (server base system architecture))
    cover these types of topics.


    The guest OS also specifies
    the data payload, which may be an IRQ number on an intel
    system, or a virtual IRQ number translated by the IOMMU
    or Interrupt controller into a physical IRQ number (allowing
    multiple guest OS to use the same IRQs they would use on real
    hardware).

    Just note that GuestOS[k].IRQ[j] uses the same bit patterns
    at the device, they get delivered to their own unique
    GuestOS through its interrupt table.

    But even when GuestOS[k] and GuestOS[j] use the same 96-bit MSI-X
    pattern, each GuestOS can interpret its message the way it wants,
    sort its drivers in the way it wants, and operate with blissful
    ignorance of GuestOS[other].

    Linux, for example, knows that the format of the data payload
    on an MSI-X message is an unformatted 32-bit integer. Limiting
    that or defining a hard format is a non-starter, as the kernel
    software folks will just point and laugh. They really don't
    like accommodating new architectures that play fast and loose
    with the standard behavior. DAMHIKT.

    Not at all (laughing Linux people); I make no specification to
    the system (at the architectural level) of how the 32-bit
    value is interpreted, nor which range of addresses in MMI/O
    space map to interrupt tables. Those are all system choices--
    possibly dynamic (ASLR and all that, or moving once an hour...)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stefan Monnier on Thu Mar 13 23:27:38 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Updating I/O is faster than updating DRAM tables.

    What do you mean by DRAM tables?

    Please don't snip attributions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Thu Mar 20 12:50:09 2025
    In article <19ad12aadf9a22b760487418c871b3c6@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:

    In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    So that you can have some mechanism for building critical
    sections that span multiple instructions, where just having
    access to atomics and spin locks isn't enough.

    For what kinds of activities are these critical sections ??

    Perhaps modifying the interrupt handlers themselves. Maybe
    resetting a stack. If you know that you're not in a place where
    you can reasonably save the state associated with taking an
    interrupt, you should disable interrupts.

    For example, I can schedule a DPC/softIRQ without using a critical
    section (and so can even user threads without privilege, given
    the MMU allows). So, we need to dive through the "rest of the
    available mechanics" -- keyword available !

    For that one case. Perhaps there are others where a DPC-style
    mechanism (not mechanic) is not appropriate.

    There are mechanisms available that readers may not be aware of
    or quite grasp how a feature can be used.

    Indeed.

    So it comes back to:: For what kinds of activities are these
    critical sections ?? We know about DPC/softIRQs, we know
    about atomic events around core scheduling, and inter-{thread,
    process, core} communication. Others ?

    Booting the system, before interrupt handlers have been
    established.

    When running at any privilege, do you not want to accept an interrupt
    at higher priority with as little delay as possible ??

    Usually the assumption is that the critical section is short; a
    handful of instructions, perhaps, so that the cost of blocking a
    high priority interrupt is low.

    I think you are talking about critical sections that are not allowed
    to fail or be interfered with:: Yes ?

    Yup.

    But let me turn it around, what real-world problem does this
    interrupt prioritization scheme solve coupled with the inability
    to disable interrupts solve? Is anyone complaining that the
    ability to disable interrupts is somehow a serious burden?

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Thu Mar 20 12:44:08 2025
    In article <af2d54a7c6c694bf18bcca6e6876a72b@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:

    In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    So that you can have some mechanism for building critical
    sections that span multiple instructions, where just having
    access to atomics and spin locks isn't enough.

    You can do this with priority, too.

    In the limit, disabling interrupts is just setting the priority
    mask to infinity, so in some sense this is de facto true. But I
    don't see how having a prioritization scheme is all that useful.
    Consider that when you're starting the system up, you need to
    do things like set up an interrupt handler vector, and so on;
    surely you want to block all interrupts until you have done so.
    If your system model is that receipt of an interrupt may result
    in a context switch, and you don't want to do that while holding
    a mutex similarly.

    Sometimes you really don't want to be interrupted.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Thu Mar 20 12:03:12 2025
    MitchAlsup1 wrote:
    On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:

    In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    So that you can have some mechanism for building critical
    sections that span multiple instructions, where just having
    access to atomics and spin locks isn't enough.

    For what kinds of activities are these critical sections ??

    For example, I can schedule a DPC/softIRQ without using a critical
    section (and so can even user threads without privilege, given
    the MMU allows). So, we need to dive through the "rest of the
    available mechanics" -- keyword available !

    There are mechanisms available that readers may not be aware of
    or quite grasp how a feature can be used.

    So it comes back to:: For what kinds of activities are these
    critical sections ?? We know about DPC/softIRQs, we know
    about atomic events around core scheduling, and inter-{thread,
    process, core} communication. Others ?

    When running at any privilege, do you not want to accept an interrupt
    at higher priority with as little delay as possible ??

    Usually the assumption is that the critical section is short; a
    handful of instructions, perhaps, so that the cost of blocking a
    high priority interrupt is low.

    I think you are talking about critical sections that are not allowed
    to fail or be interfered with:: Yes ?

    There are various situations.

    1) There are single memory locations used by the OS for communicating
    between interrupt priority levels within a single core
    that must be updated with a non-interruptible R-M-W sequence.

    2) Rarely in an OS there may be multiple memory locations for communicating
    between interrupt priority levels within a single core
    that must be updated with a non-interruptible R-M-W sequence.

    3) Sometimes devices are (mis)designed such that multiple control
    registers must be written in a non-interruptible manner.

    For handling #1 I have a set of non-interruptible RMW instructions.
    These are the same as their atomic RMW cousins except without fences
    because they are only used for communication within a single core.
    These take care of the vast majority of intra-core inter-level signaling:

    ninswp Non-interruptible swap
    nincas Non-interruptible compare and swap
    ninfadd Non-interruptible fetch ADD
    ninfand Non-interruptible fetch AND
    ninfor Non-interruptible fetch OR
    ninfxor Non-interruptible fetch XOR

    For situations 2 & 3 there really is no alternative but to shut off
    all interrupts for a few instructions.

    For example, there were two 8259 Interrupt Controllers in the PC that
    in order to do a return from interrupt had to be manipulated in a non-interruptible sequence. So you had to shut off all interrupts,
    write two 8259 control registers, then REI which re-enabled interrupts.

    People design devices where you write control register A with the
    number of a register that will be written when you write register B.
    I don't know why people keep doing this, but they do.
    If those registers A and B are written by multiple interrupt levels
    then that is a non-interruptible sequence and forces you to disable all
    interrupts for that two write sequence.
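
    A minimal sketch of that last case, assuming hypothetical register
    names and platform interrupt-disable helpers; the only point is
    that the index write and the data write must not be split by
    another interrupt level that touches the same pair:

        #include <stdint.h>

        extern volatile uint32_t *REG_A;  /* selects the internal register    */
        extern volatile uint32_t *REG_B;  /* write lands in selected register */

        /* Assumed helpers: save+disable and restore this core's
           local interrupt state. */
        extern unsigned long irq_save_disable(void);
        extern void irq_restore(unsigned long flags);

        static void write_indexed_reg(uint32_t index, uint32_t value)
        {
            unsigned long flags = irq_save_disable();
            *REG_A = index;     /* pick which internal register ...     */
            *REG_B = value;     /* ... the next write actually goes to  */
            irq_restore(flags);
        }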

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Thu Mar 20 19:41:10 2025
    On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:

    In article <af2d54a7c6c694bf18bcca6e6876a72b@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:

    In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    So that you can have some mechanism for building critical
    sections that span multiple instructions, where just having
    access to atomics and spin locks isn't enough.

    You can do this with priority, too.

    In the limit, disabling interrupts is just setting the priority
    mask to infinity, so in some sense this is de facto true. But I
    don't see how having a prioritization scheme is all that useful.
    Consider that when you're starting the system up, you need to
    do things like set up an interrupt handler vector, and so on;

    The Interrupt vector is completely software. HW delivers the MSI-X
    message and the service for which this message is for, and SW decides
    what to do with this message to that service provider. No HW vectoring
    but HW does deliver all the bits needed to guide SW to its goal.

    surely you want to block all interrupts until you have done so.

    I did not say I don't HAVE an interrupt disable, I state that one
    receives control in an already reentrant state, so the typical
    uses of ID are not in play--and I want to talk about the other
    uses of ID. You are doing a fine job.

    If your system model is that receipt of an interrupt may result
    in a context switch, and you don't want to do that while holding
    a mutex similarly.

    Given the privilege to set priority, setting of priority is that
    same 1 cycle event that setting or clearing ID would be.

    Sometimes you really don't want to be interrupted.

    And sometimes you don't want to be interrupted unless the
    "house is on fire"; I cannot see a time when "the house is
    on fire" that you would not want to take the interrupt.

    Is there one ?!?

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Thu Mar 20 19:52:01 2025
    Thank you EricP for that list of uses strenuously wanting interrupts
    to be disabled.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Thu Mar 20 20:25:59 2025
    In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:

    In article <af2d54a7c6c694bf18bcca6e6876a72b@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:

    In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    So that you can have some mechanism for building critical
    sections that span multiple instructions, where just having
    access to atomics and spin locks isn't enough.

    You can do this with priority, too.

    In the limit, disabling interrupts is just setting the priority
    mask to infinity, so in some sense this is de facto true. But I
    don't see how having a prioritization scheme is all that useful.
    Consider that when you're starting the system up, you need to
    do things like set up an interrupt handler vector, and so on;

    The Interrupt vector is completely software. HW delivers the MSI-X
    message and the service for which this message is for, and SW decides
    what to do with this message to that service provider. No HW vectoring
    but HW does deliver all the bits needed to guide SW to its goal.

    Maybe I'm missing the context here, and apologies if I am. But
    which interrupt disable are we talking about here, and where? I
    am coming at this from the perspective of systems software
    running on some CPU core somewhere; if I want that software to
    enter a critical section that cannot be interrupted, then it
    seems to me that MSI-X delivery is just a detail at that point.
    I might want to prevent IPIs, or delivery or local interrupts
    attached directly to the CPU (or something like an IOAPIC or
    GIC distributor).

    In any event, if software isn't prepared to receive an interrupt
    then there must necessarily be some mechanism to disable
    interrupt delivery until it is.

    surely you want to block all interrupts until you have done so.

    I did not say I don't HAVE an interrupt disable, I state that one
    receives control in an already reentrant state, so the typical
    uses of ID are not in play--and I want to talk about the other
    uses of ID. You are doing a fine job.

    I suppose what I'm saying in part is that the "typical users"
    all depend on the system's software architecture. The ability
    to handle nested interrupts at the hardware level is all well
    and good, but doesn't matter if software can't handle it.

    If your system model is that receipt of an interrupt may result
    in a context switch, and you don't want to do that while holding
    a mutex similarly.

    Given the privilege to set priority, setting of priority is that
    same 1 cycle event that setting or clearing ID would be.

    I guess I don't see how that is relevant. I mean, again, I
    suppose you could say, "set the priority mask to infinity to de
    facto disable interrupts", but that seems like a distinction
    without a difference.

    Sometimes you really don't want to be interrupted.

    And sometimes you don't want to be interrupted unless the
    "house is on fire"; I cannot see a time when "the house is
    on fire" that you would not want to take the interrupt.

    Is there one ?!?

    Consider a thread that takes a spinlock; suppose some
    high-priority interrupt comes in while the thread is holding
    that lock. In response to the interrupt, software decides to
    suspend the thread and switch some other thread; that thread
    wants to lock the spin lock that the now-descheduled thread is
    holding: a classic deadlock scenario.

    A valid response here might be, "don't context switch from the
    interrupt handler; use a DPC instead". That may be valid, but
    it puts a constraint on the software designer that may be onerous
    in practice: suppose the interrupt happens in the scheduler,
    while examining a run queue or something. A DPC object must be
    available, etc.

    Further, software must now consider the complexity of
    potentially interruptable critical sections. From the
    standpoint of reasoning about already-complex concurrency issues
    it's simpler to be able to assert that (almost) all interrupt
    delivery can be cheaply disabled entirely, save for very
    special, specific, scenarios like NMIs. Potentially switching
    away from a thread holding a spinlock sort of defeats the
    purpose of a spinlock in the first place, which is a mutex
    primitive designed to avoid the overhead of switching.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Thu Mar 20 21:55:35 2025
    On Thu, 20 Mar 2025 20:25:59 +0000, Dan Cross wrote:

    In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:

    In article <af2d54a7c6c694bf18bcca6e6876a72b@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Wed, 19 Mar 2025 14:03:56 +0000, Dan Cross wrote:

    In article <36b8c18d145cdcd673713b7074cce6c3@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I want to address the elephant in the room::

    Why disable interrupts AT ALL !!

    So that you can have some mechanism for building critical
    sections that span multiple instructions, where just having
    access to atomics and spin locks isn't enough.

    You can do this with priority, too.

    In the limit, disabling interrupts is just setting the priority
    mask to infinity, so in some sense this is de facto true. But I
    don't see how having a prioritization scheme is all that useful.
    Consider that when you're starting the system up, you need to
    do things like set up an interrupt handler vector, and so on;

    The Interrupt vector is completely software. HW delivers the MSI-X
    message and the service for which this message is for, and SW decides
    what to do with this message to that service provider. No HW vectoring
    but HW does deliver all the bits needed to guide SW to its goal.

    Maybe I'm missing the context here, and apologies if I am. But
    which interrupt disable are we talking about here, and where? I

    We are talking about the core's Interrupt Disablement and nothing
    of the gates farther out in the system that control interrupt
    deliveries (or not).

    am coming at this from the perspective of systems software
    running on some CPU core somewhere; if I want that software to
    enter a critical section that cannot be interrupted, then it
    seems to me that MSI-X delivery is just a detail at that point.
    I might want to prevent IPIs, or delivery or local interrupts
    attached directly to the CPU (or something like an IOAPIC or
    GIC distributor).

    All of those flow through the same mechanism; even a locally
    generated timer interrupt.

    In any event, if software isn't prepared to receive an interrupt
    then there must necessarily be some mechanism to disable
    interrupt delivery until it is.

    I agree with this notion.

    surely you want to block all interrupts until you have done so.

    I did not say I don't HAVE an interrupt disable, I state that one
    receives control in an already reentrant state, so the typical
    uses of ID are not in play--and I want to talk about the other
    uses of ID. You are doing a fine job.

    I suppose what I'm saying in part is that the "typical users"
    all depend on the system's software architecture. The ability
    to handle nested interrupts at the hardware level is all well
    and good, but doesn't matter if software can't handle it.

    If your system model is that receipt of an interrupt may result
    in a context switch, and you don't want to do that while holding
    a mutex similarly.

    Given the privilege to set priority, setting of priority is that
    same 1 cycle event that setting or clearing ID would be.

    I guess I don't see how that is relevant. I mean, again, I
    suppose you could say, "set the priority mask to infinity to de
    facto disable interrupts", but that seems like a distinction
    without a difference.

    Sometimes you really don't want to be interrupted.

    And sometimes you don't want to be interrupted unless the
    "house is on fire"; I cannot see a time when "the house is
    on fire" that you would not want to take the interrupt.

    Is there one ?!?

    Consider a thread that takes a spinlock; suppose some
    high-priority interrupt comes in while the thread is holding
    that lock. In response to the interrupt, software decides to
    suspend the thread and switch some other thread; that thread
    wants to lock the spin lock that the now-descheduled thread is
    holding: a classic deadlock scenario.

    If we can s/taken/attempts/ in the first line of that paragraph::
    My architecture has a mechanism to perform ATOMIC stuff over multiple instruction time frames that has the property that when a higher priority
    thread interferes with a lower priority thread, the HPT wins
    and the LPT fails its ATOMIC event. It is for exactly this reason
    that I drag priority through the memory hierarchy--so that real-time
    things remain closer to real-time (no or limited priority inversions).

    Mandating to SW that typical SW must use this mechanism will not fly,
    at least in the first several iterations of an OS for the system.
    So the std. methods must remain available.

    AND to a large extent this sub-thread is attempting to get a
    usefully large number of them annotated.

    A valid response here might be, "don't context switch from the
    interrupt handler; use a DPC instead". That may be valid, but
    it puts a constraint on the software designer that may be onerous
    in practice: suppose the interrupt happens in the scheduler,
    while examining a run queue or something. A DPC object must be
    available, etc.

    This seems to be onerous on SW mainly because of too many unknowns
    to track down and deal with.

    Further, software must now consider the complexity of
    potentially interruptable critical sections. From the
    standpoint of reasoning about already-complex concurrency issues
    it's simpler to be able to assert that (almost) all interrupt
    delivery can be cheaply disabled entirely, save for very
    special, specific, scenarios like NMIs. Potentially switching
    away from a thread holding a spinlock sort of defeats the
    purpose of a spinlock in the first place, which is a mutex
    primitive designed to avoid the overhead of switching.

    Agreed.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Thu Mar 20 22:28:14 2025
    On Thu, 20 Mar 2025 12:50:09 +0000, Dan Cross wrote:

    In article <19ad12aadf9a22b760487418c871b3c6@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    But let me turn it around, what real-world problem does this
    interrupt prioritization scheme solve coupled with the inability
    to disable interrupts solve?

    The OS does not have to "walk" its schedule queues in order to
    context switch.

    The SVR (SuperVisor Return) instruction checks the priority of
    the thread that had been interrupted, and if it is of lower
    priority than what is pending in the interrupt table, the
    one on the interrupt table receives control while the lower
    priority thread remains interrupted.

    Thus, the return path from an ISR, just does an SVR. If a
    DPC/softIRQ has been setup, or another interrupt has become
    pending, HW chooses the path--acting as if HW returned control
    to the interrupted thread, and then immediately took the inter-
    rupt of highest priority still pending--without having executed
    a single instruction in the interrupted thread.
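
    In rough pseudocode (the names here are invented; this only
    restates the behavior described above, it is not the architectural
    definition of SVR):

        /* conceptually performed by hardware at SVR */
        if (highest_pending_priority(interrupt_table) >
                priority(interrupted_thread)) {
            /* Dispatch the pending interrupt or DPC/softIRQ; the
               interrupted thread stays suspended, with no instructions
               executed on its behalf. */
            dispatch(highest_pending(interrupt_table));
        } else {
            resume(interrupted_thread);   /* ordinary return path */
        }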

    Is anyone complaining that the
    ability to disable interrupts is somehow a serious burden?

    This thread started with the notion of MSI interrupts and the
    inherent need to DI upon arrival at ISR dispatcher. This sub-
    thread is investigating whether one can perform ISRs without
    disabling interrupts. Thus, making IDisablement an unnecessary
    part of interrupt servicing itself.

    Your (and EricP's) contribution is to annotate those instances
    where one can arrive at an ISR IE and still want/need to DI and EI
    independent of the control transfer from and back to interrupted
    thread (even if control arrives reentrantly from interrupted thread).

    Many computer designers copy/modify their interrupt architectures
    from machines architected prior to the virtualization of Operating
    Systems, virtualization of interrupts, HyperVisors, Paranoid threads
    (like banking applications) that have survived into the present.

    I went the other way, trying to develop a system whereby the path
    across context switching domains is cheap and easy*. Not needing
    to DI at entry to ISR or exception handlers (more or less) fell
    out for free due to the way "the rest of the system" evolved.

    (*) cheap for HW, and easy expressly for SW

    Not being employed, I don't have an OS guy to bounce ideas off
    of and discuss the consequences of going this way or that way,
    so I lean on a few people here to "beat me up" when
    my ideas lead me astray.

    18 months ago I had a real-time version of the architecture,
    with the property that at any instant in time, each core was
    executing the highest priority thread, cores had a means to
    message back and forth as if it were a remote procedure call
    which could return (synchronous; or not--asynchronous) without
    an OS mediating these things. I abandoned it not because it
    was not good for real time, but because there was no way to
    make it Linux Friendly.

    Where I am now, is my attempt at being Linux friendly.


    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Thu Mar 27 12:46:06 2025
    In article <efd7a61f2d4995d42225ecf999b6786f@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Wed, 26 Mar 2025 23:48:17 +0000, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 26 Mar 2025 21:35:01 +0000, Scott Lurndal wrote:
    [snip]
    If you can NAK a snoop, are you then assuming that all DMA is
    non-coherent? That will really piss off the kernel folks.

    Not at all. DMA is as coherent as the I/O MMU's translating
    PTE prescribes.

    But; why would any sane coder be performing an ATOMIC event on
    memory that can be blindly overwritten by DMA ??? Or for that
    matter, critical section data being overwritten (at some unknown
    and unpredictable time) by DMA ??

    Polling on a completion status which the device DMAs
    (often with an allocate hint) is not uncommon. A 32-bit
    word in memory updated by the device - the allocate hint
    might update the LLC directly and invalidate L1/L2 or
    may update the target cache line that the driver is
    polling directly.

    Why do this ATOMICally ?? why not just read it ?? Or is this a
    grab (read and modify so someone else doesn't get it.) ??

    Consider multiword DMA descriptors in a ring buffer in DRAM,
    shared with a device. Manipulating these races against hardware
    that's consuming the ring, and on many devices you need to do a
    multiword write to transfer ownership of a descriptor to the
    device for processing (say, setting both a DMA address and an
    ownership bit in two consecutive words). A few particular
    models of ethernet controllers in Cortex-M7 series parts from ST
    come to mind as examples of devices that do this. Of course
    proper maintenance of ring tail pointers in the device coupled
    with a barrier can mitigate this, but it would be nice to do
    both the descriptor and tail pointer updates atomically; instead
    of maintaining a separate tail pointer into the ring, you could
    just initiate a start operation by writing an arbitrary value
    into a register on the device.
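
    A minimal sketch of the conventional (non-ESM) pattern being
    described, with a made-up two-word descriptor layout and an assumed
    write barrier (a DMB on the Cortex-M7 parts mentioned); the barrier
    keeps the device from observing the OWN bit before the address it
    covers:

        #include <stdint.h>

        #define DESC_OWN  (1u << 31)    /* descriptor owned by the device */

        struct dma_desc {
            volatile uint32_t ctrl;     /* OWN bit, length, flags */
            volatile uint32_t addr;     /* DMA buffer address     */
        };

        extern void dma_wmb(void);      /* assumed write barrier */

        static void post_desc(struct dma_desc *d, uint32_t buf, uint32_t len)
        {
            d->addr = buf;
            d->ctrl = len & 0xffffu;    /* everything except OWN            */
            dma_wmb();                  /* address/length visible first ... */
            d->ctrl |= DESC_OWN;        /* ... then hand it to the device   */
            dma_wmb();
        }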

    Indeed, for the aforementioned ST part, many drivers have gotten
    this stuff wrong. See, for example, https://github.com/oxidecomputer/hubris/pull/1493 and https://github.com/embassy-rs/embassy/pull/2155

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stefan Monnier on Thu Mar 27 13:42:56 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    ESM operates in 3 modes {optimistic, Greedy, Methodological}
    [...]

    Sounds fancy, but it also makes me worry about the complexity of verifying & testing correctness (of the implementation).

    Agreed. A lot of complexity for an unknown performance impact.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Thu Mar 27 09:17:33 2025
    ESM operates in 3 modes {optimistic, Greedy, Methodological}
    [...]

    Sounds fancy, but it also makes me worry about the complexity of verifying & testing correctness (of the implementation).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Thu Mar 27 17:19:21 2025
    In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
    [snip]
    1. `from` may point to a cache line
    2. `to` may point to a different cache line.
    3. Does your architecture use a stack?

    No

    You make multiple references to a stack just below. I
    understand stack references are excluded from the atomic event,
    but I asked whether the architecture uses a stack. Does it, or
    doesn't it?

    What sort of alignment criteria do you impose?

    On the participating data none. On the cache line--cache line.

    I meant for data on the stack. The point is moot, though, since
    as you said stack data does not participate in the atomic event.

    At first blush, it seems to me that
    the pointers `from_next`, `from_prev`, and `to_next` could be
    on the stack and if so, will be on at least one cache line,
    and possibly 2, if the call frame for `MoveElement` spans
    more than one.

    The illustrated code is using 6 participating cache lines.
    Where local variables are kept (stack, registers) does not
    count against the tally of participating cache lines.

    Huh. How would this handle something like an MCS lock, where
    the lock node may be on a stack, though accessible globally in
    some virtual address space?

    Of course, it's impossible to say when this
    is compiled, but maybe you require that the stack is always
    aligned to a cache line boundary or something.

    Data stack is DoubleWord aligned at all times.

    Ok. So just for context, the point of this question was to
    understand whether local data participating in one of these
    atomic events could span multiple cache lines. Since they
    don't participate it doesn't matter vis the current discussion.

    these are all in registers and it's fine.

    Local variables are not participating in the event--they are just
    around to guide the event towards conclusion.

    Ok.

    4. Depending on the size and alignment criteria of the the
    `Element` structure and the placement of the `next` and
    `prev` elements in the struct (unknowable from this example),
    `to->next` may be on another cache line.
    5. Similarly, `from->next`
    6. And `from->prev`
    7. And `from_prev->next`
    8. And `from_next->prev`
    9. And `to_next->prev`.

    So potentially this code _could_ touch 9 or 10 cache lines, more
    than you said are supported for a single atomic event. What
    happens in that case? Does the code just always return false?
    The programmer best make sure the optimizer is keeping those
    temporary pointers off the stack; good luck with a debug build.

    Anyway, let's assume that's not the case, and we don't hit the
    pathological case with cache lines, and we're not dealing with
    edge cases at the front or end of the lists (so various next or
    prev pointers are not NULL). Why don't I try to explain what I
    think this code is doing, and you tell me whether I'm right or
    wrong.

    If I understand this correctly, the `esmLOCKload`s are simply
    loads, but they make sure that the compiler emits instructions
    that tie into your atomic support framework.

    So far so good.

    Earlier you wrote,
    "esmINTERFERENCE is a query on the state of th event and whether
    a 2nd party has performed a memory action that kills the event."
    Ok, I take this to mean _at the time of the store_, which
    appears later in the program. If it were simply a query at the
    time it was written in the code, then it seems like it opens a
    TOCTOU bug (what if something interfers after the check, but
    before the the locked store at the end?). Or perhaps it signals
    to the architecture that this is the point to which the
    processor should rollback, if anything subsequently fails; sort
    of a `setjmp` kind of thing that conceptually returns twice.

    The "if( esmINTERFERENCE() )" not only queries the state of the
    participating cache lines, it sets up a guarding-shadow over
    the STs to follow such that if killing-interference is detected
    none of the STs are performed AND control is transferred to the
    branch label (outside of the event). So not only does it query
    the state at its execution, it monitors that over the rest of
    the event.

    BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
    is that if we have reached the manifestation phase of the event
    (that is ready to do all the stores) then we should be able to
    drive the event forward by not sending the data in response to
    the SNOOP. We send a Nak (based on priority) or the data.

    If the SNOOPing core receives a NaK and it is not attempting an
    event the request is simply reissued. If it is attempting, then
    its event fails. So a core benignly accessing data that another
    core is attempting an event upon, is delayed by an interconnect
    round trip--but if it is attempting, it knows its event will
    fail and it will either retry from before the event started,
    or it will goto the interference-label where other facilities
    are at its disposal.

    Ok.

    Assuming nothing has poisoned the event, the various pointer
    stores inside the conditional are conditionally performed; the
    `esmLOCKstore` to `from->next` commits the event, at which
    point all side effects are exposed and durable.

    From a SW perspective, that is accurate enough.

    From a HW perspective, the participating STs are performed and
    held in a buffer, and when it is known the event has succeeded,
    a bit is flipped and these writes to memory all appear to have
    been performed in the same cycle. One cycle earlier and no
    interested 2nd party has seen any of them, the subsequent cycle
    any interested 2nd party will see them all.

    I should also note: ESM works even when there are no caches in
    the system. ...

    Ok, good to go.

    Is that correct? Let's suppose that it is.

    But later you wrote,

    So, if you want the property whereby the lock disappears on any
    control transfer out of the event {exception, interrupt, SVC, SVR, ...}; then you want to use my ATOMIC stuff; otherwise, you can use the
    normal ATOMIC primitives everyone and his brother provide.

    Well, what precisely do you mean here when you say, "if you want
    the property whereby the lock disappears on any control transfer
    out of the event"?

    If you want that property--you use the tools at hand.
    If you don't, just use them as primitive generators.

    I wasn't clear enough here. I'm asking what, exactly, you mean
    by this _property_, not what you mean when you write that one
    can use these atomic events if one wants the property. That is,
    I'm asking you to precisely define what it means for a lock to
    "disappear".

    It seems clear enough _now_ that once the event concludes
    successfully the lock value is set to whatever it was set to in
    during the event, but that wasn't clear to me earlier and I
    wanted confirmation.

    In particular, it seems odd to me that one would bother with a
    lock of some kind during an event if it didn't remain set after
    the event. If you do all of your critical section stuff inside
    of the event, and setting the lock in the event is not visible
    until the event concludes, why bother? If on the other hand you
    use the event to set the lock, why bother doing additional work
    inside the event itself, but in this case I definitely don't
    want something else to come along and just unset the lock on its
    own, higher priority or not.

    What I _hope_ you mean is that if you transfer "out of the
    event" before the `esmLOCKstore` then anything that's been done
    since the "event" started is rolled back, but NOT if control
    transfer happens _after_ the `esmLOCKstore`. If THAT were the
    case, then this entire mechanism is useless.

    One can use an esmLOCKstore( *SP, 0 ) to abandon an event. HW
    detects that *SP is not participating in the event, and uses
    the lock bit in the instruction to know it is time to leave
    the event in the failed manner. No ST is performed to the
    non-participating cache line. {{I guess I could wrap the
    abandonment into a macro ...}}

    That's not what I was asking. I'm not interested in what
    happens if I transfer control out of the event myself, as it is
    obviously my responsibility to ensure whatever contract the
    hardware requires in that case is fulfilled. I'm interested in
    what happens when control is usurped by some higher priority thing.

    Perhaps you would clarify?

    But for the time being, let's assume that `esmLOCKstore` is
    conceptually a commit that ends the atomic event, and the store
    is durable and observable after that. Let's see if we can
    construct a simple spinlock from these primitives.

    Notice that only the last ST has the esmLOCKstore() applied.
    HW knows that STs to participating lines have the ATOMIC
    property. The final ST uses the LOCK bit to denote the end
    of the event.

    Again, not what I was asking. I'm trying to understand what you
    meant when you talked about disappearing locks.

    Again, I think you meant that, if an atomic event that sets a
    lock variable fails, whether it got far enough along to set the
    variable inside the event or not is not observable if the event
    fails. But if the event succeeds, then the variable is set and
    that's that. IF that's the case, then great! But if not, then
    disappearing locks are bad. This is especially murky to me
    because there seem to be bits of OS primitives, like threads and
    priority, that are known directly to the hardware.

    typedef struct Spinlock Spinlock;
    struct Spinlock {
        uint32_t lock;
    };

    uint32_t
    xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
    {
        uint32_t v = esmLOCKload(&lk->lock);
        if (v == expected && !esmINTERFERENCE()) {
            esmLOCKstore(&lk->lock, val);
        }
        return v;
    }

    If the interference check is desired at all:: (see below)
    I think you should move the interference check in front of the
    expected check, but that is a first impression--have to think
    about it more.

    Probably, given that it sets up a guard over the event, as you
    wrote above.

    void
    spin_acquire(volatile Spinlock *lk)
    {
        splhi();
        while (xcas(lk, 0, 1) != 0)
            relax(); // Spin loop intrinsic.
    }

    void
    spin_release(volatile Spinlock *lk)
    {
        xcas(lk, 1, 0);
        // Maybe we should `spllo` here, maybe not. Either choice
        // reflects lots of assumptions about whether or not we might
        // take and hold multiple locks concurrently, etc.
        spllo();
    }

    Does this seem correct? It is not immediately clear to me
    whether `esmINTERFERENCE` is actually needed here, but
    avoiding the store seemed fine.

    There are MANY cases where the interference query is not needed.
    Without the interference query, deadly interference backs the
    core up to the first instruction of the event. With the interference
    check, if deadly interference happens before the query, control
    reverts to before the event, and afterwards, control is transferred
    to the interference-label.

    One could imagine shortening
    `xcas` to simply:

    uint32_t
    xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
    {
        uint32_t v = esmLOCKload(&lk->lock);
        if (v == expected) esmLOCKstore(&lk->lock, val);
        return v;
    }

    Yes, unless you want the alternate exit strategy, this works fine.
    In fact test and set is written::

    boolean char TestAndSet( type char *p )
    {
        boolean t = esmLOCKload ( *p );
        esmLOCKstore( *p, 1 );
        return t;
    }

    Should interference transpire between load and store, control
    reverts back to the LD again, so the return value is the LD
    which did not encounter interference.

    Ok, so perhaps this is a reasonable implementation of spinlock
    acquire, as I wrote to Stefan yesterday:

    void
    spinlock_acquire(volatile Spinlock *lk)
    {
        while (esmLOCKload(&lk->lock) != 0)
            cpu_spin_hint();
        esmLOCKstore(&lk->lock, 1);
    }

    Yes? If this is preempted, or another CPU wins the race on the
    lock, the hardware will back this up and start over in the while
    loop, right?

    For the sizes of CSs this covers--it prevents priority inversion.

    Except you've admitted that this is not a general solution, as
    it cannot "cover" arbitrary software-defined critical sections:
    the window isn't big enough and we've seen how even a fairly
    small problem may exceed the number of cache lines supported by
    one of these events.

    Remember only the lines participating in the event count. For
    example, say one had sprintf() inside the event. None of the
    lines touched by sprintf() are participating in the event,
    and are not counted against the 8 cache lines available.

    This doesn't change the point, though. Anything that touches
    more than 8 cache lines is not suitable for an atomic event.
    Insertion into a doubly linked list is already up to 6 lines,
    recall that my original hypothetical also involved popping off a
    queue, which adds another (the queue head). If I throw in
    incrementing a couple of counters I can't do it. And this seems
    like a perfectly reasonable real-world scenario; consider a work
    stealing thread scheduler that takes a task from one core to
    another and keeps some stats on how things move.

    Or maybe even just swapping places between two elements in a
    linked list. For example:

    void
    swap_places(Node **head, Node *a, Node *b)
    {
        Node *an, *ap, *bn, *bp;

        assert(head != NULL);
        assert(a != NULL);
        assert(b != NULL);

        if (a == b)
            return;

        if (*head == a)
            *head = b;
        else if (*head == b)
            *head = a;

        an = a->next;
        ap = a->prev;
        bn = b->next;
        bp = b->prev;

        b->next = an;
        if (an != NULL)
            an->prev = b;
        b->prev = ap;
        if (ap != NULL)
            ap->next = b;

        a->next = bn;
        if (bn != NULL)
            bn->prev = a;
        a->prev = bp;
        if (bp != NULL)
            bp->next = a;
    }

    Perhaps we might wrap this in an atomic event by annotating it
    like so:

    void
    swap_places(Node **head, Node *a, Node *b)
    {
        Node *hp, *an, *ap, *bn, *bp;

        assert(head != NULL);
        assert(a != NULL);
        assert(b != NULL);

        if (a == b)
            return;

        esmLOCKprefetch(*head);
        if (*head == a)
            *head = b;
        else if (*head == b)
            *head = a;

        an = esmLOCKload(a->next);
        ap = esmLOCKload(a->prev);
        bn = esmLOCKload(b->next);
        bp = esmLOCKload(b->prev);

        b->next = an;
        if (an != NULL) {
            esmLOCKprefetch(an->prev);
            an->prev = b;
        }
        b->prev = ap;
        if (ap != NULL) {
            esmLOCKprefetch(ap->next);
            ap->next = b;
        }

        a->next = bn;
        if (bn != NULL) {
            esmLOCKprefetch(bn->prev);
            bn->prev = a;
        }
        if (bp != NULL) {
            esmLOCKprefetch(bp->next);
            bp->next = a;
        }
        esmLOCKstore(a->prev, bp);
    }

    By my count, that could be touching 9 cache lines, even with
    all of the various structures well-aligned; indeed, if the Node
    type were annotated with a constraint that it be aligned on a
    cacheline boundary, that would be maximally pessimistic in this
    regard.

    Anyway, I'd like to understand what the hardware does when a
    program attempts an event where the conditions for success
    cannot be met.

    ---------------------
    So now I, as a systems software person, have to ask design
    question around how I want to use these architectural features,
    if at all, weighing various pros and cons. This is where I'm
    struggling with your proposal: given that I'm going to have to
    implement more general support for the things that the
    architecture gives me anyway, what do I gain by also supporting
    the specialized mechanisms?

    All I can say about this, was that when I presented ASF (ESM's
    predecessor) to Microsoft and spent a couple hours explaining
    to them how it works, MS wanted it implemented as rapidly as
    AMD could pull it off. Dave Cutler was particularly impressed.

    I'm not sure that a Dave C. ref is the flex you are making it
    out to be.

    And here I thought you had an imagination .....

    You admitted you're not an OS person and asked other people to
    poke holes in your proposal. I'm sorry if you find the holes
    I'm poking upsetting.

    I enjoy your poking of holes--I really do

    But you seem like the calculus student that keeps asking what
    problem can I solve now that I can differentiate.

    Huh. Well, I don't know what to tell ya about that, guy.

    Look, here's the deal. This all sounds neat. And my point is
    not to bore you with a bunch of impudent challenges. You've
    obviously had a long and distinguished career, and I can't
    imagine what purpose that would serve, other than being a jerk.
    I'm not interested in that.

    But there isn't a lot of publicly available documentation for
    any of this, at least none that I could find, and from the way
    it's been described thus far it seems like there are a number of
    edge cases that it is not clear how one is supposed to address.
    So I want to understand the boundaries of where these hardware
    facilities are applicable, given architectural limitations. I
    think this thread has shown that those boundaries exist. The
    universe of interesting and useful software is big, and someone,
    somewhere, is going to trip over those limitations sooner or
    later. Once we accept that, we must ask, "Then what?" It seems
    obvious that programmers are going to have to find techniques to
    work around those limitations, and that the mechanisms chosen
    are unknowable to hardware.

    But now we're back in the same boat of software having to deal
    with all the thorny problems that these mechanisms were designed
    to address in the first place, like priority inversion: take the
    spinlock from earlier, the hardware has no semantic knowledge
    that the lock is used as a mutex, and can't know; it's just some
    bit in memory somewhere that was set during an atomic event.

    Maybe ESM helps in a race when trying to acquire the spinlock,
    but once it's acquired, it's acquired, and because the spinlock
    is not a primitive understood by the hardware, the hardware has
    no way of incorporating the spinlock's semantics into its view
    of priority, threading, etc.

    So while an atomic event might help me in the implementation of
    a generic mutex of some kind, so that I can protect critical
    sections that can't be handled with the atomic event framework
    directly because of their size, that's the extent of the help it
    can give me, because once the mutex is acquired the hardware has
    no insight into the fact that it's a mutex. So if some higher
    priority thing subsequently comes along and preempts the thread
    holding that lock and tries to take that lock itself, then the
    hardware can't help me out anymore, and I'm back to a
    priority-inversion deadlock that has to be dealt with by
    software.

    The point is, even if the hardware gives me a cool
    mechanism for helping prevent priority inversion problems,
    programmers are inevitably going to have to handle those
    situations themselves anyway.

    This is the sort of thing I'm trying to communicate, it's really
    not me sitting here tossing out barbs like, "oh yeah?! Well,
    how about THIS one, smart guy?"

    To use your mathematics analogy, you are describing a technique
    for differentiating real single-variable continuous functions.
    I'm pointing out that this is not applicable to all functions,
    because discontinuous functions and higher order spaces exist,
    and giving examples of such functions that are important parts
    of real-world problems, not just abstract distractions.

    I can't help feel like you're telling me to just stick to
    polynomials over R because calculus is useful. Calculus may be
    useful, but sometimes you really do need analysis.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to monnier@iro.umontreal.ca on Thu Mar 27 18:34:00 2025
    In article <jwv7c4ajq11.fsf-monnier+comp.arch@gnu.org>,
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    The illustrated code is using 6 participating cache lines.
    Where local variables are kept (stack, registers) does not
    count against the tally of participating cache lines.

    Huh. How would this handle something like an MCS lock, where
    the lock node may be on a stack, though accessible globally in
    some virtual address space?

    ESM doesn't distinguish the stack from other places. It just doesn't
    pay attention to memory accesses that aren't marked specially. So, if
    your lock is on the stack it will still work, as long as you use
    `esmLOCKload` to read it.

    Ah, I see. Ok, that makes more sense.
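    Just to check my own understanding, here is how I might sketch
    the release half of an MCS-style lock with the waiter's node on
    the caller's stack. The types are mine, cpu_spin_hint() is the
    same hint as in my spinlock example, and esmLOCKabandon() is just
    a placeholder name for "some way to abandon the event"; none of
    this comes from the architecture documentation:

        typedef struct MCSNode MCSNode;
        struct MCSNode {
                MCSNode *volatile next;  /* successor waiting behind us   */
                volatile int locked;     /* spun on by that successor     */
        };

        typedef struct MCSLock MCSLock;
        struct MCSLock {
                MCSNode *tail;           /* last node in queue, or NULL   */
        };

        void
        mcs_release(MCSLock *lk, MCSNode *me)  /* `me` may be on our stack */
        {
                /* Both loads participate in the event, including the
                   cache line holding the stack-resident node. */
                MCSNode *succ = esmLOCKload(me->next);
                MCSNode *tail = esmLOCKload(lk->tail);

                if (succ == NULL && tail == me) {
                        /* No waiter: clear the tail, concluding the event. */
                        esmLOCKstore(lk->tail, NULL);
                        return;
                }

                /* Someone is queued, or mid-queueing, behind us; there is
                   nothing to publish atomically, so abandon the event and
                   hand the lock to the successor directly. */
                esmLOCKabandon();
                while ((succ = me->next) == NULL)
                        cpu_spin_hint();  /* successor still linking in */
                succ->locked = 0;         /* pass ownership             */
        }

    The acquire side would swap `me` into `lk->tail` in another small
    event; I've left it out to keep the sketch short.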

    Again, not what I was asking. I'm trying to understand what you
    meant when you talked about disappearing locks.

    He was talking about the conceptual lock used internally by the CPU
    to implement an ESM event, that gives their name to `esmLOCK*`.
    I.e. the same kind of lock used in other CPUs to implement LL/SC and
    that gives its name to "load locked" (tho it's also sometimes named
    "load linked").

    Ok. Thanks, Stefan.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Thu Mar 27 14:27:22 2025
    The illustrated code is using 6 participating cache lines.
    Where local variables are kept (stack, registers) does not
    count against the tally of participating cache lines.

    Huh. How would this handle something like an MCS lock, where
    the lock node may be on a stack, though accessible globally in
    some virtual address space?

    ESM doesn't distinguish the stack from other places. It just doesn't
    pay attention to memory accesses that aren't marked specially. So, if
    your lock is on the stack it will still work, as long as you use
    `esmLOCKload` to read it.

    Again, not what I was asking. I'm trying to understand what you
    meant when you talked about disappearing locks.

    He was talking about the conceptual lock used internally by the CPU
    to implement an ESM event, that gives their name to `esmLOCK*`.
    I.e. the same kind of lock used in other CPUs to implement LL/SC and
    that gives its name to "load locked" (tho it's also sometimes named
    "load linked").


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Thu Mar 27 19:18:35 2025
    On Thu, 27 Mar 2025 17:19:21 +0000, Dan Cross wrote:

    In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
    [snip]
    1. `from` may point to a cache line
    2. `to` may point to a different cache line.
    3. Does your architecture use a stack?

    No

    You make multiple references to a stack just below. I
    understand stack references are excluded from the atomic event,
    but I asked whether the architecture uses a stack. Does it, or
    doesn't it?

    What sort of alignment criteria do you impose?

    On the participating data none. On the cache line--cache line.

    I meant for data on the stack. The point is moot, though, since
    as you said stack data does not participate in the atomic event.

    Stack does not NECESSARILY participate. However, if you perform
    an esmLOCKload( SP[23] ) then that line on the stack is participating.

    At first blush, it seems to me that
    the pointers `from_next`, `from_prev`, and `to_next` could be
    on the stack and if so, will be on at least one cache line,
    and possibly 2, if the call frame for `MoveElement` spans
    more than one.

    The illustrated code is using 6 participating cache lines.
    Where local variables are kept (stack, registers) does not
    count against the tally of participating cache lines.

    Huh.

    There are 6 esmLOCKxxxxx() so there are 6 participating lines--
    and these are absolutely independent from where the variables
    are located. The variables carry information but do not participate;
    they carry data between operations on the participating lines.

    How would this handle something like an MCS lock, where
    the lock node may be on a stack, though accessible globally in
    some virtual address space?

    As illustrated above esmLOCKload( SP[23]) will cause a line on
    the stack to participate in the event.
    ----------------
    But later you wrote,

    So, if you want the property whereby the lock disappears on any
    control transfer out of the event {exception, interrupt, SVC, SVR, ...}; then you want to use my ATOMIC stuff; otherwise, you can use the
    normal ATOMIC primitives everyone and his brother provide.

    Well, what precisely do you mean here when you say, "if you want
    the property whereby the lock disappears on any control transfer
    out of the event"?

    If you want that property--you use the tools at hand.
    If you don't just use them as primitive generators.

    I wasn't clear enough here. I'm asking what, exactly, you mean
    by this _property_,

    The property that if an interrupt (or exception or SVC or SVR)
    prevents executing the event to its conclusion, HW makes the
    event look like it never started. So that when/if control returns
    we have another chance to execute the event as-if ATOMICally.

    not what you mean when you write that one
    can use these atomic events if one wants the property. That is,
    I'm asking you to precisely define what it means for a lock to
    "disappear".

    It seems clear enough _now_ that once the event concludes
    successfully the lock value is set to whatever it was set to in
    during the event, but that wasn't clear to me earlier and I
    wanted confirmation.

    Not "the lock value" but "all participating values" are set
    and become visible system-wide in a single instant. No 3rd
    party sees some set and others remain unset. Everybody sees
    all of them change value between this-clock and its successor.

    In particular, it seems odd to me that one would bother with a
    lock of some kind during an event if it didn't remain set after
    the event.

    Consider CompareTripleSwapFour() as an atomic primitive. How
    would you program this such that nobody nowhere could ever
    tell that the four updated locations changed in any order other
    than simultaneously ??? Without using a LOCK to guard the
    event ??? And without demanding that all four updates are
    to the same cache line ???

    {{Nobody nowhere included ATOMIC-blind DMA requests}}

    If you do all of your critical section stuff inside
    of the event, and setting the lock in the event is not visible
    until the event concludes, why bother? If on the other hand you
    use the event to set the lock, why bother doing additional work
    inside the event itself, but in this case I definitely don't
    want something else to come along and just unset the lock on its
    own, higher priority or not.

    It is clear you do not understand the trouble HW takes to implement
    even DCADS where between any 2 clocks, one (or more) SNOOPs can
    take the lines you are DCADSing. It is this property of cache
    coherence that gets in the way of multi-operation ATOMIC events.

    In 2004 Fred Weber came to me and asked: "Why can't we (AMD) give
    Microsoft the DCADS they want?" I dug into it, and ASF was my
    solution in x86 ISA, ESM is my solution in RISC-based ISA. It
    removes the need to add ATOMIC primitives to ISA over generations.

    It also provides the ability to do other things--which apparently
    you will not use--sort of like complaining that your new car has
    auto-park a feature you will never use.

    What I _hope_ you mean is that if you transfer "out of the
    event" before the `esmLOCKstore` then anything that's been done
    since the "event" started is rolled back, but NOT if control
    transfer happens _after_ the `esmLOCKstore`. If THAT were the
    case, then this entire mechanism is useless.

    One can use an esmLOCKstore( *SP, 0 ) to abandon an event. HW
    detects that *SP is not participating in the event, and uses
    the lock bit in the instruction to know it is time to leave
    the event in the failed manner. No ST is performed to the
    non-participating cache line. {{I guess I could wrap the
    abandonment into a macro ...}}


    Again, not what I was asking. I'm trying to understand what you
    meant when you talked about disappearing locks.

    If the event cannot be completed due to interrupt or exception or
    SVC or SVR, the HW backs up the event such that it looks like
    it never started. So when control returns, the ATOMIC event runs
    in its entirety ATOMICally !! So, the illusion of ATOMICITY is
    preserved.

    Again, I think you meant that, if an atomic event that sets a
    lock variable fails,

    The event has NO lock variable--it has up to 8 cache line address
    monitors.

    whether it got far enough along to set the
    variable inside the event or not is not observable if the event
    fails.

    Non-participating cache lines can be used to leak information
    out of the event (like for debug purposes). Participating data
    does not leak.

    But if the event succeeds, then the variable is set and
    that's that. IF that's the case, then great! But if not, then
    disappearing locks are bad.

    It is NOT a FRIGGEN LOCK--it is a set of address monitors which
    are used to determine the success of the event -- the monitors
    remain invisible to SW just like miss buffers remain invisible
    to SW. The monitors are not preservable across privilege changes.

    This is especially murky to me
    because there seem to be bits of OS primitives, like threads and
    priority, that are known directly to the hardware.

    Just its instantaneous privilege level--which is congruent to
    its register file, Root Pointer, ASID, and operating modes.
    ------------------

    Ok, so perhaps this is a reasonable implementation of spinlock
    acquire, as I wrote to Stefan yesterday:

    void
    spinlock_acquire(volatile Spinlock *lk)
    {
    while (esmLOCKload(&lk->lock) != 0)
    cpu_spin_hint();
    esmLOCKstore(&lk->lock, 1);
    }

    Yes? If this is preempted, or another CPU wins the race on the
    lock, the hardware will back this up and start over in the while
    loop, right?

    Yes.
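    And a sketch of the matching release (my assumption being that a
    plain store, plus whatever ordering the surrounding code needs,
    is sufficient once the critical section is done--no event
    required):

        void
        spinlock_release(volatile Spinlock *lk)
        {
                lk->lock = 0;  /* becomes visible via normal coherence */
        }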
    ------------
    Remember only the lines participating in the event count. For
    example, say one had sprintf() inside the event. None of the
    lines touched by springt() are participating in the event,
    and are not counted against the 8 cache lines available.

    This doesn't change the point, though. Anything that touches
    more than 8 cache lines is not suitable for an atomic event.

    I see what you are saying, but the event is allowed to have an
    unbounded number of touches to non-participating lines.
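    For example, a sketch (the counter transfer and trace buffer are
    invented for illustration): the two counters participate in the
    event, while everything sprintf() touches does not, and so costs
    nothing against the 8-line budget:

        #include <stdio.h>
        #include <stdint.h>

        void
        transfer(uint64_t *from, uint64_t *to, uint64_t amount, char *trace)
        {
                uint64_t f = esmLOCKload(*from);   /* participating line 1 */
                uint64_t t = esmLOCKload(*to);     /* participating line 2 */

                /* non-participating: unbounded number of lines touched */
                sprintf(trace, "moving %llu", (unsigned long long)amount);

                *to = t + amount;                  /* staged on a participating line */
                esmLOCKstore(*from, f - amount);   /* concludes the event            */
        }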

    Insertion into a doubly linked list is already up to 6 lines,

    I moved a doubly linked element from a CDS and inserted it
    somewhere else doubly-linked in the same CDS in 6 cache lines.
    You need to refine how you count.

    BOOLEAN InsertElement( Element *el, Element *to )
    {
    tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    el->next = tn;
    el->prev = to;
    to->next = el;
    esmLOCKstore( tn->prev, el );
    return TRUE;
    }
    return FALSE;
    }


    recall that my original hypothetical also involved popping off a
    queue, which adds another (the queue head). If I threw in
    incrementing a couple of counters I can't do it. And this seems
    like a perfectly reasonable real-world scenario; consider a work
    stealing thread scheduler that takes a task from one core to
    another and keeps some stats on how things move.

    Or maybe even just swapping places between two elements in a
    linked list. For example:

    The example has 3 structures each requiring 3 participating lines.
    {themself, the struct their next pointer points at and the struct
    their prev pointer points at}

    3×3 = 9 and there is no getting around it.
    -------------------------

    But there isn't a lot of publicly available documentation for
    any of this, at least none that I could find, and from the way
    it's been described thus far it seems like there are a number of
    edge cases that it is not clear how one is supposed to address.
    So I want to understand the boundaries of where these hardware
    facilities are applicable, given architectural limitations. I
    think this thread has shown that those boundaries exist. The
    universe of interesting and useful software is big, and someone,
    somewhere, is going to trip over those limitations sooner or
    later. Once we accept that, we must ask, "Then what?" It seems
    obvious that programmers are going to have to find techniques to
    work around those limitations, and that the mechanisms chosen
    are unknowable to hardware.

    Fair enough.

    But now we're back in the same boat of software having to deal
    with all the thorny problems that these mechanisms were designed
    to address in the first place, like priority inversion: take the
    spinlock from earlier, the hardware has no semantic knowledge
    that the lock is used as a mutex, and can't know; it's just some
    bit in memory somewhere that was set during an atomic event.

    It is not a FRIGGEN lock, it is a set of address monitors which
    guard the instructions in the event. A Lock is a SW concept
    which requires a value change to announce to others that "you
    got it".

    Maybe ESM helps in a race when trying to acquire the spinlock,
    but once it's acquired, it's acquired, and because the spinlock
    is not a primitive understood by the hardware, the hardware has
    no way of incorporating the spinlock's semantics into its view
    of priority, threading, etc.

    Consider a 1024 core system and the time-quantum goes off and
    every core wants to rotate the thread it is running; putting
    it at the back of the execution queue and taking off the one on
    the front to run. Nick McLaren timed a big SUN server (100 core)
    on this and it could take up to 6 seconds !! for something that
    should take a few milliseconds. This is because of the BigO( n^3 )
    nature of bus traffic wrt SNOOPing on that system.

    One can use ESM to convert that particular case to BigO( 3 )
    or in the general case of random insert and removal times:
    BigO( ln( n ) ).
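    A sketch of what I have in mind (the types and queue layout are
    invented for illustration, and empty or single-element queues are
    ignored): each core rotates its own thread in one small event of
    about 5 participating lines:

        typedef struct Thread Thread;
        struct Thread { Thread *next; /* ... scheduler state ... */ };

        typedef struct Queue Queue;
        struct Queue { Thread *head; Thread *tail; };

        Thread *
        rotate(Queue *q, Thread *cur)
        {
                Thread *front = esmLOCKload(q->head);  /* touch every participant... */
                Thread *tail  = esmLOCKload(q->tail);
                Thread *next  = esmLOCKload(front->next);
                esmLOCKprefetch(tail->next);
                esmLOCKprefetch(cur->next);

                cur->next  = NULL;                     /* ...then modify them        */
                tail->next = cur;                      /* running thread to the back */
                q->tail    = cur;
                q->head    = next;                     /* front thread comes off     */
                esmLOCKstore(front->next, NULL);       /* concludes the event        */
                return front;
        }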

    So while an atomic event might help me in the implementation of
    a generic mutex of some kind, so that I can protect critical
    sections that can't be handled with the atomic event framework
    directly because of their size, that's the extent of the help it
    can give me, because once the mutex is acquired the hardware has
    no insight into the fact that it's a mutex.

    Exactly:: once you convert address-based mutual exclusion to
    data-based mutual exclusion, HW no longer sees it at all.

    So if some higher
    priority thing subsequently comes along and preempts the thread
    holding that lock and tries to take that lock itself, then the
    hardware can't help me out anymore, and I'm back to a
    priority-inversion deadlock that has to be dealt with by
    software.

    HW can only "take" address based locking, not data-based locking.

    The point is, even if the hardware gives me a cool
    mechanism for helping prevent priority inversion problems,
    programmers are inevitably going to have to handle those
    situations themselves anyway.

    Never promised you a rose garden.

    This is the sort of thing I'm trying to communicate, it's really
    not me sitting here tossing out barbs like, "oh yeah?! Well,
    how about THIS one, smart guy?"

    Look at it this way:: over the decade from 2000-2010 x86 added
    a new ATOMIC instruction every iteration.

    Using ASF or ESM; my architecture would never have to add one.
    SW has the tools to build what it wants/needs.
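    For example, a fetch-and-add falls out as a small event (a sketch;
    on interference the HW simply replays it from the first esmLOCK
    instruction, exactly as with the spinlock acquire discussed
    earlier):

        uint64_t
        fetch_and_add(uint64_t *p, uint64_t v)
        {
                uint64_t old = esmLOCKload(*p);  /* one participating line */
                esmLOCKstore(*p, old + v);       /* concludes the event    */
                return old;
        }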


    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Thu Mar 27 20:31:42 2025
    On Thu, 27 Mar 2025 9:55:08 +0000, Terje Mathisen wrote:

    Stefan Monnier wrote:
    a going to make a big difference here. But I also wonder at the
    internal implementation of these instructions in e.g. that
    Fujitsu chip that allows throughput to stay more or less
    constant as the number of cores increases; I'd think that cache
    coherency traffic as threads contend for a single cache line
    would create a bottleneck leading to (at least) a linear
    slow down.

    IIUC the idea is that the atomic-add is sent to the memory and performed
    there, instead of moving the data to the cache. So there's no
    contention for the cache line because the data is not in a cache at all.

    This one I have never understood the use/need for, except to provide
    global shared counters of some sort:

    In that case it would make perfect sense to allow any core to send (and immediately forget) a message "increment counter X by 1", but at least
    in my own coding, any LOCK XADD has been performed because I now need to
    do something with either the previous or the updated value of the
    counter.

    Having the atomic-add performed by a memory controller, returning the
    result, would make it almost perfectly scalable, but would also limit
    the speed of this operation to the bus transaction speed, even when
    there is zero contention, right?

    Yes, BUT::

    If the atomic-add is performed at the level in the cache having write permission on the data, then, much of the time, the interconnect
    speed and latency become way less pronounced.

    I guess if the atomic-add has been implemented at all/most cache levels
    as well, then it would indeed perform better for all workloads, the
    problem is that under contention you still have to flush the variable
    out to the RAM array, right?

    My point exactly.

    Let's face it, the size of an integer adder compared to the area of
    the SRAMs making up the cache is insignificant--even at the L1 level.


    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Thu Mar 27 21:47:03 2025
    On Thu, 27 Mar 2025 17:19:21 +0000, Dan Cross wrote:

    In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    -------------------
    Or maybe even just swapping places between two elements in a
    linked list. For example:

    void
    swap_places(Node **head, Node *a, Node *b)
    {
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);

    if (a == b)
    return;

    esmLOCKprefetch(*head);

    This should be a load not prefetch--you want the value of *head

    if (*head == a) // see !
    *head = b;
    else if (*head == b)
    *head = a;

    There is a ESM rule that states:: all participating cache lines
    must be touched before any participating cache lines can be
    modified.

    Also note: Participating cache lines are checked for write permission
    at touch time, and on cache miss, read with intent to modify.

    an = esmLOCKload(a->next);
    ap = esmLOCKload(a->prev);
    bn = esmLOCKload(b->next);
    bp = esmLOCKload(b->prev);

    b->next = an;
    if (an != NULL) {
    esmLOCKprefetch(an->prev);
    an->prev = b;
    }
    b->prev = ap;
    if (ap != NULL) {
    esmLOCKprefetch(ap->next);
    ap->next = b;
    }

    a->next = bn;
    if (bn != NULL) {
    esmLOCKprefetch(bn->prev);
    bn->prev = a;
    }
    if (bp != NULL) {
    esmLOCKprefetch(bp->next);
    bp->next = a;
    }
    esmLOCKstore(a->prev, bp);
    }


    - Dan C.

    What I think you want:: (ignoring that 9 participating lines exceeds the 8-line limit)

    void
    swap_places(Node **head, Node *a, Node *b)
    {
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);

    if (a == b)
    return;

    top_of_ATOMIC_event:

    // this is the recovery point if you don't use esmINTERFERENCE()
    // the very next instruction begins the event.

    esmLOCKprefetch( an = esmLOCKload(a->next) );
    esmLOCKprefetch( ap = esmLOCKload(a->prev) );
    esmLOCKprefetch( bn = esmLOCKload(b->next) );
    esmLOCKprefetch( bp = esmLOCKload(b->prev) );

    Node *Ehead = esmLOCKload(*head);

    // by placing all the touching before any manifestation, you put
    // all the touch latency* in the code before it has tried to damage any
    // participating memory location. (*) and TLB latency and 2nd party
    // observation of your event.

    // this would be the point where you would insert if( esmINTERFERENCE() )
    // if you wanted control at a known failure point rather than at the
    // top of the event on failure.

    if (Ehead == a)
    *head = b;
    else if (Ehead == b)
    *head = a;

    b->next = an;
    if (an != NULL) {
    an->prev = b;
    }
    b->prev = ap;
    if (ap != NULL) {
    ap->next = b;
    }

    a->next = bn;
    if (bn != NULL) {
    bn->prev = a;
    }
    if (bp != NULL) {
    bp->next = a;
    }
    esmLOCKstore(a->prev, bp);
    }

    // now manifestation has lowest possible latency (as seen by this core alone)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Fri Mar 28 01:01:59 2025
    In article <601781e2c91d42a73526562b419fdf02@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 27 Mar 2025 17:19:21 +0000, Dan Cross wrote:
    In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
    [snip]
    1. `from` may point to a cache line
    2. `to` may point to a different cache line.
    3. Does your architecture use a stack?

    No

    You make multiple references to a stack just below. I
    understand stack references are excluded from the atomic event,
    but I asked whether the architecture uses a stack. Does it, or
    doesn't it?

    What sort of alignment criteria do you impose?

    On the participating data none. On the cache line--cache line.

    I meant for data on the stack. The point is moot, though, since
    as you said stack data does not participate in the atomic event.

    Stack does not NECESSARILY participate. However, if you perform
    an esmLOCKload( SP[23] ) then that line on the stack is participating.

    Very good.

    At first blush, it seems to me that
    the pointers `from_next`, `from_prev`, and `to_next` could be
    on the stack and if so, will be on at least one cache line,
    and possibly 2, if the call frame for `MoveElement` spans
    more than one.

    The illustrated code is using 6 participating cache lines.
    Where local variables are kept (stack, registers) does not
    count against the tally of participating cache lines.

    Huh.

    There are 6 esmLOCKxxxxx() so there are 6 participating lines--
    and these are absolutely independent from where the variables
    are located. The variables carry information but do not participate;
    they carry data between operations on the participating lines.

    That was clearly in the code. That was not what the "Huh" was
    in response to.

    How would this handle something like an MCS lock, where
    the lock node may be on a stack, though accessible globally in
    some virtual address space?

    As illustrated above esmLOCKload( SP[23]) will cause a line on
    the stack to participate in the event.

    Ok, good to know.

    ----------------
    But later you wrote,

    So, if you want the property whereby the lock disappears on any control transfer out of the event {exception, interrupt, SVC, SVR, ...}; then you want to use my ATOMIC stuff; otherwise, you can use the normal ATOMIC primitives everyone and his brother provide.

    Well, what precisely do you mean here when you say, "if you want
    the property whereby the lock disappears on any control transfer
    out of the event"?

    If you want that property--you use the tools at hand.
    If you don't just use them as primitive generators.

    I wasn't clear enough here. I'm asking what, exactly, you mean
    by this _property_,

    The property that if an interrupt (or exception or SVC or SVR)
    prevents executing the event to its conclusion, HW makes the
    event look like it never started. So that when/if control returns
    we have another chance to execute the event as-if ATOMICally.

    Ok. So let's be crystal clear here: after the event concludes
    successfully, a subsequent exception, interrupt, or whatever,
    will not magically roll back any of the values set and made
    visible system-wide as a result of the successful conclusion of
    the atomic event. Correct? That's what I've been asking.

    not what you mean when you write that one
    can use these atomic events if one wants the property. That is,
    I'm asking you to precisely define what it means for a lock to
    "disappear".

    It seems clear enough _now_ that once the event concludes
    successfully the lock value is set to whatever it was set to in
    during the event, but that wasn't clear to me earlier and I
    wanted confirmation.

    Not "the lock value" but "all participating values" are set
    and become visible system-wide in a single instant.

    Sure. But there was only one participating value in the
    spinlock example. ;-}

    No 3rd
    party sees some set and others remain unset. Everybody sees
    all of them change value between this-clock and its successor.

    Sure. You've said this all along, and I've never disputed it.

    In particular, it seems odd to me that one would bother with a
    lock of some kind during an event if it didn't remain set after
    the event.

    Consider CompareTripleSwapFour() as an atomic primitive. How
    would you program this such that nobody nowhere could ever
    tell that the four updated locations changed in any order other
    than simultaneously ??? Without using a LOCK to guard the
    event ??? And without demanding that all four updates are
    to the same cache line ???

    {{Nobody nowhere included ATOMIC-blind DMA requests}}

    The fundamental issue here is that, when I refer to a "lock" I
    have been talking about a software mutual exclusion primitive,
    and you are referring to something else entirely.

    I would have thought my meaning was clear from context;
    evidently not.

    If you do all of your critical section stuff inside
    of the event, and setting the lock in the event is not visible
    until the event concludes, why bother? If on the other hand you
    use the event to set the lock, why bother doing additional work
    inside the event itself, but in this case I definitely don't
    want something else to come along and just unset the lock on its
    own, higher priority or not.

    It is clear you do not understand the trouble HW takes to implement
    even DCADS where between any 2 clocks, one (or more) SNOOPs can
    take the lines you are DCADSing. It is this property of cache
    coherence that gets in the way of multi-operation ATOMIC events.

    As above, when I refer to "setting the lock" I am referring to
    acquiring a mutex; a software primitive. I am not referring to
    what the hardware does internally. Consider what I wrote above,
    the section that you quoted, in that context.

    In 2004 Fred Weber came to me and asked: "Why can't we (AMD) give
    Microsoft the DCADS they want?" I dug into it, and ASF was my
    solution in x86 ISA, ESM is my solution in RISC-based ISA. It
    removes the need to add ATOMIC primitives to ISA over generations.

    It also provides the ability to do other things--which apparently
    you will not use--sort of like complaining that your new car has
    auto-park a feature you will never use.

    "It is clear you do not understand the trouble SW takes to
    implement even simple things that are beyond the capability of
    hardware."

    (Back at you. I think I've been pretty polite here; I'm getting
    tired of the barbs.)

    What I _hope_ you mean is that if you transfer "out of the
    event" before the `esmLOCKstore` then anything that's been done
    since the "event" started is rolled back, but NOT if control
    transfer happens _after_ the `esmLOCKstore`. If THAT were the
    case, then this entire mechanism is useless.

    One can use an esmLOCKstore( *SP, 0 ) to abandon an event. HW
    detects that *SP is not participating in the event, and uses
    the lock bit in the instruction to know it is time to leave
    the event in the failed manner. No ST is performed to the non-participating cache line. {{I guess I could wrap the
    abandonment into a macro ...}}

    Again, not what I was asking. I'm trying to understand what you
    meant when you talked about disappearing locks.

    If the event cannot be completed due to interrupt or exception or
    SVC or SVR, the HW backs up the event such that it looks like
    it never started. So when control returns, the ATOMIC event runs
    in its entirety ATOMICally !! So, the illusion of ATOMICITY is
    preserved.

    Sigh. You've said this many times over now. I get it. I'm not
    disputing it, nor do I believe I ever have.

    Again, when I said "lock" earlier, I am referring to a mutex
    from the perspective of software. That is, the value of the
    structure member that I called "lock" in the version of the
    `Spinlock` struct that I posted however-many responses ago, and
    that I keep referring to in example code. I am not asking about
    what the hardware does internally. I'm not talking about locked
    bus cycles or whatever. I'm not referring to cache line address
    monitors or any of these other things that you keep referring
    to. I'm talking about the value of the structure member called
    `lock` in the `Spinlock` C structure I posted earlier.

    What I have been asking, over and over again, is whether that
    _value_ "disappears" _after_ an atomic event successfully
    concludes. That is, the value that was written into the memory,
    not whatever internal marker you use in hardware to monitor and
    guard those stores until the atomic event completes or is
    otherwise exited, aborted, cancelled, interrupted, or however
    you prefer to refer to it not completing successfully. That is
    a totally separate thing and I not presume to speak of how its
    done, though in fairness to you you've made it reasonably clear.

    Does that clarify what I have been asking about? I honestly
    thought this would be obvious from context.

    Again, I think you meant that, if an atomic event that sets a
    lock variable fails,

    The event has NO lock variable--it has up to 8 cache line address
    monitors.

    The structure member is literally called `lock` and the context
    is software spinlocks. Note, spinlocks because they're simple
    and easy to discuss, not because they're great mutex primitives.

    As frustrating as this evidently is to you, it strikes me that
    it actually reinforces my point: hardware has no insight into
    what the software is _doing_ at a higher semantic level.

    whether it got far enough along to set the
    variable inside the event or not is not observable if the event
    fails.

    Non-participating cache lines can be used to leak information
    out of the event (like for debug purposes). Participating data
    does not leak.

    But if the event succeeds, then the variable is set and
    that's that. IF that's the case, then great! But if not, then
    disappearing locks are bad.

    It is NOT a FRIGGEN LOCK--

    I am talking about the spinlock. That is a software object, and
    in that context, it is very much called a "lock".

    I understand when you say that it is not a lock, it is not a
    lock in the sense relevant to your world in hardware, but that's
    not what _I'm_ talking about.

    I'm trying to square what you meant when you talked about
    "disappearing locks" with what happens to those software objects
    _after_ an atomic event modifying them successfully completes.

    Evidently you were talking about something else entirely, and
    that's fine. I gather it has nothing to do with _my_ notion of
    locks after they've been successfully acquired; I was just
    trying to confirm that.

    it is a set of address monitors which
    are used to determine the success of the event -- the monitors
    remain invisible to SW just like miss buffers remain invisible
    to SW. The monitors are not preservable across privilege changes.

    See above. Tell me what you want me to call it and I'll use
    that terminology instead.

    This is especially murky to me
    because there seem to be bits of OS primitives, like threads and
    priority, that are known directly to the hardware.

    Just its instantaneous privilege level--which is congruent to
    its register file, Root Pointer, ASID, and operating modes.

    What is "it" here?

    ------------------

    Ok, so perhaps this is a reasonable implementation of spinlock
    acquire, as I wrote to Stefan yesterday:

    void
    spinlock_acquire(volatile Spinlock *lk)
    {
    while (esmLOCKload(&lk->lock) != 0)
    cpu_spin_hint();
    esmLOCKstore(&lk->lock, 1);
    }

    Yes? If this is preempted, or another CPU wins the race on the
    lock, the hardware will back this up and start over in the while
    loop, right?

    Yes.
    ------------
    Remember only the lines participating in the event count. For
    example, say one had sprintf() inside the event. None of the
    lines touched by springt() are participating in the event,
    and are not counted against the 8 cache lines available.

    This doesn't change the point, though. Anything that touches
    more than 8 cache lines is not suitable for an atomic event.

    I see what you are saying, but the event is allowed to have an
    unbounded number of touches to non-participating lines.

    I don't see how that's helpful if I need to touch data in more
    than 8 cache lines atomically.

    Insertion into a doubly linked list is already up to 6 lines,

    I moved a doubly linked element from a CDS and inserted it
    somewhere else doubly-linked in the same CDS in 6 cache lines.
    You need to refine how you count.

    <Tongue-in-cheek mode on>
    Ah, yes. The program that would dump core if you tried to move
    from the head of a list or the rear and that couldn't insert at
    the head pointer was, in fact, called `MoveElement`. My
    mistake, even if it was clear that that's what I was talking
    about, given that quoted where I referred to it in the context
    of work stealing later on. ;-P

    BOOLEAN InsertElement( Element *el, Element *to )
    {
    tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    el->next = tn;
    el->prev = to;
    to->next = el;
    esmLOCKstore( tn->prev, el );
    return TRUE;
    }
    return FALSE;
    }

    If you're going to be pedantic when I slip up and call "move"
    "insert", then let's be pedantic.

    This code won't compile, because of the obvious error of not
    defining `tn`. Even if you fix that, it's not correct if you
    try to insert at the end of the list. It's not clear how you
    would insert at the head of a list, or into an empty list, but
    that depends on how you represent the list itself. ie, the
    head may just be a pointer to the first element, or it might be
    an otherwise unused node; something like this:

    typedef struct List List;
    struct List {
    Element head;
    };

    where the first element in the list is always `head.next`.

    It would help if you used standard types, like `bool`, which has
    been available standard for more than 25 years now since
    introduced in C99 as `_Bool` with accompanying typedefs and
    macros in `stdbool.h` for `bool`, `true` and `false`.

    And doesn't this presume that `el->next`, `el->prev` and `el`
    are all on the same cache line? Similarly with `tn` and
    `tn->prev`?

    You need to refine how you write programs. ;-}

    Here's a better version, that assumes the list representation
    above.

    #include <stddef.h>
    #include <stdbool.h>

    typedef struct Element Element;
    struct Element {
    Element *next;
    Element *prev;
    };

    typedef struct List List;
    struct List {
    Element head;
    };

    Element *
    ListBegin(List *list)
    {
    return list->head.next;
    }

    bool
    InsertElement(Element *el, Element *to)
    {
    Element *tn = esmLOCKload(to->next);
    esmLOCKprefetch(el->next);
    esmLOCKprefetch(el->prev);
    if (tn != NULL)
    esmLOCKprefetch(tn->prev);

    if (!esmINTERFERENCE()) {
    el->next = tn;
    el->prev = to;
    if (tn != NULL)
    tn->prev = el; /* no successor when inserting at the tail */
    esmLOCKstore(to->next, el); /* publish el last; concludes the event */
    return true;
    }
    return false;
    }

    Inserting at the head, or if the list is empty, are both easy
    now: `InsertElement(&list.head, el)`. But note, again, that the
    first element of the list is really `list.head.next`, hence the
    `ListBegin` function.
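    A toy usage, just to show the representation holding together
    (statically allocated so it's self-contained; I'm ignoring the
    interference-failure return here):

        int
        main(void)
        {
                static List list;                /* zero-initialized: empty list */
                static Element e1, e2;

                InsertElement(&list.head, &e1);  /* insert into the empty list   */
                InsertElement(&e1, &e2);         /* insert after e1              */

                return (ListBegin(&list) == &e1 && e1.next == &e2) ? 0 : 1;
        }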

    Here I count up to three, but possibly up to 4, cache lines.

    Note that if you did Not represent the list this way, in order
    to support generalized insertions anywhere in the list, you'd
    either need a circular representation (which is annoying to use
    in practice), or you'd most likely have to support separate
    head and tail pointers and two operations, one to insert before,
    and another after, the 'to' pointer. But then you'd have to be
    prepared to update either the head or tail pointers (or both if
    the list is empty); either way it's potentially two more cache
    lines to account for, though if the list is empty you don't have
    to touch `tn->prev`.

    recall that my original hypothetical also involved popping off a
    queue, which adds another (the queue head). If I threw in
    incrementing a couple of counters I can't do it. And this seems
    like a perfectly reasonable real-world scenario; consider a work
    stealing thread scheduler that takes a task from one core to
    another and keeps some stats on how things move.

    Or maybe even just swapping places between two elements in a
    linked list. For example:

    The example has 3 structures each requiring 3 participating lines.
    {themself, the struct their next pointer points at and the struct
    their prev pointer points at}

    3×3 = 9 and there is no getting around it.

    Yup. So what does the hardware do if I try it?

    -------------------------

    But there isn't a lot of publicly available documentation for
    any of this, at least none that I could find, and from the way
    it's been described thus far it seems like there are a number of
    edge cases that it is not clear how one is supposed to address.
    So I want to understand the boundaries of where these hardware
    facilities are applicable, given architectural limitations. I
    think this thread has shown that those boundaries exist. The
    universe of interesting and useful software is big, and someone,
    somewhere, is going to trip over those limitations sooner or
    later. Once we accept that, we must ask, "Then what?" It seems
    obvious that programmers are going to have to find techniques to
    work around those limitations, and that the mechanisms chosen
    are unknowable to hardware.

    Fair enough.

    But now we're back in the same boat of software having to deal
    with all the thorny problems that these mechanisms were designed
    to address in the first place, like priority inversion: take the
    spinlock from earlier, the hardware has no semantic knowledge
    that the lock is used as a mutex, and can't know; it's just some
    bit in memory somewhere that was set during an atomic event.

    It is not a FRIGGEN lock, it is a set of address monitors which
    guard the instructions in the event.

    Now I have to wonder if you're deliberately ignoring the context.

    I called it a "lock" in the same sentence I said it was a
    "spinlock". Are you seriously going to tell me that you could
    not figure out that the "lock" I was referring to was the
    spinlock? You know that that's a software thing; you yourself
    said that.

    Note the words, "it's just some bit in memory that _was set
    during an atomic event._" Is it not clear that the event is
    over? It's done. It was successful. All of the address
    monitors guarding instructions in the event are no longer
    relevant; they've been retired or whatever.

    Now, some (software) thread holds a (software) (spin)lock.
    I've tried over and over to make this clear, yet you insist on
    going back to talking about what happens _during_ an atomic
    event. I'm talking about _after_ the event completed
    _successfully_.

    A Lock is a SW concept
    which requires a value change to announce to others that "you
    got it".

    That's exactly what `spinlock_acquire` from above _does_, isn't
    it?

    You cannot seriously expect me to accept that you did not
    understand this.

    Maybe ESM helps in a race when trying to acquire the spinlock,
    but once it's acquired, it's acquired, and because the spinlock
    is not a primitive understood by the hardware, the hardware has
    no way of incorporating the spinlock's semantics into its view
    of priority, threading, etc.

    Consider a 1024 core system and the time-quantum goes off and
    every core wants to rotate the thread it is running; putting
    it at the back of the execution queue and taking off the one on
    the front to run. Nick McLaren timed a big SUN server (100 core)
    on this and it could take up to 6 seconds !! for something that
    should take a few milliseconds. This is because of the BigO( n^3 )
    nature of bus traffic wrt SNOOPing on that system.

    This was solved with per-core run queues.

    You're senior enough that I would think it clear that spinlocks
    are just a convenient illustrative example here because they're
    so simple, but their pitfalls as core counts increase are well
    studied and understood. In the real world, mutex structures
    like MCS locks or CLH locks scale approximately linearly with
    core-count under contention. I posted a reference to this
    earlier, but here it is again. You may have a hard time
    accepting it because they call the things that they refer to,
    "locks". ;-}
    https://pdos.csail.mit.edu/papers/linux:lock.pdf

    One can use ESM to convert that particular case to BigO( 3 )
    or in the general case of random insert and removal times:
    BigO( ln( n ) ).

    Modern scheduling algorithms already run in amortized constant
    time on conventional hardware.

    So while an atomic event might help me in the implementation of
    a generic mutex of some kind, so that I can protect critical
    sections that can't be handled with the atomic event framework
    directly because of their size, that's the extent of the help it
    can give me, because once the mutex is acquired the hardware has
    no insight into the fact that it's a mutex.

    Exactly:: once you convert address-based mutual exclusion to
    data-based mutual exclusion, HW no longer sees it at all.

    That's what I've been saying all along.

    So if some higher
    priority thing subsequently comes along and preempts the thread
    holding that lock and tries to take that lock itself, then the
    hardware can't help me out anymore, and I'm back to a
    priority-inversion deadlock that has to be dealt with by
    software.

    HW can only "take" address based locking, not data-based locking.

    Precisely. So it can't help me with priority inversion
    generally.

    The point is, even if the hardware gives me a cool
    mechanism for helping prevent priority inversion problems,
    programmers are inevitably going to have to handle those
    situations themselves anyway.

    Never promised you a rose garden.

    It's ok; I've already been through Parris Island. And OCS at
    Quantico, AFG deployment, la la la. But that's neither here nor
    there. Point is, I don't need your roses, and I probably don't
    want your threads or anything more advanced than things that let
    me build my own synchronization primitives, threads, priority
    abstractions, and so on.

    This is the sort of thing I'm trying to communicate, it's really
    not me sitting here tossing out barbs like, "oh yeah?! Well,
    how about THIS one, smart guy?"

    Look at it this way:: over the decade from 2000-2010 x86 added
    a new ATOMIC instruction every iteration.

    Using ASF or ESM; my architecture would never have to add one.
    SW has the tools to build what it wants/needs.

    Ok, now we're getting somewhere useful. That's handy.

    Look, why don't you put the documentation of this stuff
    somewhere publicly accessible? It'd probably save people like
    me a lot of time, frustration, and dad-level insults. Probably
    save you some frustration as well.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Fri Mar 28 01:47:01 2025
    Somehow the post was pushed before being edited. There seems to be
    some control sequence that finds and pushes the Post-button--this
    is not the only web site this happens to me on.

    On Fri, 28 Mar 2025 1:01:59 +0000, Dan Cross wrote:

    In article <601781e2c91d42a73526562b419fdf02@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    -----------------------

    The property that if an interrupt (or exception or SVC or SVR)
    prevents executing the event to its conclusion, HW makes the
    event look like it never started. So that when/if control returns
    we have another chance to execute the event as-if ATOMICally.

    Ok. So let's be crystal clear here: after the event concludes
    successfully, a subsequent exception, interrupt, or whatever,
    will not magically roll back any of the values set and made
    visible system-wide as a result of the successful conclusion of
    the atomic event. Correct? That's what I've been asking.

    If the esmLOCKstore() retires, the participating memory becomes visible
    at that instant and cannot be subsequently undone.

    If the esmLOCKabandon( ) retires, participating memory reverts to its
    pre-event state in that instant as if the event never happened.
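    So one could, for example, build a non-spinning try-acquire out
    of those two endings. A sketch only, reusing Dan's Spinlock
    struct from earlier; the exact spelling esmLOCKabandon() is just
    the abandonment macro I mentioned, not cast in stone:

        #include <stdbool.h>

        bool
        spinlock_tryacquire(volatile Spinlock *lk)
        {
                if (esmLOCKload(&lk->lock) != 0) {
                        esmLOCKabandon();   /* held: event reverts, nothing visible */
                        return false;
                }
                if (!esmINTERFERENCE()) {
                        esmLOCKstore(&lk->lock, 1);  /* retires: lock visible to all */
                        return true;
                }
                return false;  /* interference: the event has already failed */
        }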


    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Thu Mar 27 21:52:56 2025
    Ok. So let's be crystal clear here: after the event concludes
    successfully, a subsequent exception, interrupt, or whatever,
    will not magically roll back any of the values set and made
    visible system-wide as a result of the successful conclusion of
    the atomic event. Correct? That's what I've been asking.

    I don't really understand how that can be a question. It would
    completely invalidate the notion of the chunk of code being executed *atomically*.

    If a value has been made visible system-wide, it's like sending an
    email: you can't undo that.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Thu Mar 27 22:02:59 2025
    If the esmLOCKstore() retires, the participating memory becomes visible
    at that instant and cannot be subsequently undone.

    If the esmLOCKabandon( ) retires, participating memory reverts to its pre-event state in that instant as if the event never happened.

    Thinking about Haskell's STM (Software Transactional Memory) library:
    do you happen to offer an `esmLOCKretry` which would cause the event to
    be aborted and started over? If you do, does it start over right away
    or can it *wait* until a snoop request comes to invalidate one of the participating lines?
    [ This last functionality is how STM's `retry` works: it watches all
    the memory locations that were consulted during the failed run, knowing
    that as long as they're not modified it's pointless to run the
    transaction again because it would fail in the same way. ]


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Fri Mar 28 01:35:15 2025
    On Fri, 28 Mar 2025 1:01:59 +0000, Dan Cross wrote:

    In article <601781e2c91d42a73526562b419fdf02@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 27 Mar 2025 17:19:21 +0000, Dan Cross wrote:
    In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
    [snip]
    1. `from` may point to a cache line
    2. `to` may point to a different cache line.
    3. Does your architecture use a stack?

    No

    You make multiple references to a stack just below. I
    understand stack references are excluded from the atomic event,
    but I asked whether the architecture uses a stack. Does it, or
    doesn't it?

    What sort of alignment criteria do you impose?

    On the participating data none. On the cache line--cache line.

    I meant for data on the stack. The point is moot, though, since
    as you said stack data does not participate in the atomic event.

    Stack does not NECESSARILY participate. However, if you perform
    an esmLOCKload( SP[23] ) then that line on the stack is participating.

    Very good.

    At first blush, it seems to me that
    the pointers `from_next`, `from_prev`, and `to_next` could be
    on the stack and if so, will be on at least one cache line,
    and possibly 2, if the call frame for `MoveElement` spans
    more than one.

    The illustrated code is using 6 participating cache lines.
    Where local variables are kept (stack, registers) does not
    count against the tally of participating cache lines.

    Huh.

    There are 6 esmLOCKxxxxx() so there are 6 participating lines--
    and these are absolutely independent from where the variables
    are located. The variables carry information but do not participate;
    they carry data between operations on the participating lines.

    That was clearly in the code. That was not what the "Huh" was
    in response to.

    How would this handle something like an MCS lock, where
    the lock node may be on a stack, though accessible globally in
    some virtual address space?

    As illustrated above esmLOCKload( SP[23]) will cause a line on
    the stack to participate in the event.

    Ok, good to know.

    ----------------
    But later you wrote,

    So, if you want the property whereby the lock disappears on any control transfer out of the event {exception, interrupt, SVC, SVR, ...}; then you want to use my ATOMIC stuff; otherwise, you can use the normal ATOMIC primitives everyone and his brother provide.

    Well, what precisely do you mean here when you say, "if you want
    the property whereby the lock disappears on any control transfer
    out of the event"?

    If you want that property--you use the tools at hand.
    If you don't just use them as primitive generators.

    I wasn't clear enough here. I'm asking what, exactly, you mean
    by this _property_,

    The property that if an interrupt (or exception or SVC or SVR)
    prevents executing the event to its conclusion, HW makes the
    event look like it never started. So that when/if control returns
    we have another chance to execute the event as-if ATOMICally.

    Ok. So let's be crystal clear here: after the event concludes
    successfully, a subsequent exception, interrupt, or whatever,
    will not magically roll back any of the values set and made
    visible system-wide as a result of the successful conclusion of
    the atomic event. Correct? That's what I've been asking.

    not what you mean when you write that one
    can use these atomic events if one wants the property. That is,
    I'm asking you to precisely define what it means for a lock to
    "disappear".

    It seems clear enough _now_ that once the event concludes
    successfully the lock value is set to whatever it was set to in
    during the event, but that wasn't clear to me earlier and I
    wanted confirmation.

    Not "the lock value" but "all participating values" are set
    and become visible system-wide in a single instant.

    Sure. But there was only one participating value in the
    spinlock example. ;-}

    No 3rd
    party sees some set and others remain unset. Everybody sees
    all of them change value between this-clock and its successor.

    Sure. You've said this all along, and I've never disputed it.

    In particular, it seems odd to me that one would bother with a
    lock of some kind during an event if it didn't remain set after
    the event.

    Consider CompareTripleSwapFour() as an atomic primitive. How
    would you program this such that nobody nowhere could ever
    tell that the four updated locations changed in any order other
    than simultaneously ??? Without using a LOCK to guard the
    event ??? And without demanding that all four updates are
    to the same cache line ???

    {{Nobody nowhere included ATOMIC-blind DMA requests}}
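
    As a concreteness aid (mine, not from the thread): a minimal sketch of
    how a CompareTripleSwapFour-style primitive might be written as a single
    ESM event, assuming the esmLOCKload/esmLOCKprefetch/esmLOCKstore/
    esmINTERFERENCE() intrinsics used elsewhere in this thread. Three
    locations are compared, four are updated, and the four stores become
    visible in the same instant; at most 7 lines participate, inside the
    8-line limit.

    #include <stdbool.h>
    #include <stdint.h>

    // Hypothetical sketch only; names and signature are illustrative.
    bool
    CompareTripleSwapFour(uint64_t *cmp[3], const uint64_t expect[3],
                          uint64_t *dst[4], const uint64_t repl[4])
    {
        // Touch every participating line before modifying any of them.
        uint64_t seen[3];
        for (int i = 0; i < 3; i++)
            seen[i] = esmLOCKload(*cmp[i]);
        for (int i = 0; i < 4; i++)
            esmLOCKprefetch(*dst[i]);

        if (seen[0] != expect[0] || seen[1] != expect[1] ||
            seen[2] != expect[2] || esmINTERFERENCE())
            return false;                 // event fails; nothing becomes visible

        *dst[0] = repl[0];
        *dst[1] = repl[1];
        *dst[2] = repl[2];
        esmLOCKstore(*dst[3], repl[3]);   // event concludes; all 4 appear at once
        return true;
    }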

    The fundamental issue here is that, when I refer to a "lock" I
    have been talking about a software mutual exclusion primitive,
    and you are referring to something else entirely.

    I would have thought my meaning was clear from context;
    evidently not.

    If you do all of your critical section stuff inside
    of the event, and setting the lock in the event is not visible
    until the event concludes, why bother? If on the other hand you
    use the event to set the lock, why bother doing additional work
    inside the event itself, but in this case I definitely don't
    want something else to come along and just unset the lock on its
    own, higher priority or not.

    It is clear you do not understand the trouble HW takes to implement
    even DCADS where between any 2 clocks, one (or more) SNOOPs can
    take the lines you are DCADSing. It is this property of cache
    coherence that gets in the way of multi-operation ATOMIC events.

    As above, when I refer to "setting the lock" I am referring to
    acquiring a mutex; a software primitive. I am not referring to
    what the hardware does internally. Consider what I wrote above,
    the section that you quoted, in that context.

    In 2004 Fred Weber came to me and asked: "Why can't we (AMD) give
    Microsoft the DCADS they want?" I dug into it, and ASF was my
    solution in x86 ISA; ESM is my solution in RISC-based ISA. It
    removes the need to add ATOMIC primitives to ISA over generations.

    It also provides the ability to do other things--which apparently
    you will not use--sort of like complaining that your new car has
    auto-park, a feature you will never use.

    "It is clear you do not understand the trouble SW takes to
    implement even simple things that are beyond the capability of
    hardware."

    (Back at you. I think I've been pretty polite here; I'm getting
    tired of the barbs.)

    What I _hope_ you mean is that if you transfer "out of the
    event" before the `esmLOCKstore` then anything that's been done
    since the "event" started is rolled back, but NOT if control
    transfer happens _after_ the `esmLOCKstore`. If THAT were the
    case, then this entire mechanism is useless.

    One can use an esmLOCKstore( *SP, 0 ) to abandon an event. HW
    detects that *SP is not participating in the event, and uses
    the lock bit in the instruction to know it is time to leave
    the event in the failed manner. No ST is performed to the
    non-participating cache line. {{I guess I could wrap the
    abandonment into a macro ...}}

    Again, not what I was asking. I'm trying to understand what you
    meant when you talked about disappearing locks.

    If the event cannot be completed due to interrupt or exception or
    SVC or SVR, the HW backs up the event such that it looks like
    it never started. So when control returns, the ATOMIC event runs
    in its entirety ATOMICally !! So, the illusion of ATOMICITY is
    preserved.

    Sigh. You've said this many times over now. I get it. I'm not
    disputing it, nor do I believe I ever have.

    Again, when I said "lock" earlier, I am referring to a mutex
    from the perspective of software. That is, the value of the
    structure member that I called "lock" in the version of the
    `Spinlock` struct that I posted however-many responses ago, and
    that I keep referring to in example code. I am not asking about
    what the hardware does internally. I'm not talking about locked
    bus cycles or whatever. I'm not referring to cache line address
    monitors or any of these other things that you keep referring
    to. I'm talking about the value of the structure member called
    `lock` in the `Spinlock` C structure I posted earlier.

    What I have been asking, over and over again, is whether that
    _value_ "disappears" _after_ an atomic event successfully
    concludes. That is, the value that was written into the memory,
    not whatever internal marker you use in hardware to monitor and
    guard those stores until the atomic event completes or is
    otherwise exited, aborted, cancelled, interrupted, or however
    you prefer to refer to it not completing successfully. That is
    a totally separate thing and I do not presume to speak of how it's
    done, though in fairness to you, you've made it reasonably clear.

    Does that clarify what I have been asking about? I honestly
    thought this would be obvious from context.

    Again, I think you meant that, if an atomic event that sets a
    lock variable fails,

    The event has NO lock variable--it has up to 8 cache line address
    monitors.

    The structure member is literally called `lock` and the context
    is software spinlocks. Note, spinlocks because they're simple
    and easy to discuss, not because they're great mutex primitives.

    As frustrating as this evidently is to you, it strikes me that
    it actually reinforces my point: hardware has no insight into
    what the software is _doing_ at a higher semantic level.

    whether it got far enough along to set the
    variable inside the event or not is not observable if the event
    fails.

    Non-participating cache lines can be used to leak information
    out of the event (like for debug purposes). Participating data
    does not leak.

    But if the event succeeds, then the variable is set and
    that's that. IF that's the case, then great! But if not, then
    disappearing locks are bad.

    It is NOT a FRIGGEN LOCK--

    I am talking about the spinlock. That is a software object, and
    in that context, it is very much called a "lock".

    I understand when you say that it is not a lock, it is not a
    lock in the sense relevant to your world in hardware, but that's
    not what _I'm_ talking about.

    I'm trying to square what you meant when you talked about
    "disappearing locks" with what happens to those software objects
    _after_ an atomic event modifying them successfully completes.

    Evidently you were talking about something else entirely, and
    that's fine. I gather it has nothing to do with _my_ notion of
    locks after they've been successfully acquired; I was just
    trying to confirm that.

    it is a set of address monitors which
    are used to determine the success of the event -- the monitors
    remain invisible to SW just like miss buffers remain invisible
    to SW. The monitors are not preservable across privilege changes.

    See above. Tell me what you want me to call it and I'll use
    that terminology instead.

    This is especially murky to me
    because there seem to be bits of OS primitives, like threads and
    priority, that are known directly to the hardware.

    Just its instantaneous privilege level--which is congruent to
    its register file, Root Pointer, ASID, and operating modes.

    What is "it" here?

    ------------------

    Ok, so perhaps this is a reasonable implementation of spinlock
    acquire, as I wrote to Stefan yesterday:

    void
    spinlock_acquire(volatile Spinlock *lk)
    {
    while (esmLOCKload(&lk->lock) != 0)
    cpu_spin_hint();
    esmLOCKstore(&lk->lock, 1);
    }

    Yes? If this is preempted, or another CPU wins the race on the
    lock, the hardware will back this up and start over in the while
    loop, right?

    Yes.
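
    (For completeness, a sketch of the matching release, assuming the
    `Spinlock` structure with a `lock` member posted earlier in the
    thread: once the acquiring event has retired, the lock word is
    ordinary memory, so releasing it is just a store of zero. Whether a
    plain volatile store provides sufficient ordering depends on the
    memory model assumed here.)

    void
    spinlock_release(volatile Spinlock *lk)
    {
        lk->lock = 0;   // publish the release; no ESM event needed
    }
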
    ------------
    Remember only the lines participating in the event count. For
    example, say one had sprintf() inside the event. None of the
    lines touched by sprintf() are participating in the event,
    and are not counted against the 8 cache lines available.

    This doesn't change the point, though. Anything that touches
    more than 8 cache lines is not suitable for an atomic event.

    I see what you are saying, but the event is allowed to have an
    unbounded number of touches to non-participating lines.

    I don't see how that's helpful if I need to touch data in more
    than 8 cache lines atomically.

    Insertion into a doubly linked list is already up to 6 lines,

    I moved a doubly linked element from a CDS and inserted it
    somewhere else doubly-linked in same CDS in 6 cache lines.
    You need to refine how you count.

    <Tongue-in-cheek mode on>
    Ah, yes. The program that would dump core if you tried to move
    from the head of a list or the rear and that couldn't insert at
    the head pointer was, in fact, called `MoveElement`. My
    mistake, even if it was clear that that's what I was talking
    about, given that you quoted where I referred to it in the context
    of work stealing later on. ;-P

    BOOLEAN InsertElement( Element *el, Element *to )
    {
    tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    el->next = tn;
    el->prev = to;
    to->next = el;
    esmLOCKstore( tn->prev, el );
    return TRUE;
    }
    return FALSE;
    }

    If you're going to be pedantic when I slip up and call "move"
    "insert", then let's be pedantic.

    This code won't compile, because of the obvious error of not
    defining `tn`. Even if you fix that, it's not correct if you
    try to insert at the end of the list. It's not clear now you
    would insert at the head of a list, or into an empty list, but
    that depends on how you represent the list itself. ie, the
    head may just be a pointer to the first element, or it might be
    an otherwise unused node; something like this:

    typedef struct List List;
    struct List {
    Element head;
    };

    where the first element in the list is always `head.next`.

    It would help if you used standard types, like `bool`, which has
    been standard for more than 25 years now, since it was
    introduced in C99 as `_Bool` with accompanying macros in
    `stdbool.h` for `bool`, `true` and `false`.

    And doesn't this presume that `el->next`, `el->prev` and `el`
    are all on the same cache line? Similarly with `tn` and
    `tn->prev`?

    You need to refine how you write programs. ;-}

    Here's a better version, that assumes the list representation
    above.

    #include <stddef.h>
    #include <stdbool.h>

    typedef struct Element Element;
    struct Element {
    Element *next;
    Element *prev;
    };

    typedef struct List List;
    struct List {
    Element head;
    };

    Element *
    ListBegin(List *list)
    {
    return list->head.next;
    }

    bool
    InsertElement(Element *el, Element *to)
    {
    Element *tn = esmLOCKload(to->next);
    esmLOCKprefetch(el->next);
    esmLOCKprefetch(el->prev);
    if (tn != NULL)
    esmLOCKprefetch(tn->prev);

    if (!esmINTERFERENCE()) {
    el->next = tn;
    el->prev = to;
    to->next = el;
    esmLOCKstore(tn->prev, el);
    return true;
    }
    return false;
    }

    Inserting at the head, or if the list is empty, are both easy
    now: `InsertElement(&list.head, el)`. But note, again, that the
    first element of the list is really `list.head.next`, hence the
    `ListBegin` function.

    Here I count up to three, but possibly up to 4, cache lines.

    Note that if you did Not represent the list this way, in order
    to support generalized insertions anywhere in the list, you'd
    either need a circular representation (which is annoying to use
    in practice), or you'd most likely have to support separate
    head and tail pointers and two operations, one to insert before,
    and another after, the 'to' pointer. But then you'd have to be
    prepared to update either the head or tail pointers (or both if
    the list is empty); either way it's potentially two more cache
    lines to account for, though if the list is empty you don't have
    to touch `tn->prev`.

    recall that my original hypothetical also involved popping off a
    queue, which adds another (the queue head). If I threw in
    incrementing a couple of counters, I can't do it. And this seems
    like a perfectly reasonable real-world scenario; consider a work
    stealing thread scheduler that takes a task from one core to
    another and keeps some stats on how things move.

    Or maybe even just swapping places between two elements in a
    linked list. For example:

    The example has 3 structures each requiring 3 participating lines
    {themselves, the struct their next pointer points at, and the
    struct their prev pointer points at}.

    3×3 = 9 and there is no getting around it.

    Yup. So what does the hardware do if I try it?

    -------------------------

    But there isn't a lot of publicly available documentation for
    any of this, at least none that I could find, and from the way
    it's been described thus far it seems like there are a number of
    edge cases that it is not clear how one is supposed to address.
    So I want to understand the boundaries of where these hardware
    facilities are applicable, given architectural limitations. I
    think this thread has shown that those boundaries exist. The
    universe of interesting and useful software is big, and someone,
    somewhere, is going to trip over those limitations sooner or
    later. Once we accept that, we must ask, "Then what?" It seems
    obvious that programmers are going to have to find techniques to
    work around those limitations, and that the mechanisms chosen
    are unknowable to hardware.

    Fair enough.

    But now we're back in the same boat of software having to deal
    with all the thorny problems that these mechanisms were designed
    to address in the first place, like priority inversion: take the
    spinlock from earlier, the hardware has no semantic knowledge
    that the lock is used as a mutex, and can't know; it's just some
    bit in memory somewhere that was set during an atomic event.

    It is not a FRIGGEN lock, it is a set of address monitors which
    guard the instructions in the event.

    Now I have to wonder if you're deliberately ignoring the context.

    I called it a "lock" in the same sentence I said it was a
    "spinlock". Are you seriously going to tell me that you could
    not figure out that the "lock" I was referring to was the
    spinlock? You know that that's a software thing; you yourself
    said that.

    Note the words, "it's just some bit in memory that _was set
    during an atomic event._" Is it not clear that the event is
    over? It's done. It was successful. All of the address
    monitors guarding instructions in the event are no longer
    relevant; they've been retired or whatever.

    Now, some (software) thread holds a (software) (spin)lock.
    I've tried over and over to make this clear, yet you insist on
    going back to talking about what happens _during_ an atomic
    event. I'm talking about _after_ the event completed
    _successfully_.

    A Lock is a SW concept
    which requires a value change to announce to others that "you
    got it".

    That's exactly what `spinlock_acquire` from above _does_, isn't
    it?

    You cannot seriously expect me to accept that you did not
    understand this.

    Maybe ESM helps in a race when trying to acquire the spinlock,
    but once it's acquired, it's acquired, and because the spinlock
    is not a primitive understood by the hardware, the hardware has
    no way of incorporating the spinlock's semantics into its view
    of priority, threading, etc.

    Consider a 1024 core system and the time-quantum goes off and
    every core wants to rotate the thread it is running; putting
    it at the back of the execution queue and taking off the one on
    the front to run. Nick McLaren timed a big SUN server (100 core)
    on this and it could take up to 6 seconds !! for something that
    should take a few milliseconds. This is because of the BigO( n^3 )
    nature of bus traffic wrt SNOOPing on that system.

    This was solved with per-core run queues.

    You're senior enough that I would think it clear that spinlocks
    are just a convenient illustrative example here because they're
    so simple, but their pitfalls as core counts increase are well
    studied and understood. In the real world, mutex structures
    like MCS locks or CLH locks scale approximately linearly with
    core-count under contention. I posted a reference to this
    earlier, but here it is again. You may have a hard time
    accepting it because they call the things that they refer to,
    "locks". ;-}
    https://pdos.csail.mit.edu/papers/linux:lock.pdf
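
    For reference (my sketch, not from the paper): the shape of an
    MCS-style queue lock in plain C11 atomics. The per-thread node can
    live on the acquiring thread's stack, which is the case asked about
    earlier, and each waiter spins only on its own node, which is what
    keeps the coherence traffic roughly constant per acquire as core
    counts grow.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct McsNode McsNode;
    struct McsNode {
        _Atomic(McsNode *) next;
        atomic_bool        locked;
    };

    typedef struct {
        _Atomic(McsNode *) tail;
    } McsLock;

    void
    mcs_acquire(McsLock *lk, McsNode *self)   // self may be a stack variable
    {
        atomic_store_explicit(&self->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&self->locked, true, memory_order_relaxed);

        McsNode *prev = atomic_exchange_explicit(&lk->tail, self,
                                                 memory_order_acq_rel);
        if (prev != NULL) {
            atomic_store_explicit(&prev->next, self, memory_order_release);
            while (atomic_load_explicit(&self->locked, memory_order_acquire))
                ;   // spin only on our own cache line
        }
    }

    void
    mcs_release(McsLock *lk, McsNode *self)
    {
        McsNode *succ = atomic_load_explicit(&self->next, memory_order_acquire);
        if (succ == NULL) {
            McsNode *expected = self;
            if (atomic_compare_exchange_strong_explicit(
                    &lk->tail, &expected, NULL,
                    memory_order_acq_rel, memory_order_acquire))
                return;   // no successor; lock is now free
            do {          // a successor is mid-enqueue; wait for its link
                succ = atomic_load_explicit(&self->next, memory_order_acquire);
            } while (succ == NULL);
        }
        atomic_store_explicit(&succ->locked, false, memory_order_release);
    }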

    One can use ESM to convert that particular case to BigO( 3 )
    or in the general case of random insert and removal times:
    BigO( ln( n ) ).

    Modern scheduling algorithms already run in amortized constant
    time on conventional hardware.

    So while an atomic event might help me in the implementation of
    a generic mutex of some kind, so that I can protect critical
    sections that can't be handled with the atomic event framework
    directly because of their size, that's the extent of the help it
    can give me, because once the mutex is acquired the hardware has
    no insight into the fact that it's a mutex.

    Exactly:: once you convert address-based mutual exclusion to
    data-based mutual exclusion, HW no longer sees it at all.

    That's what I've been saying all along.

    So if some higher
    priority thing subsequently comes along and preempts the thread
    holding that lock and tries to take that lock itself, then the
    hardware can't help me out anymore, and I'm back to a
    priority-inversion deadlock that has to be dealt with by
    software.

    HW can only "take" address based locking, not data-based locking.

    Precisely. So it can't help me with priority inversion
    generally.

    The point is, even if the hardware gives me a cool
    mechanism for helping prevent priority inversion problems,
    programmers are inevitably going to have to handle those
    situations themselves anyway.

    Never promised you a rose garden.

    It's ok; I've already been through Parris Island. And OCS at
    Quantico, AFG deployment, la la la. But that's neither here nor
    there. Point is, I don't need your roses, and I probably don't
    want your threads or anything more advanced than things that let
    me build my own synchronization primitives, threads, priority
    abstractions, and so on.

    This is the sort of thing I'm trying to communicate, it's really
    not me sitting here tossing out barbs like, "oh yeah?! Well,
    how about THIS one, smart guy?"

    Look at it this way:: over the decade from 2000-2010 x86 added
    a new ATOMIC instruction every iteration.

    Using ASF or ESM; my architecture would never have to add one.
    SW has the tools to build what it wants/needs.
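
    A hedged illustration of that claim, using the esmLOCK* intrinsics as
    they appear elsewhere in the thread (function names invented here):
    two of the classic primitives written as single-participating-line
    ESM events, which is why the ISA itself never needs dedicated
    opcodes for them.

    #include <stdint.h>

    // Sketch: atomic fetch-and-add built from an ESM event (1 participating line).
    uint64_t
    fetch_and_add(volatile uint64_t *p, uint64_t delta)
    {
        uint64_t old = esmLOCKload(*p);
        esmLOCKstore(*p, old + delta);   // event concludes here
        return old;
    }

    // Sketch: test-and-set returning the previous value (also 1 line).
    uint64_t
    test_and_set(volatile uint64_t *p)
    {
        uint64_t old = esmLOCKload(*p);
        esmLOCKstore(*p, 1);
        return old;
    }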

    Ok, now we're getting somewhere useful. That's handy.

    Look, why don't you put the documentation of this stuff
    somewhere publicly accessible? It'd probably save people like
    me a lot of time, frustration, and dad-level insults. Probably
    save you some frustration as well.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Fri Mar 28 02:56:26 2025
    In article <7b53bc22cfbacd27f9e735cc42ce908c@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    Somehow the post was pushed before being edited. There seems to be
    some control sequence that finds and pushes the Post-button--this
    is not the only web site this happens to me on.

    Clearly it wasn't part of some ESM atomic event.

    (Jokes aside, it happens. No worries.)

    On Fri, 28 Mar 2025 1:01:59 +0000, Dan Cross wrote:

    In article <601781e2c91d42a73526562b419fdf02@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    -----------------------

    The property that if an interrupt (or exception or SVC or SVR)
    prevents executing the event to its conclusion, HW makes the
    event look like it never started. So that when/if control returns
    we have another chance to execute the event as-if ATOMICally.

    Ok. So let's be crystal clear here: after the event concludes
    successfully, a subsequent exception, interrupt, or whatever,
    will not magically roll back any of the values set and made
    visible system-wide as a result of the successful conclusion of
    the atomic event. Correct? That's what I've been asking.

    If the esmLOCKstore() retires, the participating memory becomes visible
    at that instant and cannot be subsequently undone.

    Ok. That's what I was asking.

    If the esmLOCKabandon( ) retires, participating memory reverts to
    its pre-event state in that instant, as if the event never happened.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to monnier@iro.umontreal.ca on Fri Mar 28 02:58:29 2025
    In article <jwv7c49j540.fsf-monnier+comp.arch@gnu.org>,
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    Ok. So let's be crystal clear here: after the event concludes
    successfully, a subsequent exception, interrupt, or whatever,
    will not magically roll back any of the values set and made
    visible system-wide as a result of the successful conclusion of
    the atomic event. Correct? That's what I've been asking.

    I don't really understand how that can be a question. It would
    completely invalidate the notion of the chunk of code being executed
    *atomically*.

    If a value has been made visible system-wide, it's like sending an
    email: you can't undo that.

    Yup, it sure would. But when Mitch was talking about locks
    disappearing, it wasn't clear (to me anyway) that he was talking
    about something in the hardware, and not the mutexes I was
    trying to build in software; hence why I was asking.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Fri Mar 28 02:53:57 2025
    In article <34434320650f5844b18b1c0b684acf43@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 27 Mar 2025 17:19:21 +0000, Dan Cross wrote:
    In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    -------------------
    Or maybe even just swapping places between two elements in a
    linked list. For example:

    void
    swap_places(Node **head, Node *a, Node *b)
    {
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);

    if (a == b)
    return;

    esmLOCKprefetch(*head);

    This should be a load not prefetch--you want the value of *head

    I wondered about that. I ASSumed that `esmLOCKprefetch` was
    enough to set up a cache line address monitor on `*head`, but if
    you tell me I need `esmLOCKload`, then I need `esmLOCKload`:

    hp = esmLOCKload(*head);
    if (hp == a)
    *head = b; // Note I need to set whatever head points to, not hp.
    else if (hp == b)
    *head = a;

    (You can see a vestige of me trying that earlier by the `hp`
    temporary that I left in when I posted that initially, but I
    didn't like how it looked, so I changed it to what was posted.)

    if (*head == a) // see !
    *head = b;
    else if (*head == b)
    *head = a;

    There is an ESM rule that states:: all participating cache lines
    must be touched before any participating cache lines can be
    modified.

    Are these rules in a reference somewhere I can see? I kinda
    reckoned `esmLOCKprefetch` would be enough to "touch" the line,
    but I fully admit I was just guessing. Looks like `prefetch` is
    mostly for the case where you'll store, but don't need to read?

    Also note: Participating cache lines are checked for write permission
    at touch time, and on cache miss, read with intent to modify.

    Ok.

    an = esmLOCKload(a->next);
    ap = esmLOCKload(a->prev);
    bn = esmLOCKload(b->next);
    bp = esmLOCKload(b->prev);

    b->next = an;
    if (an != NULL) {
    esmLOCKprefetch(an->prev);
    an->prev = b;
    }
    b->prev = ap;
    if (ap != NULL) {
    esmLOCKprefetch(ap->next);
    ap->next = b;
    }

    a->next = bn;
    if (bn != NULL) {
    esmLOCKprefetch(bn->prev);
    bn->prev = a;
    }
    if (bp != NULL) {
    esmLOCKprefetch(bp->next);
    bp->next = a;
    }
    esmLOCKstore(a->prev, bp);
    }

    What I think you want:: (ignoring the 9 participants limit)

    You mean 8 participants limit? That counting thing again. ;-}

    What I really wanted was an example that exceeded that limit as
    an expository vehicle for understanding what happens when the
    limit is exceeded. What does the hardware do in that case?

    void
    swap_places(Node **head, Node *a, Node *b)
    {
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);

    if (a == b)
    return;

    top_of_ATOMIC_event:

    // this is the recovery point if you don't use esmINTERFERENCE()
    // the very next instruction begins the event.

    Yup. Tracking.

    esmLOCKprefetch( an = esmLOCKload(a->next) );
    esmLOCKprefetch( ap = esmLOCKload(a->prev) );
    esmLOCKprefetch( bn = esmLOCKload(b->next) );
    esmLOCKprefetch( bp = esmLOCKload(b->prev) );

    What exactly does this do? This "touches" whatever lines the
    things that `an`, `ap`, `bn`, and `bp` are all in, as well as
    the lines that the `a` and `b` prev and next pointers point
    into as well, right?

    Node *Ehead = esmLOCKload(*head);

    You shoulda just used the `hp` local I left for you for exactly
    this. :-D

    // by placing all the touching before any manifestation, you put
    // all the touch latency* in the code before it has tried to damage any
    // participating memory location. (*) and TLB latency and 2nd party
    // observation of your event.

    // this would be the point where you would insert if( esmINTERFERENCE() )
    // if you wanted control at a known failure point rather than at the
    // top of the event on failure.

    if (Ehead == a)
    *head = b;
    else if (Ehead == b)
    *head = a;

    b->next = an;
    if (an != NULL) {
    an->prev = b;
    }
    b->prev = ap;
    if (ap != NULL) {
    ap->next = b;
    }

    a->next = bn;
    if (bn != NULL) {
    bn->prev = a;
    }
    if (bp != NULL) {
    bp->next = a;
    }
    esmLOCKstore(a->prev, bp);
    }

    // now manifestation has lowest possible latency (as seen by this core alone)

    Doesn't this code now assume that `an->prev` is in the same
    cache line as `an`, and similarly that `bn` is in the same line
    as `bn->prev`? Absent an alignment constraint on the `Node`
    type, that's not guaranteed given the definition posted earlier.

    That may be an odd and ill-advised thing for a programmer to do
    if they want their list type to work with atomic events, but it
    is possible.
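
    One way a programmer could remove that ambiguity (an assumption for
    illustration, not something the thread requires): give `Node` a
    cache-line alignment so that a node's `next` and `prev` always sit
    in the line the node starts on. The 64-byte line size is assumed.

    #include <stdalign.h>

    #define CACHE_LINE 64   // assumed line size

    typedef struct Node Node;
    struct Node {
        alignas(CACHE_LINE) Node *next;   // forces 64-byte alignment of the struct
        Node *prev;                       // first 16 bytes share next's line
        // payload ...
    };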

    The node pointers and their respective `next` pointers are okay,
    so I wonder if perhaps this might have been written as:

    void
    swap_places(Node **head, Node *a, Node *b)
    {
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);

    if (a == b)
    return;

    hp = esmLOCKload(*head);
    esmLOCKprefetch(an = esmLOCKload(a->next));
    ap = esmLOCKload(a->prev);
    esmLOCKprefetch(bn = esmLOCKload(b->next));
    bp = esmLOCKload(b->prev);

    if (an != NULL)
    esmLOCKprefetch(an->prev);
    if (bn != NULL) {
    esmLOCKprefetch(bn->prev);
    bn->prev = a;
    }

    if (hp == a)
    *head = b;
    else if (hp == b)
    *head = a;

    b->next = an;
    if (an != NULL)
    an->prev = b;
    b->prev = ap;
    if (ap != NULL)
    ap->next = b;

    a->next = bn;
    if (bp != NULL)
    bp->next = a;

    esmLOCKstore(a->prev, bp);
    }

    But now the conditional testing whether or not `an` is `NULL` is
    repeated. Is the additional branch overhead worth it here?

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Mar 14 14:52:47 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 23:34:22 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 21:14:08 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:


    Most modern devices advertise the MSI-X capability instead.

    And why not:: it's just a few more flip-flops and almost no more
    sequencing logic.

    I've seen several devices with more than 200 MSI-X vectors;
    that's 96 bits per vector to store a full 64-bit address and
    32-bit data payload.

    At this point, with 200+ entries, flip-flops are not recommended,
    instead these would be placed in a RAM of some sort. Since RAMs
    come in 1KB and 2KB quanta; we have 1 of 2K and 1 of 1K and we
    have 256 said message containers, with 1 cycle access (after
    you get to that corner of some chip).

    That is 200 _per function_. Consider a physical function that
    supports the SRIOV capability and configures 2048 virtual
    functions. So that's 2049 * 200 MSI-X vectors just for one
    device. It's not unusual. These vectors are, of course,
    stored on the device itself. Mostly in RAMs.

    In My 66000, said device can place those value-holding containers
    in actual DRAM should it want to punt the storage.

    Any device can do that today if it is designed to do so. But that
    requires an additional DMA operation to send an MSI-X interrupt;
    the device must first read the address and data fields from
    host dram (as configured by the device driver) before building
    the inbound memory write TLP that gets sent from the device to
    the root port.

    Just adds latency and requires the driver to allocate space and
    tell the device the base address of the array of vectors.
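
    For concreteness, the per-vector layout being discussed (this is the
    standard PCIe MSI-X table entry; the "96 bits per vector" above counts
    only the address and data fields, the full entry being 128 bits):

    #include <stdint.h>

    // One MSI-X table entry as defined by PCIe: 64-bit message address,
    // 32-bit message data, 32-bit vector control (bit 0 = per-vector mask).
    typedef struct {
        uint32_t msg_addr_lo;    // Message Address [31:0]
        uint32_t msg_addr_hi;    // Message Upper Address [63:32]
        uint32_t msg_data;       // Message Data
        uint32_t vector_ctrl;    // bit 0: Mask
    } MsixTableEntry;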


    This adds latency
    and decreases on-die storage. At that point, the device has an
    unlimited number of "special things".

    Most PCIe devices are third party IP, which you must live with.

    Your on-chip PCIe-like devices can do as they wish, unless they're
    standard IP such as a Synopsis SATA controller, or third party
    network controller PCIe endpoint IP.

    I will just note that the high-end SoCs with on-chip PCIe-like
    devices that I'm familiar with all use SRAMs for the MSI-X vectors
    (unless there is only a couple, in which case flops work).

    The latency overhead of fetching the vector from DRAM is
    prohibitive for high-speed devices such as network controllers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Mar 14 15:34:53 2025
    On Fri, 14 Mar 2025 14:52:47 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 23:34:22 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 21:14:08 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:


    Most modern devices advertise the MSI-X capability instead.

    And why not:: it's just a few more flip-flops and almost no more
    sequencing logic.

    I've seen several devices with more than 200 MSI-X vectors;
    that's 96 bits per vector to store a full 64-bit address and
    32-bit data payload.

    At this point, with 200+ entries, flip-flops are not recommended,
    instead these would be placed in a RAM of some sort. Since RAMs
    come in 1KB and 2KB quanta; we have 1 of 2K and 1 of 1K and we
    have 256 said message containers, with 1 cycle access (after
    you get to that corner of some chip).

    That is 200 _per function_. Consider a physical function that
    supports the SRIOV capability and configures 2048 virtual
    functions. So that's 2049 * 200 MSI-X vectors just for one
    device. It's not unusual. These vectors are, of course,
    stored on the device itself. Mostly in RAMs.

    We CPU guys deal with dozens of cores, each having 2×64KB L1 caches,
    a 256KB-1024KB L2 cache, and have that dozen cores share a 16MB L3
    cache. This means the chip contains 26,624 1KB SRAM macros.

    In the above you are complaining that the I/O device can only afford
    a few of these, whereas we CPU guys count them in the thousands
    (and approaching 100's of thousands.)

    You can forgive me if, when the device guys shudder at the thought
    of a dozen 1K SRAM macros as "expensive", I don't immediately see
    their plight.


    In My 66000, said device can place those value-holding containers
    in actual DRAM should it want to punt the storage.

    Any device can do that today if it is designed to do so. But that
    requires an additional DMA operation to send an MSI-X interrupt;
    the device must first read the address and data fields from
    host dram (as configured by the device driver) before building
    the inbound memory write TLP that gets sent from the device to
    the root port.

    Was thinking about this last night::
    a) device goes up and reads DRAM via L3::MC and DRC
    b) DRAM data is delivered to device 15ns later
    c) device uses data to send MSI-X message to interrupt 'controller'
    d) interrupt controller in L3 sees interrupt

    {to hand waving accuracy}
    So, we have dozen ns up the PCIe tree, dozen ns over the interconnect,
    50ns in DRAM, dozens ns over the interconnect, dozens of ns down the
    PCIe tree, 1ns at device, dozen ns up the PCIe tree, dozens across interconnect, arriving at interrupt service port after 122 ns or
    about the equivalent of 600± clocks to log the interrupt into the
    table.

    The Priority broadcast is going to take another dozen ns, core
    request for interrupt will be another dozen to service controller,
    even if the service port request is serviced instantaneously,
    the MSI-X message does not arrive at core until 72ns after arriving
    at service port--for a best case latency on the order of 200 ns
    (or 1000 CPU cycles or ~ 2,000 instructions worth of execution.)

    And that is under the assumption that no traffic interference
    is encountered up or down the PCIe trees.
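
    Making that hand-waving arithmetic explicit (same figures as the
    paragraph above, estimates only, with "dozen" taken as 12 ns):

    // Back-of-envelope from the figures in this post; not measurements.
    enum {
        NS_PCIE_HOP     = 12,   // one traversal of the PCIe tree
        NS_INTERCONNECT = 12,   // one traversal of the on-die fabric
        NS_DRAM         = 50,   // DRAM access for the MSI-X entry
        NS_DEVICE       = 1,    // device turns the data into an MWr TLP
    };

    // read entry: up PCIe, across fabric, DRAM, back across, down PCIe;
    // then deliver: device, up PCIe, across fabric to the service port.
    //   12+12+50+12+12 + 1+12+12 = 123 ns   (~ the 122 ns above)
    // plus ~72 ns from service port to core  =>  ~200 ns best case.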

    whereas::

    if the device DRAM read request was known to contain an MSI-X
    message, then a trip up and down the PCIe tree could be completely
    eliminated (in terms of interrupt latency)--we still have the
    4 round trip latency of the interconnect between service port
    and core.

    Just adds latency and requires the driver to allocate space and
    tell the device the base address of the array of vectors.


    This adds latency
    and decreases on-die storage. At that point, the device has an
    unlimited number of "special things".

    Most PCIe devices are third party IP, which you must live with.

    Captain Obvious strikes again.

    Your on-chip PCIe-like devices can do as they wish, unless they're
    standard IP such as a Synopsis SATA controller, or third party
    network controller PCIe endpoint IP.

    I will just note that the high-end SoCs with on-chip PCIe-like
    devices that I'm familiar with all use SRAMs for the MSI-X vectors
    (unless there is only a couple, in which case flops work).

    That was my originating assumption that you spent the middle of this
    thread dissuading ...

    The latency overhead of fetching the vector from DRAM is
    prohibitive for high-speed devices such as network controllers.

    Here we have the situation where one can context switch in a lower
    number of clock cycles than one can deliver an interrupt from
    a device to a servicing core.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Mar 14 17:35:23 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 14 Mar 2025 14:52:47 +0000, Scott Lurndal wrote:



    We CPU guys deal with dozens of cores, each having 2×64KB L1 caches,
    a 256KB-1024KB L2 cache, and have that dozen cores share a 16MB L3
    cache. This means the chip contains 26,624 1KB SRAM macros.

    You've a lot more area to work with, and generally a more
    recent process node.



    Was thinking about this last night::
    a) device goes up and reads DRAM via L3::MC and DRC
    b) DRAM data is delivered to device 15ns later

    15ns? That's optimistic and presumes a cache hit, right?

    Don't forget to factor in PCIe latency (bus to RC and RC to endpoint).

    c) device uses data to send MSI-X message to interrupt 'controller'
    d) interrupt controller in L3 sees interrupt

    {to hand waving accuracy}
    So, we have dozen ns up the PCIe tree, dozen ns over the interconnect,
    50ns in DRAM, dozens ns over the interconnect, dozens of ns down the
    PCIe tree, 1ns at device, dozen ns up the PCIe tree, dozens across
    interconnect, arriving at interrupt service port after 122 ns or
    about the equivalent of 600± clocks to log the interrupt into the
    table.

    The Priority broadcast is going to take another dozen ns, core
    request for interrupt will be another dozen to service controller,
    even if the service port request is serviced instantaneously,
    the MSI-X message does not arrive at core until 72ns after arriving
    at service port--for a best case latency on the order of 200 ns
    (or 1000 CPU cycles or ~ 2,000 instructions worth of execution.)

    And that is under the assumption that no traffic interference
    is encountered up or down the PCIe trees.

    whereas::

    if the device DRAM read request was known to contain an MSI-X
    message,

    You can't know that a priori, it's just another memory write
    (or read if you need to fetch the address and data from DRAM)
    TLP as part of the inbound DMA. Which needs to hit the IOMMU
    first to translate the PCI memory space address to the host
    physical address space address.

    If the MSI-X tables were kept in DRAM, you also need to include
    the IOMMU translation latency in the inbound path that fetches
    the vector address and vector data (96-bits, so that's two
    round trips from the device to memory). For a virtual function,
    the MSI-X table is owned and managed by the guest, and all
    transaction addresses from the device must be translated from
    guest physical addresses to host physical addresses.

    A miss in the IOMMU adds a _lot_ of latency to the request.

    So, that's three round trips from the device to the
    Uncore/RoC just to send a single interrupt from the device.


    The latency overhead of fetching the vector from DRAM is
    prohibitive for high-speed devices such as network controllers.

    Here we have the situation where one can context switch in a lower
    number of clock cycles than one can deliver an interrupt from
    a device to a servicing core.

    Device needs to send an interrupt when vectors stored in host DRAM
    instead of internal SRAM or flops:

    - send non-posted MRD TLP to RC to fetch MSI-X address
    - receiver (pcie controller (RC), for example) passes
    MRD address to IOMMU for translation (assuming
    the device and host don't implement ATS),
    IOMMU translates (table walk latency) the
    address from the TLP to a host physical
    address (which could involve two levels of
    translation, so up to 22 DRAM accesses (intel/amd/aarch64)
    on IOMMU TLB miss). The latency is dependent
    upon the IOMMU table format - Intel has EPT
    while ARM and AMD use the same format as the CPU
    page tables for the IOMMU tables.
    (this leaves out any further latency hit when
    using the PCI Page Request Interface (PRI) to make
    the target page resident).
    - LLC/DRAM satisfies the MRD and returns data to
    PCIe controller, which sends a completion TLP
    to device. LLC (minimum), DRAM (maximum) latency added.
    - RC/host sends response with address to device
    - Device sends non-posted MRD TLP to RC to fetch MSI-X Data
    (32-bit). Again with the IOMMU, but will likely
    hit TLB. Lesser latency than a miss, but nonzero.
    - RC returns completion TLP to device.
    - Device sends MWR TLP (data payload) with the translated
    address to the root complex, which passes it to the
    internal bus structure for routing to the final address
    (interrupt controller).
    - Then add the latency from the interrupt controller to
    the target core (which may include making the target guest
    resident).

    That's a whole pile of latency to send an interrupt.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Mar 14 18:42:18 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 14 Mar 2025 17:35:23 +0000, Scott Lurndal wrote:


    Here we have the situation where one can context switch in a lower
    number of clock cycles than one can deliver an interrupt from
    a device to a servicing core.

    Device needs to send an interrupt when vectors stored in host DRAM
    instead of internal SRAM or flops:

    - send non-posted MRD TLP to RC to fetch MSI-X address
    - receiver (pcie controller (RC), for example) passes
    MRD address to IOMMU for translation (assuming
    the device and host don't implement ATS),
    IOMMU translates (table walk latency) the
    address from the TLP to a host physical
    address (which could involve two levels of
    translation, so up to 22 DRAM accesses (intel/amd/aarch64)
    on IOMMU TLB miss). The latency is dependent
    up on the IOMMU table format - Intel has EPT
    while ARM and AMD use the same format as the CPU
    page tables for the IOMMU tables.
    (this leaves out any further latency hit when
    using the PCI Page Request Interface (PRI) to make
    the target page resident).
    - LLC/DRAM satisfies the MRD and returns data to
    PCIe controller, which sends a completion TLP
    to device. LLC (minimum), DRAM (maximum) latency added.
    - RC/host sends response with address to device
    - Device sends non-posted MRD TLP to RC to fetch MSI-X Data
    (32-bit). Again with the IOMMU, but will likely
    hit TLB. Lesser latency than a miss, but nonzero.
    - RC returns completion TLP to device.
    - Device sends MWR TLP (data payload) with the translated
    address to the root complex, which passes it to the
    internal bus structure for routing to the final address
    (interrupt controller).
    - Then add the latency from the interrupt controller to
    the target core (which may include making the target guest
    resident).

    That's a whole pile of latency to send an interrupt.

    I bet the MSI-X messages would cache on the device rather well ...
    as they change roughly at the rate of VM creation.

    If the device supports ATS, yes, it can cache the
    translations (ATS provides a mechanism for the host
    to invalidate cache entries on the device).

    I haven't seen many that do support ATS, however, most
    rely on the host IOMMU to cache the translations, rather
    than caching them on the device (so they don't have to
    implement the invalidation protocols inherent in ATS).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Mar 14 18:12:23 2025
    On Fri, 14 Mar 2025 17:35:23 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 14 Mar 2025 14:52:47 +0000, Scott Lurndal wrote:



    We CPU guys deal with dozens of cores, each having 2×64KB L1 caches,
    a 256KB-1024KB L2 cache, and have that dozen cores share a 16MB L3
    cache. This means the chip contains 26,624 1KB SRAM macros.

    You've a lot more area to work with, and generally a more
    recent process node.



    Was thinking about this last night::
    a) device goes up and reads DRAM via L3::MC and DRC
    b) DRAM data is delivered to device 15ns later

    15ns? That's optimistic and presumes a cache hit, right?

    See the paragraph below {to hand waving accuracy} for a more
    reasonable guestimate of 122ns from device back to device
    just reading the MSI-X message and address.

    Don't forget to factor in PCIe latency (bus to RC and RC to endpoint).

    c) device uses data to send MSI-X message to interrupt 'controller'
    d) interrupt controller in L3 sees interrupt

    {to hand waving accuracy}
    So, we have dozen ns up the PCIe tree, dozen ns over the interconnect,
    50ns in DRAM, dozens ns over the interconnect, dozens of ns down the
    PCIe tree, 1ns at device, dozen ns up the PCIe tree, dozens across
    interconnect, arriving at interrupt service port after 122 ns or
    about the equivalent of 600± clocks to log the interrupt into the
    table.

    The Priority broadcast is going to take another dozen ns, core
    request for interrupt will be another dozen to service controller,
    even if the service port request is serviced instantaneously,
    the MSI-X message does not arrive at core until 72ns after arriving
    at service port--for a best case latency on the order of 200 ns
    (or 1000 CPU cycles or ~ 2,000 instructions worth of execution.)

    And that is under the assumption that no traffic interference
    is encountered up or down the PCIe trees.

    whereas::

    if the device DRAM read request was known to contain an MSI-X
    message,

    You can't know that a priori,

    yes, I know that:: but if you c o u l d . . . you could save roughly
    1/2 of the interrupt delivery to core latency.

    it's just another memory write
    (or read if you need to fetch the address and data from DRAM)
    TLP as part of the inbound DMA. Which needs to hit the IOMMU
    first to translate the PCI memory space address to the host
    physical address space address.

    If the MSI-X tables were kept in DRAM, you also need to include
    the IOMMU translation latency in the inbound path that fetches
    the vector address and vector data (96-bits, so that's two
    round trips from the device to memory). For a virtual function,
    the MSI-X table is owned and managed by the guest, and all
    transaction addresses from the device must be translated from
    guest physical addresses to host physical addresses.

    A miss in the IOMMU adds a _lot_ of latency to the request.

    So, that's three round trips from the device to the
    Uncore/RoC just to send a single interrupt from the device.

    3 dozen-ns traversals, not counting actual memory access time.
    Then another dozen ns traversal and enQueueing in the interrupt
    table. Then 3 round dozen-ns trips on the on-die interconnect.

    It all adds up.


    The latency overhead of fetching the vector from DRAM is
    prohibitive for high-speed devices such as network controllers.

    Here we have the situation where one can context switch in a lower
    number of clock cycles than one can deliver an interrupt from
    a device to a servicing core.

    Device needs to send an interrupt when vectors stored in host DRAM
    instead of internal SRAM or flops:

    - send non-posted MRD TLP to RC to fetch MSI-X address
    - receiver (pcie controller (RC), for example) passes
    MRD address to IOMMU for translation (assuming
    the device and host don't implement ATS),
    IOMMU translates (table walk latency) the
    address from the TLP to a host physical
    address (which could involve two levels of
    translation, so up to 22 DRAM accesses (intel/amd/aarch64)
    on IOMMU TLB miss). The latency is dependent
    up on the IOMMU table format - Intel has EPT
    while ARM and AMD use the same format as the CPU
    page tables for the IOMMU tables.
    (this leaves out any further latency hit when
    using the PCI Page Request Interface (PRI) to make
    the target page resident).
    - LLC/DRAM satisfies the MRD and returns data to
    PCIe controller, which sends a completion TLP
    to device. LLC (minimum), DRAM (maximum) latency added.
    - RC/host sends response with address to device
    - Device sends non-posted MRD TLP to RC to fetch MSI-X Data
    (32-bit). Again with the IOMMU, but will likely
    hit TLB. Lesser latency than a miss, but nonzero.
    - RC returns completion TLP to device.
    - Device sends MWR TLP (data payload) with the translated
    address to the root complex, which passes it to the
    internal bus structure for routing to the final address
    (interrupt controller).
    - Then add the latency from the interrupt controller to
    the target core (which may include making the target guest
    resident).

    That's a whole pile of latency to send an interrupt.

    I bet the MSI-X messages would cache on the device rather well ...
    as they change roughly at the rate of VM creation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Mar 28 11:41:34 2025
    Generally when I write queue-code, I use a dummy Node front/rear
    such that the checks for Null are unnecessary (at the cost of
    following 1 more ->next or ->prev). That is Q->head and Q->tail
    are never NULL and when the queue is empty there is a Node which
    carries the fact the queue is empty (not using NULL as a pointer).

    It's called a "sentinel".


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Fri Mar 28 15:24:50 2025
    On Fri, 28 Mar 2025 2:53:57 +0000, Dan Cross wrote:

    In article <34434320650f5844b18b1c0b684acf43@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 27 Mar 2025 17:19:21 +0000, Dan Cross wrote:
    In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    -------------------
    Or maybe even just swapping places between two elements in a
    linked list. For example:

    void
    swap_places(Node **head, Node *a, Node *b)
    {
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);

    if (a == b)
    return;

    esmLOCKprefetch(*head);

    This should be a load not prefetch--you want the value of *head

    I wondered about that. I ASSumed that `esmLOCKprefetch` was
    enough to set up a cache line address monitor on `*head`, but if
    you tell me I need `esmLOCKload`, then I need `esmLOCKload`:

    hp = esmLOCKload(*head);
    if (hp == a)
    *head = b; // Note I need to set whatever head points to, not hp.
    else if (hp == b)
    *head = a;

    (You can see a vestige of me trying that earlier by the `hp`
    temporary that I left in when I posted that initially, but I
    didn't like how it looked, so I changed it to what was posted.)

    if (*head == a) // see !
    *head = b;
    else if (*head == b)
    *head = a;

    There is an ESM rule that states:: all participating cache lines
    must be touched before any participating cache lines can be
    modified.

    Are these rules in a reference somewhere I can see? I kinda
    reckoned `esmLOCKprefetch` would be enough to "touch" the line,
    but I fully admit I was just guessing. Looks like `prefetch` is
    mostly for the case where you'll store, but don't need to read?

    Also note: Participating cache lines are checked for write permission
    at touch time, and on cache miss, read with intent to modify.

    Ok.

    an = esmLOCKload(a->next);
    ap = esmLOCKload(a->prev);
    bn = esmLOCKload(b->next);
    bp = esmLOCKload(b->prev);

    b->next = an;
    if (an != NULL) {
    esmLOCKprefetch(an->prev);
    an->prev = b;
    }
    b->prev = ap;
    if (ap != NULL) {
    esmLOCKprefetch(ap->next);
    ap->next = b;
    }

    a->next = bn;
    if (bn != NULL) {
    esmLOCKprefetch(bn->prev);
    bn->prev = a;
    }
    if (bp != NULL) {
    esmLOCKprefetch(bp->next);
    bp->next = a;
    }
    esmLOCKstore(a->prev, bp);
    }

    What I think you want:: (ignoring the 9 participants limit)

    You mean 8 participants limit? That counting thing again. ;-}

    What I really wanted was an example that exceeded that limit as
    an expository vehicle for understanding what happens when the
    limit is exceeded. What does the hardware do in that case?

    Raise OPERATION (instruction) Fault.

    void
    swap_places(Node **head, Node *a, Node *b)
    {
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);

    if (a == b)
    return;

    top_of_ATOMIC_event:

    // this is the recovery point if you don't use esmINTERFERENCE()
    // the very next instruction begins the event.

    Yup. Tracking.

    esmLOCKprefetch( an = esmLOCKload(a->next) );
    esmLOCKprefetch( ap = esmLOCKload(a->prev) );
    esmLOCKprefetch( bn = esmLOCKload(b->next) );
    esmLOCKprefetch( bp = esmLOCKload(b->prev) );

    What exactly does this do? This "touches" whatever lines the
    things that `an`, `ap`, `bn`, and `bp` are all in, as well as
    the lines that the `a` and `b` prev and next pointers point
    into as well, right?

    Node *Ehead = esmLOCKload(*head);

    You shoulda just used the `hp` local I left for you for exactly
    this. :-D

    // by placing all the touching before any manifestation, you put
    // all the touch latency* in the code before it has tried to damage any
    // participating memory location. (*) and TLB latency and 2nd party
    // observation of your event.

    // this would be the point where you would insert if( esmINTERFERENCE() )
    // if you wanted control at a known failure point rather than at the
    // top of the event on failure.

    if (Ehead == a)
    *head = b;
    else if (Ehead == b)
    *head = a;

    b->next = an;
    if (an != NULL) {
    an->prev = b;
    }
    b->prev = ap;
    if (ap != NULL) {
    ap->next = b;
    }

    a->next = bn;
    if (bn != NULL) {
    bn->prev = a;
    }
    if (bp != NULL) {
    bp->next = a;
    }
    esmLOCKstore(a->prev, bp);
    }

    // now manifestation has lowest possible latency (as seen by this core alone)

    Doesn't this code now assume that `an->prev` is in the same
    cache line as `an`, and similarly that `bn` is in the same line
    as `bn->prev`? Absent an alignment constraint on the `Node`
    type, that's not guaranteed given the definition posted earlier.

    Everything you state is true, I just tried to move the esmLOCKxxxx
    up to the setup phase to obey ESM rules.

    That may be an odd and ill-advised thing for a programmer to do
    if they want their list type to work with atomic events, but it
    is possible.

    The node pointers and their respective `next` pointers are okay,
    so I wonder if perhaps this might have been written as:

    void
    swap_places(Node **head, Node *a, Node *b)
    {
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);

    if (a == b)
    return;

    hp = esmLOCKload(*head);
    esmLOCKprefetch(an = esmLOCKload(a->next));
    ap = esmLOCKload(a->prev);
    esmLOCKprefetch(bn = esmLOCKload(b->next));
    bp = esmLOCKload(b->prev);

    if (an != NULL) // I see what you did
    esmLOCKprefetch(an->prev);
    if (bn != NULL) {
    esmLOCKprefetch(bn->prev);
    bn->prev = a;
    }

    if (hp == a)
    *head = b;
    else if (hp == b)
    *head = a;

    b->next = an;
    if (an != NULL)
    an->prev = b;
    b->prev = ap;
    if (ap != NULL)
    ap->next = b;
    // illustrative code
    a->next = bn; // ST Rbn,[Ra,#next]
    if (bp != NULL) // PNE0 Rbp,T
    bp->next = a; // ST Ra,[Rbp,#next]

    esmLOCKstore(a->prev, bp);
    }

    But now the conditional testing whether or not `an` is `NULL` is
    repeated. Is the additional branch overhead worth it here?

    In My 66000 ISA, a compare against zero (or NULL) is just a branch
    instruction, so the CMP zero is performed twice, but each use is
    but a single Branch-on-condition instruction (or Predicate-on-
    Condition instruction).

    In the case of using predicates, FETCH/DECODE will simply issue
    both then and else clauses into the execution window (else-clause
    is empty here) and let the reservation stations handle execution
    order. And the condition latency is purely the register dependence
    chain. A 6-wide machine should have no trouble in inserting two
    copies of the code commented by "illustrative code" above--in
    this case 6-instructions or 2 sets of {ST, PNE0, ST}.

    In the case of using a real branch, latency per use is likely to
    be 2-cycles, moderated by typical branch prediction. The condition
    will have resolved early, so we are simply FETCH/DECODE/TAKE bound.

    {{That is: PRED should be faster in almost all conceivable cases.}}

    - Dan C.

    Generally when I write queue-code, I use a dummy Node front/rear
    such that the checks for Null are unnecessary (at the cost of
    following 1 more ->next or ->prev). That is Q->head and Q->tail
    are never NULL and when the queue is empty there is a Node which
    carries the fact the queue is empty (not using NULL as a pointer).

    But that is just my style.
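
To make the style concrete, here is a minimal sketch of such a sentinel-based
queue; the Node/Queue layout and function names are illustrative assumptions,
not Mitch's actual code:

    /*
     * Dummy front/rear sentinel nodes: Q->head and Q->tail are never NULL,
     * so enqueue/dequeue never test internal pointers against NULL
     * (at the cost of following one extra ->next or ->prev).
     */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Node Node;
    struct Node {
        Node *next;
        Node *prev;
        int   payload;              /* illustrative */
    };

    typedef struct {
        Node head;                  /* dummy front sentinel */
        Node tail;                  /* dummy rear sentinel  */
    } Queue;

    static void
    queue_init(Queue *q)
    {
        q->head.prev = NULL;
        q->head.next = &q->tail;
        q->tail.prev = &q->head;
        q->tail.next = NULL;
    }

    static bool
    queue_empty(const Queue *q)
    {
        /* empty when the two sentinels point at each other */
        return q->head.next == &q->tail;
    }

    static void
    enqueue(Queue *q, Node *n)
    {
        /* insert just in front of the rear sentinel: no NULL checks needed */
        n->next = &q->tail;
        n->prev = q->tail.prev;
        q->tail.prev->next = n;
        q->tail.prev = n;
    }

    static Node *
    dequeue(Queue *q)
    {
        if (queue_empty(q))
            return NULL;            /* caller-visible "empty", not a list link */
        Node *n = q->head.next;
        q->head.next = n->next;
        n->next->prev = &q->head;
        return n;
    }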

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Dan Cross on Fri Mar 28 14:40:42 2025
    Dan Cross wrote:
    In article <8uzEP.115550$Xq5f.66883@fx38.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <fe9715fa347144df1e584463375107cf@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 20 Mar 2025 12:44:08 +0000, Dan Cross wrote:
    Sometimes you really don't want to be interrupted.
    And sometimes you don't want to be interrupted unless the
    "house is on fire"; I cannot see a time when "the house is
    on fire" that you would not want to take the interrupt.

    Is there one ?!?
    Consider a thread that takes a spinlock; suppose some
    high-priority interrupt comes in while the thread is holding
    that lock. In response to the interrupt, software decides to
    suspend the thread and switch some other thread; that thread
    wants to lock the spin lock that the now-descheduled thread is
    holding: a classic deadlock scenario.
    Terminology: mutexes coordinate mutual exclusion between threads,
    spinlocks coordinate mutual exclusion between cpu cores.
    Windows "critical sections" are mutexes with a fast path.
    A spin lock is simply any lock where you spin trying to acquire
    the lock, as opposed to a blocking synchronization protocol.
    Here I'm using the terminology of Herlihy and Shavit [Her08].
    I'm using terminology in common use on multiprocessor systems like
    VMS since 1980's and later WinNT.

    Each OS data structure has different reentrancy requirements.
    - Ones accessed by threads are guarded by mutexes.
Example on Windows would be the page tables as paging is done by threads.
- Ones accessed by OS software interrupt level are guarded by spinlocks.
    Example on Windows is the Scheduler tables.
    - Ones accessed by HW interrupts are guarded by interrupt spinlocks.
    Example on Windows is device interrupt service routines.

    I'm not sure what you're saying here, or rather, how it is
    relevant.

    Your earlier statement was that "mutexes coordinate mutual
    exclusion between threads, spinlocks coordinate mutual exclusion
    between cpu cores." But this is not the case.

    Taking VMS as an example, from [Gol94], sec 9.5 (page 255):

    |A thread of execution running on one processor acquires a
    |spinlock to serialize access to the data with threads of
    |execution running on other processors. Before acquiring
    |the spinlock, the thread of executions raises IPL to block
    |accesses by other threads of execution running on the same
    |processor. The IPL value is determined by the spinlock being
    |locked.

    Goldberg and Saravanan seems quite explicit that threads can
    acquire and hold spinlocks on OpenVMS, which seems to be
    something that you were objecting to?

    It took me a while to find copy of that book online.
    That 1994 book relates to VMS before they added kernel threads to the OS.
    At that point VMS scheduled processes, an execution context plus virtual
    space, like most *nix (except I think Solaris which had kernel threads).

    It says on pg 3: "The term thread of execution as used in this book
    refers to a computational entity, an agent of execution."
    This refers generally to the various code sequences from different context
    that may be intertwined on a processors core for interrupts, exceptions,
    Dpc (called driver forks in VMS), process AST's (software interrupts),
    and applications.

    In a later book on VMS after they added kernel scheduled threads they consistently use the two terms "thread of execution" and "kernel thread"
    to distinguish the two meanings.

I find the phrase "thread of execution" ambiguous and try not to use it.
    After the term thread had been pretty much adopted to mean kernel thread
    as an OS scheduled execution context, I would have preferred if they used something like "strand of execution" to avoid the overloading.

    Turning to Windows, [Rus12] says about spin locks on page 179:

    |Testing and acquiring the lock in one instruction prevents a
    |second thread from grabbing the lock between the time the first
    |thread tests the variable and the time it acquires the lock.

    Note that, again, the language here is thread-centric.

    He says on pg 12: "A thread is the entity within a process that
    Windows schedules for execution." meaning "kernel thread".

He also uses the term thread loosely as you can see on pg 13 when he says: "Fibers allow an application to schedule its own "threads" of execution"

    He also says pg 87 (to paraphrase) that scheduling priority is an
    attribute of a thread whereas IRQL is an attribute of a processor.
    Which is the distinction I am trying to make: when you raise IRQL
    it stops being a thread executing and starts being a processor core.

    By the way, regarding terminology, on page 177 [Rus12] also
    says,

    |Sections of code that access a nonshareable resource are called
    |_critical_ sections. To ensure correct code, only one thread
    |at a time can execute in a critical section.

    This is the generic definition of a critical section and note
    that this is the context in which we are using the term
    throughout this thread.

    Yes, I was trying to note the ambiguous terminology so we could avoid it.

    That page is referring to kernel threads as scheduled entities
    because he says "if the operating system performed a context
    switch to the second thread...". In a Windows context the term
    "critical section" means a specific Win32 user mode mechanism
    used to synchronize access for scheduled threads.

    As opposed to books on multi-processors often use the term
    "critical section" in a generic way to mean a shared section of memory
    accessed by multiple processors, guarded by some form of mutual exclusion, which could apply to shared memory guarded by OS spinlocks
    but equally to shared application memory with its own synchronization.

    What you referred to earlier as a "critical section" in Windows
    is a _userspace_ synchronization primitive with a kernel assist
    and mostly unrelated to concerns around interruptibility. From
    the description on page 201 of [Rus12], they seem similar to
    futexes on Linux, though somewhat more limited: Russinovich et
    al states that they cannot be used between processes, while a
    Linux futex can if it's in a region of memory shared between
    cooperating processes.

    Yes.

    Traditional Unix "sleep" and "wakeup" are an example of a
    blocking protocol, where a thread may "sleep" on some "channel",
    yielding the locking thread while the lock cannot be acquired,
    presumably scheduling something else until the thread is later
marked runnable by virtue of something else calling "wakeup" on
    the sleep channel.
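
To make that concrete, a rough sketch of the classic pattern; sleep() and
wakeup() here stand in for the traditional Unix kernel primitives (exact
signatures varied by version), and struct buf is an illustrative shared
resource that doubles as the sleep channel:

    struct buf {
        int busy;                   /* 1 while some thread owns the buffer */
        /* ... */
    };

    void sleep(void *chan);         /* assumed primitive: block until wakeup(chan) */
    void wakeup(void *chan);        /* assumed primitive: mark sleepers on chan runnable */

    void
    acquire_buf(struct buf *bp)
    {
        while (bp->busy)
            sleep(bp);              /* yield the CPU; scheduler runs something else */
        bp->busy = 1;
    }

    void
    release_buf(struct buf *bp)
    {
        bp->busy = 0;
        wakeup(bp);                 /* let any waiters retry the busy check */
    }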

    But note that I am deliberately simplifying in order to
    construct a particular scenario in which a light-weight
    synchronization primitive is being used for mutual exclusion
    between concurrent entities, hence mentioning spinlocks
    specifically.

Suggesting that spin locks are only applicable to
    multiprocessing scenarios is flatly incorrect.
    I didn't.

Your exact words were, "spinlocks coordinate mutual exclusion
    between cpu cores." See the text quoted above. This is
incorrect in that it ignores that threads are, in context, the
    software primitive for the computational entities that are doing
    the acquiring, and while the "between cpu core" part is the
    usual case, it is not the only case.

    I believe I'm being consistent and precise.
However, when a *core's* interrupt level gets raised >= Dispatch level, either by
an interrupt preempting a lower priority strand of execution, or by a kernel thread making a SysCall that shuts off the scheduler, then at that point the core is no longer executing as a kernel thread - it is executing as a processor core.

    If it then acquires an OS spinlock, it is synchronizing between cores.

    I am saying the OS by design guards *ITS* different data structures
    with different mechanisms, and those mechanism must only be used in
    a very controlled manner. And if you want to do something that requires
    access to one of *ITS* data structures then you must obey *ITS* rules.

    This is axiomatic.

    But I fail to see how it's relevant to the topic at hand. In
    context we're discussing scenarios where a system may want to
    disable interrupts entirely, and whether a particular aspect of
    a hardware design is sufficient to implement critical sections
    generally.

    You seemed to be switching between talking about general spinlocks as
    "any code that spin-waits" whereas I am just referring to OS spinlocks executing at a specific raised IRQL, guarding shared OS data structures.

In Windows and VMS, and from what I understand of *nix, most of the core OS
data structures are guarded by spinlocks acquired at the software
interrupt level. If you are not at that level then the spinlock will not
provide the necessary synchronization for access to that data.

    Precisely.

Yeah. We are synchronized!

    Software may use
    the technique to spin on a "busy" bit on some IP in a device for
    example, even on a uniprocessor system.

    A valid response here might be, "don't context switch from the
    interrupt handler; use a DPC instead". That may be valid, but
it puts a constraint on the software designer that may be onerous
    in practice: suppose the interrupt happens in the scheduler,
    while examining a run queue or something. A DPC object must be
    available, etc.
That is exactly what DPC/SoftIrq are designed to do - allow the bulk of
    the kernel, like the scheduler, non-paged heap management, IO pre and
    post processing, to be interrupted without causing reentrancy.
The higher level device interrupt enqueues a request to the lower software
interrupt level, which is processed when it will not cause reentrancy.
    That constraint is by design and any good OS will crash if violated.
    Yes, I'm aware of the technique. That wasn't the point.

    Further, software must now consider the complexity of
    potentially interruptable critical sections. From the
    standpoint of reasoning about already-complex concurrency issues
    it's simpler to be able to assert that (almost) all interrupt
    delivery can be cheaply disabled entirely, save for very
    special, specific, scenarios like NMIs. Potentially switching
    away from a thread holding a spinlock sort of defeats the
    purpose of a spinlock in the first place, which is a mutex
    primitive designed to avoid the overhead of switching.

    (It occurs to me that perhaps this is why you are going on about
    how VMS and windows uses spin locks, and how they interact with
    interrupt priority levels.)

    I find that because WinNT was a clean-sheet design, much of its terminology
    for things like threads, processes, mutexes, Dpc's, is less ambiguous, particularly when talking about interrupts and scheduling.

    Whereas with VMS and *nix that terminology can get messy because they
    started out with "process" as the unit of scheduled execution and later
    glued threads on, and then Unix glued threads on as a new flavor of forking that doesn't really fork, and Linux with its interrupt handling history of
bottom halves vs tasklets vs SoftIrqs.

    This is why I mentioned the terminology thing: threads do not hold
    spinlocks, they hold mutexes.
    See above. Threads can certainly "hold" a spin lock, as they
    can hold any kind of lock. To quote from sec 7.6.1 of [Val96],
    page 202:
    When a particular OS defined spinlock is used to guard a particular OS
    defined data structure, exactly WHEN the OS allows this decides how
    the OS avoids all the nasty interrupt pitfalls you identified.

    How is this relevant to what you wrote earlier and my response,
    quoted just above? My statement was that threads can hold spin
    locks. This is obviously true.

If you mean a kernel thread doing its own spin-wait, sure. But that is a "wait" the OS knows nothing about. As far as the OS is concerned that thread is
just executing its application.

    Whereas an OS spinlock requires IRQL >= Dispatch, at which point it
    is no longer executing as a kernel thread, it is a core.

    |On a uniprocessor, if a thread tries to acquire a spin lock
    |that is already held, it will loop forever. Multiprocessor
    |algorithms, however, must operate correctly regardless of the
    |number of processors, which means that they should handle the
    |uniprocessor case as well. This requires strict adherence to
    |the rule that threads not relinquish control of the CPU while
    |holding a spin lock.
    Yes, which is what I was saying.
    I am describing HOW that is accomplished inside an OS.

    Why? First, it's well-known, and second, that's not relevant to
    the discussion at hand.

    But it also doesn't seem like that's what you're saying; you
    asserted that a thread (a software thread in this context) does
    not "hold" a spin lock; that's incorrect.

    Answered above.

    The way you prevent a thread from relinquishing control while holding a
    spinlock is to first switch hats from thread level to software interrupt
    level, which blocks the scheduler because it also runs at SWIL,
    then acquire the spinlock.

    Raising the interrupt priority level from passive (aka thread) level
    to SWI level (called dispatch level on Windows) blocks the thread switch.
    This ensures non-reentrant algorithms are serialized.
    The core may then acquire spinlocks guarding data structures.
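
To make that sequence concrete, a rough sketch; raise_irql, lower_irql,
spin_acquire, spin_release, and sched_lock are hypothetical names invented
for illustration (on Windows the KeAcquireSpinLock/KeReleaseSpinLock pair
bundles the raise and the acquire into one call):

    typedef int irql_t;
    #define PASSIVE_LEVEL  0
    #define DISPATCH_LEVEL 2                      /* SWI/scheduler level; value illustrative */

    irql_t raise_irql(irql_t new_level);          /* returns the previous level */
    void   lower_irql(irql_t old_level);
    void   spin_acquire(volatile long *lock);
    void   spin_release(volatile long *lock);

    volatile long sched_lock;                     /* guards scheduler data; illustrative */

    void
    touch_scheduler_data(void)
    {
        irql_t old = raise_irql(DISPATCH_LEVEL);  /* blocks thread switching on this core */
        spin_acquire(&sched_lock);                /* excludes the other cores */

        /* ... operate on the shared, non-pageable scheduler structures ... */

        spin_release(&sched_lock);
        lower_irql(old);                          /* thread switching possible again */
    }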

    Just so. Earlier you said that threads cannot hold spin locks;
    in the text quoted just above, you said that they do.

    As for this specific scenario of specific interrupt priority
    levels, I would go further and say that is just a detail. In
    the simplest systems, ie, the kind we might reach for as an
    expository example when discussing these kinds of issues, one
    just disables interrupts completely when holding a spinlock.

    As I think you noted elsewhere, disabling interrupts is the same as
    raising the interrupt level to maximum.

    The general intent is to not do that as that limits concurrency
    and causes higher interrupt latencies.

    Threads and mutexes are a fiction created
    by the OS scheduler, and switching threads while waiting for a mutex
    is exactly what they are supposed to do.

    Spinlocks are used by cores to coordinate access to shared memory by
things like an OS scheduler looking at a list of threads waiting for a mutex.
Spinlocks, of the form I was describing, are an optimization for
    cases where the critical section guarded by the lock is short,
    and the overhead of invoking a scheduler and context switching
    to some other thread is higher than just spinning until the lock
    becomes available again. The whole point is to avoid blocking
    and switching.

    E.g., as [Vaj11] puts it (page 98):

    |Spin locks have the advantage of preventing pre-emption by the
    |OS (due to blocking while waiting for the lock)

    That is, the advantage is because the thread does not block and
    yield.
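
To make the distinction concrete, a minimal spin lock in the generic sense
used here, written with C11 atomics; this is a sketch only, with no IRQL
handling or scheduler interaction shown:

    #include <stdatomic.h>

    typedef struct {
        atomic_flag held;
    } spinlock_t;

    #define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

    static inline void
    spin_lock(spinlock_t *l)
    {
        /* busy-wait: keep trying until the previous holder releases */
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            ;
    }

    static inline void
    spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }
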
    And I am describing HOW the OS accomplishes this.

    Well, what you are describing is how _a_ specific family of
    operating systems that you are familiar with accomplishes this,
    and that's all well and good, but it's not all that relevant to
    discussing the pitfalls of this "ATOMIC event" proposal or
    addressing the OP's questions about scenarios in which interrupt
    disable is used.

    Disabling the scheduler is the same as raising the priority to SWI level.
    Once the scheduler is disabled THEN you may acquire a spinlock
    that guards a shared data structure accessed at SWI level.

But you may NOT access a shared data structure accessed at interrupt level.
For that you'd need an interrupt spinlock acquired at the device ISR
    priority interrupt level. And note that the ISR interrupt priority level
    is assigned by the OS itself, usually when the device is mounted.

    You are presuming a very specific system architecture and going
    into great detail about its implementation. That's fine, and
    perhaps useful I suppose, but again, I don't think it's very
    relevant to the context of this discussion.

    I like to think of it as all having to do with hats.
    The cpu is wearing one of three hats: a thread hat when it pretends to be >>>> a time shared execution machine; a core hat when it acts as a single
    execution machine running non-reentrant things like a scheduler which
    creates the fiction of threads and processes (multiple virtual spaces); >>>> and an interrupt hat when executing nestable reentrant interrupts.

    The interrupt mechanism coordinates exactly when and how we switch hats. >>> I suppose you are lumping system calls into the overall bucket
    of "interrupts" in this model, regardless of whether one might
    use a dedicated supervisor call instruction (e.g., something
    like x86 SYSENTER or SYSCALL or ARM SVC or RISC-V's ECALL)?
    Syscall switches a thread from user mode to supervisor mode.

    Your statement was that "the interrupt mechanism coordinates
    exactly when and how we switch hats". But that's incomplete; if
    a system call blocks, for example, the typical behavior is to
    yield to the scheduler to pick something else to do. Clearly,
    that is another way that coordinates this business of changing
    one's hat.

    Yes, I believe that is how Linux worked originally.
That just moves the point where it changes privilege mode and raises the priority level that disables the scheduler to the OS SysCall entry point.

They later changed to what I described, which seems to be how many
    other OS also work, because it allows more kernel mode concurrency.

    The hat analogy feels strained to me, and too colloquial to be
    useful.
    I offered it because it answers your prior questions about how
    OS avoid the kinds of deadlocks and race conditions you described.

    I'm afraid you answered a question that I did not ask, then.

    I am discussing the specifics of this architectural proposal,
    and specifically two aspects of it: one is addressing the
    proposer's questions about scenarios where one might want to
    disable interrupts entirely.

    Another, which has emerged somewhat more recently in the thread,
    is this mechanism which seems to be a semi-transactional
    elaboration of LL-SC primitives.

    Any questions I did ask were Socratic in nature, and in the
    context of these architectural discussions. I did not ask any
    questions about basic design or implementation of
    synchronization primitives.

    You claimed I was wrong when I said spinlocks synchronize between cores.
    So I have repeatedly shown why I am correct.

    - Dan C.

    References:

    [Gol94] Ruth E. Goldberg and Saro Saravanan. 1994. _OpenVMS AXP
    Internals and Data Structures_ (Vers. 1.5). Digital Press,
    Newton, MA.

    [Rus12] Mark Russinovich, David A. Solomon and Alex Ionescu.
    2012. _Windows Internals: Part 1_. Microsoft Press, Redmond, WA.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Sat Mar 29 10:31:38 2025
    Chris M. Thomasson [2025-03-27 19:14:21] wrote:
    On 3/27/2025 7:02 PM, Stefan Monnier wrote:
    Thinking about Haskell's STM (Software Transactional Memory) library:
Oh shit. I have had some bad experiences with STM. Way back. Damn live lock scenarios, etc... Issues with LL/SC live lock wrt false sharing on the reservation granules, etc...

    No wonder: STM, like Haskell, is an attempt to push the boundaries of abstraction (or call it programmer convenience), not performance.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Sat Mar 29 14:28:02 2025
    In article <cb049d5490b541878e264cedf95168e1@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 28 Mar 2025 2:53:57 +0000, Dan Cross wrote:
    [snip]
    What I really wanted was an example that exceeded that limit as
    an expository vehicle for understanding what happens when the
    limit is exceeded. What does the hardware do in that case?

    Raise OPERATION (instruction) Fault.

    Ok, very good. The software implications could be profound.

    [snip]
// by placing all the touching before any manifestation, you put
// all the touch latency* in the code before it has tried to damage any
// participating memory location. (*) and TLB latency and 2nd party
    // observation of your event.

// this would be the point where you would insert if( esmINTERFERENCE() )
    // if you wanted control at a known failure point rather than at the
    // top of the event on failure.

    if (Ehead == a)
    *head = b;
    else if (Ehead == b)
    *head = a;

    b->next = an;
    if (an != NULL) {
    an->prev = b;
    }
    b->prev = ap;
    if (ap != NULL) {
    ap->next = b;
    }

    a->next = bn;
    if (bn != NULL) {
    bn->prev = a;
    }
    if (bp != NULL) {
    bp->next = a;
    }
    esmLOCKstore(a->prev, bp);
    }

// now manifestation has lowest possible latency (as seen by this core alone)

    Doesn't this code now assume that `an->prev` is in the same
    cache line as `an`, and similarly that `bn` is in the same line
    as `bn->prev`? Absent an alignment constraint on the `Node`
    type, that's not guaranteed given the definition posted earlier.

    Everything you state is true, I just tried to move the esmLOCKxxxx
    up to the setup phase to obey ESM rules.

    Yeah, that makes sense. Describing it as a rule, however,
    raises the question of whether one must touch a line before a
    store to a different line, or just before a store to that line?

    That may be an odd and ill-advised thing for a programmer to do
    if they want their list type to work with atomic events, but it
    is possible.

    The node pointers and their respective `next` pointers are okay,
    so I wonder if perhaps this might have been written as:

    void
    swap_places(Node **head, Node *a, Node *b)
    {
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);

    if (a == b)
    return;

    hp = esmLOCKload(*head);
    esmLOCKprefetch(an = esmLOCKload(a->next));
    ap = esmLOCKload(a->prev);
    esmLOCKprefetch(bn = esmLOCKload(b->next));
    bp = esmLOCKload(b->prev);

    if (an != NULL) // I see what you did
    esmLOCKprefetch(an->prev);
    if (bn != NULL) {
    esmLOCKprefetch(bn->prev);
    bn->prev = a;
    }

    if (hp == a)
    *head = b;
    else if (hp == b)
    *head = a;

    b->next = an;
    if (an != NULL)
    an->prev = b;
    b->prev = ap;
    if (ap != NULL)
    ap->next = b;
    // illustrative code
    a->next = bn; // ST Rbp,[Ra,#next]
    if (bp != NULL) // PNE0 Rbp,T
    bp->next = a; // ST Ra,[Rbp,#next]

    esmLOCKstore(a->prev, bp);
    }

    But now the conditional testing whether or not `an` is `NULL` is
    repeated. Is the additional branch overhead worth it here?

In My 66000 ISA, a compare against zero (or NULL) is just a branch
instruction, so the CMP zero is performed twice, but each use is
    but a single Branch-on-condition instruction (or Predicate-on-
    Condition instruction).

In the case of using predicates, FETCH/DECODE will simply issue
    both then and else clauses into the execution window (else-clause
    is empty here) and let the reservation stations handle execution
    order. And the condition latency is purely the register dependence
    chain. A 6-wide machine should have no trouble in inserting two
    copies of the code commented by "illustrative code" above--in
    this case 6-instructions or 2 sets of {ST, PNE0, ST}.

    In the case of using a real branch, latency per use is likely to
    be 2-cycles, moderated by typical branch prediction. The condition
    will have resolved early, so we are simply FETCH/DECODE/TAKE bound.

    {{That is: PRED should be faster in almost all conceivable cases.}}

    Okay, that all makes sense. Thanks for the detailed
    explanation. I agree it's very slick; is this documented
    somewhere publicly? If not, why not?

    Generally when I write queue-code, I use a dummy Node front/rear
    such that the checks for Null are unnecessary (at the cost of
    following 1 more ->next or ->prev). That is Q->head and Q->tail
    are never NULL and when the queue is empty there is a Node which
    carries the fact the queue is empty (not using NULL as a pointer).

    But that is just my style.

    Ok.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Sat Mar 29 15:23:39 2025
    On Sat, 29 Mar 2025 14:28:02 +0000, Dan Cross wrote:

    In article <cb049d5490b541878e264cedf95168e1@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 28 Mar 2025 2:53:57 +0000, Dan Cross wrote:
    [snip]
    What I really wanted was an example that exceeded that limit as
    an expository vehicle for understanding what happens when the
    limit is exceeded. What does the hardware do in that case?

    Raise OPERATION (instruction) Fault.

    Ok, very good. The software implications could be profound.

    [snip]---------

    Doesn't this code now assume that `an->prev` is in the same
    cache line as `an`, and similarly that `bn` is in the same line
    as `bn->prev`? Absent an alignment constraint on the `Node`
    type, that's not guaranteed given the definition posted earlier.

    Everything you state is true, I just tried to move the esmLOCKxxxx
    up to the setup phase to obey ESM rules.

    Yeah, that makes sense. Describing it as a rule, however,
    raises the question of whether one must touch a line before a
    store to a different line, or just before a store to that line?

    Right from the spec::

    "There is an order of instructions imposed upon software <i.e.,
    compiler> where:

    • all participating inbound memory reference instructions shall be performed prior to manifestation;
    • non-participating inbound memory reference instructions should be performed prior to manifestation;
    • the <optional> query instruction denotes the boundary between setup
    and manifestation;
    • the first Outbound to a participating cache line <also> begins manifestation
    • the only Outbound with Lock bit set (L ≡ 1) completes the event

    The processor monitors the instruction sequence of the ATOMIC event and
    will raise the OPERATION exception when the imposed order is violated."

Basically, all participating page-faults happen prior to any modification attempts.
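
As an illustration (my sketch, not from the spec), here is a small event that
follows that order: all three participating lines are touched by esmLOCKload
during setup, the plain stores begin manifestation, and the single esmLOCKstore
carrying the lock bit completes the event. It moves the front node of one list
to the front of another; Node is the type from the earlier examples and
esmLOCKload/esmLOCKstore are the ESM intrinsics used throughout this thread.

    Node *
    move_front(Node **src, Node **dst)
    {
        Node *n, *rest, *dhead;

        // setup: every participating line is touched inbound first
        n = esmLOCKload(*src);             // line holding *src
        if (n == NULL)
            return NULL;                   // nothing to move; no manifestation yet
        rest  = esmLOCKload(n->next);      // line holding n->next (written below)
        dhead = esmLOCKload(*dst);         // line holding *dst

        // optional: if( esmINTERFERENCE() ) { handle failure at a known point }

        // manifestation: the first outbound store begins it ...
        *src    = rest;
        n->next = dhead;
        // ... and the store with the Lock bit set completes the event
        esmLOCKstore(*dst, n);
        return n;
    }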



    That may be an odd and ill-advised thing for a programmer to do
    if they want their list type to work with atomic events, but it
    is possible.

    The node pointers and their respective `next` pointers are okay,
    so I wonder if perhaps this might have been written as:

    void
    swap_places(Node **head, Node *a, Node *b)
    {
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);

    if (a == b)
    return;

    hp = esmLOCKload(*head);
    esmLOCKprefetch(an = esmLOCKload(a->next));
    ap = esmLOCKload(a->prev);
    esmLOCKprefetch(bn = esmLOCKload(b->next));
    bp = esmLOCKload(b->prev);

    if (an != NULL) // I see what you did
    esmLOCKprefetch(an->prev);
    if (bn != NULL) {
    esmLOCKprefetch(bn->prev);
    bn->prev = a;
    }

    if (hp == a)
    *head = b;
    else if (hp == b)
    *head = a;

    b->next = an;
    if (an != NULL)
    an->prev = b;
    b->prev = ap;
    if (ap != NULL)
    ap->next = b;
    // illustrative code
    a->next = bn; // ST Rbp,[Ra,#next]
    if (bp != NULL) // PNE0 Rbp,T
    bp->next = a; // ST Ra,[Rbp,#next]

    esmLOCKstore(a->prev, bp);
    }

    But now the conditional testing whether or not `an` is `NULL` is
    repeated. Is the additional branch overhead worth it here?

In My 66000 ISA, a compare against zero (or NULL) is just a branch
instruction, so the CMP zero is performed twice, but each use is
    but a single Branch-on-condition instruction (or Predicate-on-
    Condition instruction).

In the case of using predicates, FETCH/DECODE will simply issue
    both then and else clauses into the execution window (else-clause
    is empty here) and let the reservation stations handle execution
    order. And the condition latency is purely the register dependence
    chain. A 6-wide machine should have no trouble in inserting two
    copies of the code commented by "illustrative code" above--in
    this case 6-instructions or 2 sets of {ST, PNE0, ST}.

    In the case of using a real branch, latency per use is likely to
    be 2-cycles, moderated by typical branch prediction. The condition
    will have resolved early, so we are simply FETCH/DECODE/TAKE bound.

    {{That is: PRED should be faster in almost all conceivable cases.}}

    Okay, that all makes sense. Thanks for the detailed
    explanation. I agree it's very slick; is this documented
    somewhere publicly? If not, why not?

    Documented: Yes 21 pages.
    Public: I keep holding back as if I were to attempt to patent certain
    aspects. But I don't seem to have the energy to do such. I as if you
    wanted to see it.

    Generally when I write queue-code, I use a dummy Node front/rear
    such that the checks for Null are unnecessary (at the cost of
    following 1 more ->next or ->prev). That is Q->head and Q->tail
    are never NULL and when the queue is empty there is a Node which
    carries the fact the queue is empty (not using NULL as a pointer).

    But that is just my style.

    Ok.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to mitchalsup@aol.com on Sat Mar 29 20:58:44 2025
    In article <1b9a9644c3f5cbd2985b89443041e01a@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 29 Mar 2025 14:28:02 +0000, Dan Cross wrote:
    In article <cb049d5490b541878e264cedf95168e1@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 28 Mar 2025 2:53:57 +0000, Dan Cross wrote:
    [snip]
    What I really wanted was an example that exceeded that limit as
    an expository vehicle for understanding what happens when the
    limit is exceeded. What does the hardware do in that case?

    Raise OPERATION (instruction) Fault.

    Ok, very good. The software implications could be profound.

    [snip]---------

    Doesn't this code now assume that `an->prev` is in the same
    cache line as `an`, and similarly that `bn` is in the same line
    as `bn->prev`? Absent an alignment constraint on the `Node`
    type, that's not guaranteed given the definition posted earlier.

    Everything you state is true, I just tried to move the esmLOCKxxxx
    up to the setup phase to obey ESM rules.

    Yeah, that makes sense. Describing it as a rule, however,
    raises the question of whether one must touch a line before a
    store to a different line, or just before a store to that line?

    Right from the spec::

    "There is an order of instructions imposed upon software <i.e.,
    compiler> where:

• all participating inbound memory reference instructions shall be performed prior to manifestation;
• non-participating inbound memory reference instructions should be performed prior to manifestation;
• the <optional> query instruction denotes the boundary between setup and manifestation;
• the first Outbound to a participating cache line <also> begins manifestation
    • the only Outbound with Lock bit set (L ≡ 1) completes the event

    The processor monitors the instruction sequence of the ATOMIC event and
    will raise the OPERATION exception when the imposed order is violated."

Basically, all participating page-faults happen prior to any modification attempts.

    Ok.

    That may be an odd and ill-advised thing for a programmer to do
    if they want their list type to work with atomic events, but it
    is possible.

    The node pointers and their respective `next` pointers are okay,
    so I wonder if perhaps this might have been written as:

    void
    swap_places(Node **head, Node *a, Node *b)
    {
    Node *hp, *an, *ap, *bn, *bp;

    assert(head != NULL);
    assert(a != NULL);
    assert(b != NULL);

    if (a == b)
    return;

    hp = esmLOCKload(*head);
    esmLOCKprefetch(an = esmLOCKload(a->next));
    ap = esmLOCKload(a->prev);
    esmLOCKprefetch(bn = esmLOCKload(b->next));
    bp = esmLOCKload(b->prev);

    if (an != NULL) // I see what you did
    esmLOCKprefetch(an->prev);
    if (bn != NULL) {
    esmLOCKprefetch(bn->prev);
    bn->prev = a;
    }

    if (hp == a)
    *head = b;
    else if (hp == b)
    *head = a;

    b->next = an;
    if (an != NULL)
    an->prev = b;
    b->prev = ap;
    if (ap != NULL)
    ap->next = b;
    // illustrative code
    a->next = bn; // ST Rbp,[Ra,#next]
    if (bp != NULL) // PNE0 Rbp,T
    bp->next = a; // ST Ra,[Rbp,#next]

    esmLOCKstore(a->prev, bp);
    }

    But now the conditional testing whether or not `an` is `NULL` is
    repeated. Is the additional branch overhead worth it here?

In My 66000 ISA, a compare against zero (or NULL) is just a branch
instruction, so the CMP zero is performed twice, but each use is
    but a single Branch-on-condition instruction (or Predicate-on-
    Condition instruction).

In the case of using predicates, FETCH/DECODE will simply issue
    both then and else clauses into the execution window (else-clause
    is empty here) and let the reservation stations handle execution
    order. And the condition latency is purely the register dependence
    chain. A 6-wide machine should have no trouble in inserting two
    copies of the code commented by "illustrative code" above--in
    this case 6-instructions or 2 sets of {ST, PNE0, ST}.

    In the case of using a real branch, latency per use is likely to
    be 2-cycles, moderated by typical branch prediction. The condition
    will have resolved early, so we are simply FETCH/DECODE/TAKE bound.

    {{That is: PRED should be faster in almost all conceivable cases.}}

    Okay, that all makes sense. Thanks for the detailed
    explanation. I agree it's very slick; is this documented
    somewhere publicly? If not, why not?

    Documented: Yes 21 pages.
    Public: I keep holding back as if I were to attempt to patent certain >aspects. But I don't seem to have the energy to do such. I as if you
    wanted to see it.

    (I assume you meant "ask if you want to see it"?)

    Sure, I think it would be interesting. Should I send you an
    email?

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Dan Cross on Sat Mar 29 22:04:24 2025
    On Sat, 29 Mar 2025 20:58:44 +0000, Dan Cross wrote:

    In article <1b9a9644c3f5cbd2985b89443041e01a@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    ------------
    Okay, that all makes sense. Thanks for the detailed
    explanation. I agree it's very slick; is this documented
    somewhere publicly? If not, why not?

    Documented: Yes 21 pages.
    Public: I keep holding back as if I were to attempt to patent certain >>aspects. But I don't seem to have the energy to do such. I as if you
    wanted to see it.

    (I assume you meant "ask if you want to see it"?)

    Sure, I think it would be interesting. Should I send you an
    email?

    Yes, exactly.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Dan Cross on Sun Mar 30 16:34:16 2025
    Dan Cross wrote:
    In article <t3CFP.529522$f81.217510@fx48.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <8uzEP.115550$Xq5f.66883@fx38.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    [snip]
    Consider a thread that takes a spinlock; suppose some
    high-priority interrupt comes in while the thread is holding
    that lock. In response to the interrupt, software decides to
    suspend the thread and switch some other thread; that thread
    wants to lock the spin lock that the now-descheduled thread is
    holding: a classic deadlock scenario.
    Terminology: mutexes coordinate mutual exclusion between threads,
    spinlocks coordinate mutual exclusion between cpu cores.
    Windows "critical sections" are mutexes with a fast path.
    A spin lock is simply any lock where you spin trying to acquire
    the lock, as opposed to a blocking synchronization protocol.
    Here I'm using the terminology of Herlihy and Shavit [Her08].
    I'm using terminology in common use on multiprocessor systems like
    VMS since 1980's and later WinNT.

    Each OS data structure has different reentrancy requirements.
    - Ones accessed by threads are guarded by mutexes.
    Example on Windows would be the page tables as paging is done by threads.
    - Ones accessed by OS software interrupt level are guarded by spinlocks. >>>> Example on Windows is the Scheduler tables.
    - Ones accessed by HW interrupts are guarded by interrupt spinlocks.
    Example on Windows is device interrupt service routines.
    I'm not sure what you're saying here, or rather, how it is
    relevant.

    Your earlier statement was that "mutexes coordinate mutual
    exclusion between threads, spinlocks coordinate mutual exclusion
    between cpu cores." But this is not the case.

    Taking VMS as an example, from [Gol94], sec 9.5 (page 255):

    |A thread of execution running on one processor acquires a
    |spinlock to serialize access to the data with threads of
    |execution running on other processors. Before acquiring
    |the spinlock, the thread of executions raises IPL to block
    |accesses by other threads of execution running on the same
    |processor. The IPL value is determined by the spinlock being
    |locked.

    Goldberg and Saravanan seems quite explicit that threads can
    acquire and hold spinlocks on OpenVMS, which seems to be
    something that you were objecting to?
    It took me a while to find copy of that book online.
    That 1994 book relates to VMS before they added kernel threads to the OS.

    Before you said you were using terminology commonly used on
    systems "since [the] 1980s". The systems you cited were VMS and
    Windows. I quoted references about both systems from the mid
    1990s and mid 2000s, respectively. If that terminology were
    common in the 1980s, as you suggested, surely it would be in a
    book from 1994? Regardless, see below.

SMP OS (synchronized with spinlocks) was added to VMS in 1988.
    VMS clusters were released in 1983 with its locking system and
    cluster read-write locks can be used to build other synchronization
    structures like mutexes, semaphores, quorums.

I don't know when the term "thread" meaning "scheduled units of execution within a common address space" first showed up. I found a reference to it on
    Unix in 1989. Previously the term "task" was used to mean the same thing,
    such as Ada tasks which on DEC Ada in 1986 were similar to what are now
    called user mode threads. It looks to me that the DEC Ada runtime library
    was later morphed into DECThreads user mode library released in 1991.

    At that point VMS scheduled processes, an execution context plus virtual
    space, like most *nix (except I think Solaris which had kernel threads).

    It says on pg 3: "The term thread of execution as used in this book
    refers to a computational entity, an agent of execution."
    This refers generally to the various code sequences from different context >> that may be intertwined on a processors core for interrupts, exceptions,
    Dpc (called driver forks in VMS), process AST's (software interrupts),
    and applications.

    In a later book on VMS after they added kernel scheduled threads they
    consistently use the two terms "thread of execution" and "kernel thread"
    to distinguish the two meanings.

    Are you referring to the 1997 revision of Goldenberg et al,
    that covers VMS 7.0? [Gol97] In that context, VMS "kernel
    threads" refer to threads of execution within a program that
    are managed and scheduled by the VMS executive (aka kernel).

I saw a reference to the 1997 internals book and that kernel threads were added. However the only other OpenVMS internals document I could find was
    OpenVMS Alpha Internals and Data Structures - Memory Management,
    Goldenberg et al, 2003.

    This is described on page 151, section 5.1:

    |A thread of execution is a sequential flow of instruction
|execution. A traditional program is single-threaded,
    |consisting of one main thread of execution. In contrast, a
    |multithreaded program divides its work among multiple threads
    |that can run concurrently. OpenVMS supplies DECthreads, a
    |run-time library, to facilitate the creation, synchronization,
|and deletion of multiple user-mode threads.
    |
    |OpenVMS Alpha Version 7.0 supports multiple execution contexts
    |within a process: each execution context has its own hardware
|context and stacks, and can execute independently of the others;
|all execution contexts within a process share the same virtual
|address space. The term _kernel thread_ refers to one of
    |those execution contexts and the thread of execution running
    |within it. This volume uses the term _multithreaded_ to refer
    |to a process with multiple kernel threads.
    |
    |The kernel thread is now the basis for OpenVMS scheduling.
    |[...]

    This "kernel thread" is the definition I have been using:
    unit of execution *scheduled* by the operating system.

    It continues on page 152:

    |The term _kernel thread_ is a somewhat confusing name, in that
    |instructions executing in a kernel thread can run in any of the
    |four access modes, not only kernel mode. The term arose to
    |distinguish the current multithreaded model, supported within
    |the OpenVMS executive (or kernel), from the previous model,
    |supported only by DECthreads.

    Sadly, the later books on OpenVMS internals are more focused on
    specific subsystems (scheduling and process management; memory
management) than the earlier volumes. Details of the
    implementation of synchronization primitives are almost entirely
    missing.

    There is discussion about how those primitives are _used_,
    however. [Gol03] mentions spinlocks, such as the SCHED
    spinlock, in the context of memory management data structures.
    E.g., on page 51:

|The swappability of the PHD [Process Header] results in several
    |different methods for synchronizing access to fields within it.
    |Because a PHD can be inswapped to a different balance set slot
|than it last occupied, accesses to a PHD that use its system
    |space address must be synchronized against swapper
    |interference. Accesses from a kernel thread to its own PHD can
    |be made with the SCHED spinlock held to block any rescheduling
    |and possible swapping of the process. Holding the MMG spinlock
    |is another way to block swapping.

    Notably here, that the language is again in terms of (software)
    threads. It seems clear enough that in the VMS context,
    spinlocks are conceptually held by threads, not cores.

    Also Gol91 VAX VMS Internals and Data Structures v5.2 1991

    All three, Gol91, Gol94, Rus12 have a section on spinlocks and
    explicitly say spinlocks synchronize processors, not threads.

    Gol91 pg 172, Gol94 pg 254: "A spinlock is acquired by a processor to synchronize access to data shared by members of an SMP system."

    Rus12 pg 179: "The mechanism the kernel uses to achieve multiprocessor
    mutual exclusion is called a spinlock."

    Furthermore a search for the word "spinlock " in all three documents
    finds repeated references to "processor spinlock". There are NO usages
    of "process spinlock" on VMS or "thread spinlock" on WinNT.

    It is mutexes that synchronize VMS processes or WNT threads.
    The difference is spinlocks do not invoke the scheduler to trigger
    an executable context switch, whereas mutexes may invoke the scheduler.

    There is a section on mutexes in
    Gol91 pg 196, Gol04 pg 277, Rus12 pg 183 Low-IRQL Synchronization

    Gol91 pg 25, Gol94 pg 29
    "Thus, the VMS executive requires a third synchronization tool to allow synchronized access to pageable data structures. This tool must also allow
    a process to be removed from execution while it maintains ownership of the structure in question. One synchronization tool that fulfills these requirements is called a mutual exclusion semaphore (mutex)."

    Gol91 pg 196, Gol94 pg 277
    "However, in some situations requiring synchronization, elevated IPL is an unacceptable technique."
    ...
    "Typically, the threads of execution whose accesses are synchronized
    through a mutex are process context threads."

    Rus12 pg 183 Low-IRQL Synchronization
    "Because waiting for a spinlock literally stalls a processor, spinlocks
    can be used only under the following strictly limited circumstances"
    ...
    "There are several additional synchronization mechanisms for use when
    spinlocks are not suitable:
    - Kernel dispatcher objects
    - Fast mutexes and guarded mutexes
    - Pushlocks
    - Executive resources

    Additionally, user-mode code, which also executes at low IRQL, must be able
    to have its own locking primitives. Windows supports various
    user-mode-specific primitives:
    - Condition variables (CondVars)
    - Slim Reader-Writer Locks (SRW Locks)
    - Run-once initialization (InitOnce)
    - Critical sections"

These books consistently say that spinlocks synchronize *processors*,
    aka cores, at raised IPL (>= dispatch), and mutexes synchronize scheduled processes/threads at lower IPL (< dispatch).

    The reason is that when a kernel thread, executing a system call,
raises the processor's IRQL to >= dispatch, which is required to acquire a spinlock,
    it stops being a scheduled thread, and may have temporarily switched
    kernel stacks as processor interrupt levels require their own stacks.
    At raised IPL many of the OS data structures associated with a kernel thread cannot be touched as they might trigger a page fault which can only be
    serviced by a kernel thread which can sleep, and not by processors/cores
    which can only spin-wait. (Both VMS and WNT will crash/panic if code
    page faults at IPL >= dispatch for exactly this reason.)

    All of this is for controlling access to non-reentrant shared data. Non-reentrant data shared between processors/cores is controlled by
    IPL and blocked by spinlocks. Non-reentrant data shared between kernel
    threads is controlled with scheduler synchronization and thread switching.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to ThatWouldBeTelling@thevillage.com on Mon Mar 31 03:33:45 2025
    In article <tWhGP.1790040$TBhc.304536@fx16.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <t3CFP.529522$f81.217510@fx48.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <8uzEP.115550$Xq5f.66883@fx38.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <kVgEP.1277108$_N6e.605199@fx17.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    [snip]
    Consider a thread that takes a spinlock; suppose some
    high-priority interrupt comes in while the thread is holding
    that lock. In response to the interrupt, software decides to
    suspend the thread and switch some other thread; that thread
    wants to lock the spin lock that the now-descheduled thread is >>>>>>>> holding: a classic deadlock scenario.
    Terminology: mutexes coordinate mutual exclusion between threads, >>>>>>> spinlocks coordinate mutual exclusion between cpu cores.
    Windows "critical sections" are mutexes with a fast path.
    A spin lock is simply any lock where you spin trying to acquire
    the lock, as opposed to a blocking synchronization protocol.
    Here I'm using the terminology of Herlihy and Shavit [Her08].
    I'm using terminology in common use on multiprocessor systems like
    VMS since 1980's and later WinNT.

    Each OS data structure has different reentrancy requirements.
    - Ones accessed by threads are guarded by mutexes.
    Example on Windows would be the page tables as paging is done by threads.
    - Ones accessed by OS software interrupt level are guarded by spinlocks. >>>>> Example on Windows is the Scheduler tables.
    - Ones accessed by HW interrupts are guarded by interrupt spinlocks. >>>>> Example on Windows is device interrupt service routines.
    I'm not sure what you're saying here, or rather, how it is
    relevant.

    Your earlier statement was that "mutexes coordinate mutual
    exclusion between threads, spinlocks coordinate mutual exclusion
    between cpu cores." But this is not the case.

    Taking VMS as an example, from [Gol94], sec 9.5 (page 255):

    |A thread of execution running on one processor acquires a
    |spinlock to serialize access to the data with threads of
    |execution running on other processors. Before acquiring
    |the spinlock, the thread of executions raises IPL to block
    |accesses by other threads of execution running on the same
    |processor. The IPL value is determined by the spinlock being
    |locked.

    Goldberg and Saravanan seems quite explicit that threads can
    acquire and hold spinlocks on OpenVMS, which seems to be
    something that you were objecting to?
    It took me a while to find copy of that book online.
    That 1994 book relates to VMS before they added kernel threads to the OS. >>
    Before you said you were using terminology commonly used on
    systems "since [the] 1980s". The systems you cited were VMS and
    Windows. I quoted references about both systems from the mid
    1990s and mid 2000s, respectively. If that terminology were
    common in the 1980s, as you suggested, surely it would be in a
    book from 1994? Regardless, see below.

SMP OS (synchronized with spinlocks) was added to VMS in 1988.
    VMS clusters were released in 1983 with its locking system and
cluster read-write locks can be used to build other synchronization structures like mutexes, semaphores, quorums.

I don't know when the term "thread" meaning "scheduled units of execution within a common address space" first showed up. I found a reference to it on Unix in 1989. Previously the term "task" was used to mean the same thing, such as Ada tasks which on DEC Ada in 1986 were similar to what are now called user mode threads. It looks to me that the DEC Ada runtime library
    was later morphed into DECThreads user mode library released in 1991.

    The point is that if, as you assert, Windows and VMS both use
    the terminology of _cores_ (or processors) acquiring and holding
    spinlocks instead of threads doing so, there should be some
    reference to that in those books. There is not. The
    terminology on those systems is consistent: threads are the
    things that hold spinlocks; it's axiomatic that those threads
    are running on processors, since the whole point is that it is
    faster to busy-wait on the spinlock rather than potentially
context switch as with a different mutual exclusion primitive,
like what VMS calls a mutex (which is a specific software
    object, not the generic concept of a "mutex" as a mutual
    exclusion primitive. In the latter context, a spinlock is a
    kind of mutex; in the former, it's different).

    At that point VMS scheduled processes, an execution context plus virtual >>> space, like most *nix (except I think Solaris which had kernel threads). >>>
    It says on pg 3: "The term thread of execution as used in this book
    refers to a computational entity, an agent of execution."
    This refers generally to the various code sequences from different context >>> that may be intertwined on a processors core for interrupts, exceptions, >>> Dpc (called driver forks in VMS), process AST's (software interrupts),
    and applications.

    In a later book on VMS after they added kernel scheduled threads they
    consistently use the two terms "thread of execution" and "kernel thread" >>> to distinguish the two meanings.

    Are you referring to the 1997 revision of Goldenberg et al,
    that covers VMS 7.0? [Gol97] In that context, VMS "kernel
    threads" refer to threads of execution within a program that
    are managed and scheduled by the VMS executive (aka kernel).

    I saw reference to 1997 internals book and that kernel threads were added. >However the only other OpenVMS internals document I could find was
    OpenVMS Alpha Internals and Data Structures - Memory Management,
    Goldenberg et al, 2003.

    Ok, I have that one, too.

    This is described on page 151, section 5.1:

    |A thread of execution is a sequential flow of instruction
|execution. A traditional program is single-threaded,
    |consisting of one main thread of execution. In contrast, a
    |multithreaded program divides its work among multiple threads
    |that can run concurrently. OpenVMS supplies DECthreads, a
    |run-time library, to facilitate the creation, synchronization,
|and deletion of multiple user-mode threads.
    |
    |OpenVMS Alpha Version 7.0 supports multiple execution contexts
    |within a process: each execution context has its own hardware
|context and stacks, and can execute independently of the others;
|all execution contexts within a process share the same virtual
|address space. The term _kernel thread_ refers to one of
    |those execution contexts and the thread of execution running
    |within it. This volume uses the term _multithreaded_ to refer
    |to a process with multiple kernel threads.
    |
    |The kernel thread is now the basis for OpenVMS scheduling.
    |[...]

    This "kernel thread" is the definition I have been using:
    unit of execution *scheduled* by the operating system.

    But that's not particularly relevant. The interesting bit vis
    spinlocks in the kernel is what threads are running in kernel
    mode, and how those threads synchronize between themselves to
    control access to kernel objects.

    It continues on page 152:

    |The term _kernel thread_ is a somewhat confusing name, in that
    |instructions executing in a kernel thread can run in any of the
    |four access modes, not only kernel mode. The term arose to
    |distinguish the current multithreaded model, supported within
    |the OpenVMS executive (or kernel), from the previous model,
    |supported only by DECthreads.

    Sadly, the later books on OpenVMS internals are more focused on
    specific subsystems (scheduling and process management; memory
    management) than the earlier volumes. Details of the
    implementation of synchronization primitives are almost entirely
    missing.

    There is discussion about how those primitives are _used_,
    however. [Gol03] mentions spinlocks, such as the SCHED
    spinlock, in the context of memory management data structures.
    E.g., on page 51:

    |The swappability of the PHD [Process Header] results in several
    |different methods for synchronizing access to fields within it.
    |Because a PHD can be inswapped to a different balance set slot
    |than it last occupied, accesses to a PHD that use its system
    |space address must be synchronized against swapper
    |interference. Accesses from a kernel thread to its own PHD can
    |be made with the SCHED spinlock held to block any rescheduling
    |and possible swapping of the process. Holding the MMG spinlock
    |is another way to block swapping.

    Notably here, that the language is again in terms of (software)
    threads. It seems clear enough that in the VMS context,
    spinlocks are conceptually held by threads, not cores.

    Also Gol91 VAX VMS Internals and Data Structures v5.2 1991

    All three, Gol91, Gol94, Rus12 have a section on spinlocks and
    explicitly say spinlocks synchronize processors, not threads.

    Gol91 pg 172, Gol94 pg 254: "A spinlock is acquired by a processor
    to synchronize access to data shared by members of an SMP system."

    Also [Gol91], page 173:

    |A resource synchronized through elevated IPL on a uniprocessor
    |is synchronized through a combination of spinlock and elevated
    |IPL on an SMP system. *A thread of execution* running on one
    |processor acquires a spinlock to serialize access to the data
    |with *threads of execution* running on other processors.
    |Before acquiring, the _thread of execution_ raises IPL to block
    |accesses *by other threads of execution* running on the *same*
    |processor.

    Emphasis added. Note that the primitives that do the acquiring
    and holding are threads.

    (Note, I already quoted this text from [Gol94] in an earlier
    response; page 255.)

    Rus12 pg 179: "The mechanism the kernel uses to achieve multiprocessor
    mutual exclusion is called a spinlock."

    Yes, I already pointed out that this is the usual case. But you
    are still ignoring the uniprocessor case: spinlocks must work
    there, as well.

    But more relevantly, this does not imply that _processors_ "own"
    spinlocks in the way you have asserted. That is done by
    threads. A thread does not magically "turn into" a processor
    when it acquires a spinlock; it remains a thread, albeit one
    that is now holding a spinlock that it wasn't before.

    Furthermore a search for the word "spinlock " in all three documents
    finds repeated references to "processor spinlock". There are NO usages
    of "process spinlock" on VMS or "thread spinlock" on WinNT.

    I don't know what "process spinlock" you are referring to on VMS
    or why you are mentioning it; this seems to be a strawman.

    With respect to Windows, unless you have some very specific
    meaning in mind for the term "thread spinlock": if one takes
    this to be a generic search for the two terms "thread" and
    "spinlock" to see if they occur together, this is demonstrably
    false: I have repeatedly provided specific references and
    quotations showing that Windows refers to _threads_ acquiring
    spinlocks.

    It is mutexes that synchronize VMS processes or WNT threads.

    I don't know what processes on VMS have to do with it. The
    available documentation quite clearly says that spinlocks are
    one of the synchronization mechanisms used by VMS to serialize
    access to critical sections among multiple threads of execution,
    both on the same processor and on other processors in
    multiprocessor configurations. Remember: my objection to your
    terminology is that cores/processors are the things that hold
    spinlocks.

    The difference is that spinlocks do not invoke the scheduler to
    trigger an execution context switch, whereas mutexes may invoke
    the scheduler.

    That's obvious, and I've stated as much repeatedly. Refer to my
    earlier posts where I drew a distinction between blocking
    lock protocols as opposed to spinning protocols. Actually, to
    quote myself from earlier in the thread:

    |A spin lock is simply any lock where you spin trying to acquire
    |the lock, as opposed to a blocking synchronization protocol.
    |Here I'm using the terminology of Herlihy and Shavit [Her08].
    |
    |Traditional Unix "sleep" and "wakeup" are an example of a
    |blocking protocol, where a thread may "sleep" on some
    |"channel", yielding the locking thread while the lock cannot be
    |acquired, presumably scheduling something else until the thread
    |is later marked runnable by virtue of something else calling
    |"wakeup" on the sleep channel.

    There is a section on mutexes in
    Gol91 pg 196, Gol94 pg 277, Rus12 pg 183 Low-IRQL Synchronization

    This is all presented wildly out of context, and doesn't support
    your earlier assertion, that spinlocks are held by cores and not
    threads.

    Gol91 pg 25, Gol94 pg 29
    "Thus, the VMS executive requires a third synchronization tool to allow >synchronized access to pageable data structures. This tool must also allow
    a process to be removed from execution while it maintains ownership of the >structure in question. One synchronization tool that fulfills these >requirements is called a mutual exclusion semaphore (mutex)."

    Yup. Use of VMS "mutex" primitives is a blocking protocol, as
    described above.

    Gol91 pg 196, Gol94 pg 277
    "However, in some situations requiring synchronization, elevated IPL is an >unacceptable technique."
    ...
    "Typically, the threads of execution whose accesses are synchronized
    through a mutex are process context threads."

    Rus12 pg 183 Low-IRQL Synchronization
    "Because waiting for a spinlock literally stalls a processor, spinlocks
    can be used only under the following strictly limited circumstances"

    You omitted important context that makes this clearer. This
    whole section begins, "Executive software outside the kernel
    also needs to synchronize access to global data structures in
    a multiprocessor environment." Here, they're describing other
    mechanisms for achieving synchronization between competing
    software running on different CPUs.

    Here's the full quote corresponding to the snippet above:

    |Spinlocks only partially fill the executive's need for
    |synchronization mechanisms, however. Because waiting for a
    |spinlock literally stalls a processor, spinlocks can only be
    |used under the following strictly limited circumstances:
    |
    |* The protected resource must be accessed quickly and without
    | complicated interactions with other code.
    |* The critical section code can't be paged out of memory,
    | can't make references to pageable data, can't call external
    | procedures (including system services), and can't generate
    | interrupts or exceptions.

    (One presumes they mean synchronous interrupts, here; as we've
    seen, asynchronous interrupts are blocked by means of setting
    the IRQL.)

    But this is all well-known; and the first point is also
    something I've repeated throughout this discussion. Quoting
    myself again:

    |Spinlocks, of the form I was describing, are an optimization
    |for cases where the critical section guarded by the lock is
    |short, and the overhead of invoking a scheduler and context
    |switching to some other thread is higher than just spinning
    |until the lock becomes available again. The whole point is
    |to avoid blocking and switching.

    These criteria from Russinovich et al are entirely consistent
    with this.
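
    For concreteness, a hypothetical sketch of how driver code obeys
    those constraints using the WDM spinlock API (this is illustrative
    driver code, not an excerpt from the book): KeAcquireSpinLock
    raises the calling processor's IRQL to DISPATCH_LEVEL before
    taking the lock, so the guarded work must be short, touch only
    nonpaged data, and make no calls that might fault or block.

    #include <ntddk.h>

    typedef struct _WIDGET_QUEUE {
        KSPIN_LOCK Lock;     /* protects List */
        LIST_ENTRY List;     /* nonpaged, so safe to touch at DISPATCH_LEVEL */
    } WIDGET_QUEUE;

    VOID WidgetEnqueue(WIDGET_QUEUE *q, LIST_ENTRY *item)
    {
        KIRQL oldIrql;

        KeAcquireSpinLock(&q->Lock, &oldIrql); /* raise IRQL, spin if needed */
        InsertTailList(&q->List, item);        /* short, non-pageable work   */
        KeReleaseSpinLock(&q->Lock, oldIrql);  /* drop lock, restore IRQL    */
    }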

    "There are several additional synchronization mechanisms for use when >spinlocks are not suitable:
    - Kernel dispatcher objects
    - Fast mutexes and guarded mutexes
    - Pushlocks
    - Executive resources

    Additionally, user-mode code, which also executes at low IRQL, must
    be able to have its own locking primitives. Windows supports various
    user-mode-specific primitives:
    - Condition variables (CondVars)
    - Slim Reader-Writer Locks (SRW Locks)
    - Run-once initialization (InitOnce)
    - Critical sections"
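
    As a purely illustrative aside on the user-mode primitives listed
    in that quote, here is a minimal sketch using one of them, the Slim
    Reader-Writer lock; the variable and function names are made up for
    this example.

    /* User-mode SRW lock: acquisition may block the calling thread, so
     * the scheduler is free to run other work in the meantime. */
    #include <windows.h>

    static SRWLOCK g_tableLock = SRWLOCK_INIT;
    static int g_table[64];

    void update_entry(int i, int value)
    {
        AcquireSRWLockExclusive(&g_tableLock);  /* writer: exclusive access */
        g_table[i] = value;
        ReleaseSRWLockExclusive(&g_tableLock);
    }

    int read_entry(int i)
    {
        AcquireSRWLockShared(&g_tableLock);     /* readers run in parallel  */
        int v = g_table[i];
        ReleaseSRWLockShared(&g_tableLock);
        return v;
    }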

    These books consistently refer to spinlocks as synchronizing
    *processors*,

    No, they really do not, as the other direct quotations I have
    provided show.

    Spinlocks synchronize threads of execution, on either uniprocessor
    or multiprocessor systems. On multiprocessors they synchronize
    threads of execution running on the same core as well as across
    cores; on uniprocessors they synchronize the threads of execution
    interleaved on the single processor.

    Really, this is standard and normal, and is pretty much how all
    relevant systems describe it, including Windows and VMS.

    aka cores, at raised IPL (>= dispatch), and mutexes synchronize
    scheduled processes/threads at lower IPL (< dispatch).

    See above.

    The reason is that when a kernel thread, executing a system call,
    raises the processor IPL to >= dispatch, which is required to
    acquire a spinlock, it stops being a scheduled thread, and may have
    temporarily switched kernel stacks, as processor interrupt levels
    require their own stacks.

    I think perhaps you mean that processor _modes_, which on x86_64
    means the processor's current protection ring, require their own
    stacks. Windows supports three broad categories of stacks: for
    each kernel-managed thread, a user stack and a kernel stack, and
    for each physical processor, a DPC stack (since, as Russinovich
    et al put it, "DPC's run in arbitrary thread context.")

    When a thread is running in user mode (ring 3), it's running on
    its user-mode stack; when running in kernel mode (ring 0), it's
    running on its kernel-mode stack. DPCs are run on their own
    stack (presumably switched to via some kind of trampoline).
    Systems running on x86 usually also have a handful of special
    stacks for specific types of interrupts (NMIs, double-fault
    handlers, debug interrupts, and machine-check exceptions come to
    mind). A field in the IDT entry for any given
    interrupt/exception vector points to the index in the interrupt
    stack table in the TSS that should be used for that
    interrupt/exception; if not specified, the default ring 0 stack
    pointed to by the RSP0 entry is used. If a trap occurs while
    running in ring 3, the appropriate stack for the trap will be
    selected and the processor will switch to it automatically. If
    already running in ring 0, this won't happen unless the trap
    vector entry in the IDT specifies a non-zero IST index, in which
    case that stack will be selected and switched to
    unconditionally. Note that there are also fields in the TSS for
    stacks for rings 1 and 2, but almost no one uses those (Windows
    does not), in large part because the page table format used by
    Intel and AMD only distinguishes between 0 and 3 for page-level
    permissions.
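
    For concreteness, a sketch of the two hardware structures involved
    in that stack selection; the field names are illustrative, but the
    layouts follow the Intel/AMD manuals.

    #include <stdint.h>

    /* 64-bit TSS: holds the per-ring and per-IST stack pointers. */
    struct tss64 {
        uint32_t reserved0;
        uint64_t rsp0;        /* default stack for a ring 3 -> ring 0 trap */
        uint64_t rsp1, rsp2;  /* rings 1 and 2: present but rarely used    */
        uint64_t reserved1;
        uint64_t ist[7];      /* Interrupt Stack Table entries 1..7        */
        uint64_t reserved2;
        uint16_t reserved3;
        uint16_t iopb_offset; /* I/O permission bitmap offset              */
    } __attribute__((packed));

    /* 64-bit IDT gate: the low three bits of 'ist' select an IST stack;
     * zero means no unconditional stack switch for that vector. */
    struct idt_gate64 {
        uint16_t offset_low;
        uint16_t selector;    /* kernel code segment selector */
        uint8_t  ist;         /* bits 0-2: IST index, 0 = none */
        uint8_t  type_attr;   /* gate type, DPL, present bit   */
        uint16_t offset_mid;
        uint32_t offset_high;
        uint32_t reserved;
    } __attribute__((packed));

    /* An OS might point the NMI and double-fault vectors at dedicated
     * IST stacks while leaving ist == 0 for ordinary interrupts, which
     * then use RSP0 (or the current stack when taken from ring 0). */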

    The sysenter/syscall mechanism is a little more involved here
    (in particular, it does not automatically load the stack pointer
    register to point to the kernel-mode stack), and the system call
    entry point (e.g., the address pointed to by the LSTAR MSR on
    x64 when considering SYSCALL) must be carefully written.

    Since on entry to the code pointed at by LSTAR the stack pointer
    for the currently executing thread is not changed, %rsp still
    points to the user-mode stack. But the processor is in kernel
    mode, so interrupt delivery won't generally trigger switching to
    the stack in RSP0 in the TSS (if specified in the IDT entry for
    a given trap, switching to one of the other stacks in the IST
    will happen, but those are typically reserved for a limited set
    of traps that require special handling with respect to stacks;
    there are only seven IST entries in the TSS anyway). So a
    system may inhibit interrupt delivery for a short time until it
    has saved user-mode state and set up its kernel-mode stack.
    But systems need not inhibit interrupt delivery while servicing
    most system calls, and indeed, they often do not. Importantly,
    inhibiting interrupt delivery initially isn't to avoid being
    rescheduled, but rather to protect the kernel from faulting if
    the user mode stack is trashed (and to avoid accidentally
    leaking kernel data into user mode via taking an interrupt on
    the user-mode stack).
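
    A hypothetical boot-time sketch of that SYSCALL setup (the MSR
    numbers are from the AMD64 manuals; the function names and the
    syscall_entry stub are invented for illustration). Putting IF in
    SFMASK is what keeps interrupts inhibited from SYSCALL entry until
    the stub has switched off the user stack and re-enabled them.

    #include <stdint.h>

    #define MSR_EFER   0xC0000080u  /* EFER.SCE (bit 0) enables SYSCALL   */
    #define MSR_STAR   0xC0000081u  /* kernel/user segment selector bases */
    #define MSR_LSTAR  0xC0000082u  /* 64-bit SYSCALL entry point         */
    #define MSR_SFMASK 0xC0000084u  /* RFLAGS bits cleared on entry       */
    #define RFLAGS_IF  (1u << 9)

    extern void syscall_entry(void); /* asm stub: saves user %rsp, loads
                                        the kernel stack, then re-enables
                                        interrupts once it is safe        */

    static inline void wrmsr(uint32_t msr, uint64_t val)
    {
        __asm__ volatile("wrmsr" :: "c"(msr), "a"((uint32_t)val),
                                    "d"((uint32_t)(val >> 32)));
    }

    static inline uint64_t rdmsr(uint32_t msr)
    {
        uint32_t lo, hi;
        __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
        return ((uint64_t)hi << 32) | lo;
    }

    void setup_syscall(uint64_t star_selectors)
    {
        wrmsr(MSR_LSTAR, (uint64_t)(uintptr_t)syscall_entry);
        wrmsr(MSR_STAR, star_selectors);      /* CS/SS base layout is ABI-specific */
        wrmsr(MSR_SFMASK, RFLAGS_IF);         /* enter with interrupts inhibited   */
        wrmsr(MSR_EFER, rdmsr(MSR_EFER) | 1); /* set EFER.SCE                      */
    }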

    At raised IPL many of the OS data structures associated with a
    kernel thread cannot be touched as they might trigger a page fault
    which can only be serviced by a kernel thread which can sleep, and
    not by processors/cores which can only spin-wait. (Both VMS and WNT
    will crash/panic if code page faults at IPL >= dispatch for exactly
    this reason.)

    No, that's not really the issue.

    A page fault is a synchronous exception; similarly with a
    software interrupt (e.g., the effect of the `INT` instruction on
    x86). These are always delivered regardless of what the current
    thread has set the interrupt priority threshold to on the
    current processor, or whether it has locally inhibited interrupt
    delivery (e.g., via the `CLI` instruction on x86_64). If such
    were to be allowed while the current thread is holding a
    spinlock, then the trap handler could try and acquire that same
    spinlock (perhaps indirectly), which would result in a deadlock.
    Similarly with why threads holding spinlocks shouldn't invoke
    system calls themselves.
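
    A contrived sketch of that deadlock, reusing the invented
    spin_acquire()/spin_release() primitives from the earlier
    illustration; everything here is hypothetical.

    /* The "fault handler" tries to take a non-recursive spinlock that
     * the faulting thread of execution already holds on the same CPU,
     * so it spins forever and nothing can ever release the lock. */
    static spinlock_t vm_lock = SPINLOCK_INIT;

    void page_fault_handler(void)
    {
        spin_acquire(&vm_lock);   /* deadlock: already held below us   */
        /* ... fix up the mapping ... */
        spin_release(&vm_lock);
    }

    void touch_pageable_data(volatile char *paged_out_ptr)
    {
        spin_acquire(&vm_lock);
        (void)*paged_out_ptr;     /* faults -> page_fault_handler() runs
                                     on this same CPU -> deadlock       */
        spin_release(&vm_lock);   /* never reached */
    }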

    The statement, "...and not by processors/cores which can only
    spin-wait" is not meaningful. Again "processors" and "cores"
    are hardware devices, not software objects. They are not
    restricted to spin-waiting; they just execute instructions and
    do other processor-y things.

    You continue to try and force these terms to mean something else
    but still have not produced any reference that uses those terms
    in that manner.

    All of this is for controlling access to non-reentrant shared data.
    Non-reentrant data shared between processors/cores is controlled by
    IPL and blocked by spinlocks. Non-reentrant data shared between
    kernel threads is controlled with scheduler synchronization and
    thread switching.

    I'm afraid you've got the details rather muddled.

    For instance, saying that, "non-reentrant data shared between
    processors/cores is controlled by IPL and blocked by spinlocks"
    is rather meaningless; those words just don't relate to each
    other that way.

    The correct thing to say is that spinlocks are mutual exclusion
    structures that can be used to serialize access to data shared
    between multiple concurrent threads of execution, running either
    on the same processor or in parallel across multiple processors.

    You also don't explain how "non-reentrant data shared between
    kernel threads" is appreciably different from, "non-reentrant
    data shared between processors/cores." For that matter, one
    does not usually speak of _data_ as being reentrant; rather,
    reentrancy is a concept one usually applies to programs.

    I think this stems from this idea you seem to have that threads
    somehow turn into "processors/cores", whatever that means, as
    these are obviously not the hardware devices, when interrupts
    are disabled on the CPUs they are running on. Near as I can
    tell this is your own unique invention, and nothing else uses
    that terminology; use by the systems you cited is not supported
    by evidence.

    Consequently, I find your repeated assertions about these terms
    rather strange.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Dan Cross on Tue Apr 1 10:07:57 2025
    Dan Cross wrote:
    In article <tWhGP.1790040$TBhc.304536@fx16.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Also Gol91 VAX VMS Internals and Data Structures v5.2 1991

    All three, Gol91, Gol94, Rus12 have a section on spinlocks and
    explicitly say spinlocks synchronize processors, not threads.

    Gol91 pg 172, Gol94 pg 254: "A spinlock is acquired by a processor to
    synchronize access to data shared by members of an SMP system."

    Also [Gol91], page 173:

    |A resource synchronized through elevated IPL on a uniprocessor
    |is synchronized through a combination of spinlock and elevated
    |IPL on an SMP system. *A thread of execution* running on one
    |processor acquires a spinlock to serialize access to the data
    |with *threads of execution* running on other processors.
    |Before acquiring, the _thread of execution_ raises IPL to block
    |accesses *by other threads of execution* running on the *same*
    |processor.

    Emphasis added. Note that the primitives that do the acquiring
    and holding are threads.

    (Note, I already quoted this text from [Gol94] in an earlier
    response; page 255.)

    And I've tried to get you to stop treating the terms "thread of execution"
    and "kernel thread" as equivalent. They are not.

    It says "thread of execution" which as I have explained is not a
    necessarily a kernel thread (or in this case a VMS process).
    It could be an interrupt or a Dpc, or code triggered as a consequence
    of actions of any of those which has no connection to whichever
    random kernel thread was running on a processor at that time.

    All of those "threads of execution" could acquire a spinlock at their associated IPL but only one of those is due to the current kernel thread.

    Even if that was a kernel thread that raised its IPL and performed
    some kernel action, when it tries to lower its IPL < dispatch,
    that can trigger the Dispatcher which may invoke the Scheduler
    and ThreadSwitcher. As that transition into Dispatcher might trigger
    a thread switch, it conceptually[1] dumps its thread register context
    in case a switch is needed.

    And all of those consequences which were not directly invoked by the
    current kernel thread but were triggered by it could acquire spinlocks.
    Those triggered "threads of execution" were set up by other
    "threads of execution", some of which may be kernel threads,
    but are unrelated to the currently executing kernel thread.

    On entry to the Dispatcher, the register set, the thread of execution, conceptually[1] is no longer connected to whichever kernel thread
    was running on the processor. The private processor state information
    maintains a reference to its current kernel thread whose context will be restored when it exits the Dispatcher.

    [1] I say conceptually because with careful layering of the code for
    entering and leaving the Dispatcher/Scheduler/ThreadSwitcher the OS
    can defer saving some of the kernel thread's registers into the kernel
    thread context until the actual thread switch occurs, as long as it
    does not modify those registers. x64 FPU or SIMD registers might be examples.

    Rus12 pg 179: "The mechanism the kernel uses to achieve multiprocessor
    mutual exclusion is called a spinlock."

    Yes, I already pointed out that this is the usual case. But you
    are still ignoring the uniprocessor case: spinlocks must work
    there, as well.

    But more relevantly, this does not imply that _processors_ "own"
    spinlocks in the way you have asserted. That is done by
    threads. A thread does not magically "turn into" a processor
    when it acquires a spinlock; it remains a thread, albeit one
    that is now holding a spinlock that it wasn't before.

    I've pointed you at the sections of those books which repeatedly refer to
    them as "processor spinlocks". Rus12 on pg 181 even shows a WNT kernel
    debugger command in "EXPERIMENT: Viewing Global Queued Spinlocks" which
    *displays a list of queued spinlocks held by each processor*.
    That list is stored in the Processor Control Region PCR,
    a per-processor status block WNT maintains.

    If that is not enough to convince you, nothing will.

    I've tried my best to explain it and pointed you to the text that supports
    this but you remain unconvinced and this is just going around in circles
    and using up both of our time. As your interpretation works for you and
    mine works for me, we'll just have to agree to disagree.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to ThatWouldBeTelling@thevillage.com on Tue Apr 1 17:39:00 2025
    In article <WrSGP.1475033$eNx6.441549@fx14.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Dan Cross wrote:
    In article <tWhGP.1790040$TBhc.304536@fx16.iad>,
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Also Gol91 VAX VMS Internals and Data Structures v5.2 1991

    All three, Gol91, Gol94, Rus12 have a section on spinlocks and
    explicitly say spinlocks synchronize processors, not threads.

    Gol91 pg 172, Gol94 pg 254: "A spinlock is acquired by a processor to
    synchronize access to data shared by members of an SMP system."

    Also [Gol91], page 173:

    |A resource synchronized through elevated IPL on a uniprocessor
    |is synchronized through a combination of spinlock and elevated
    |IPL on an SMP system. *A thread of execution* running on one
    |processor acquires a spinlock to serialize access to the data
    |with *threads of execution* running on other processors.
    |Before acquiring, the _thread of execution_ raises IPL to block
    |accesses *by other threads of execution* running on the *same*
    |processor.

    Emphasis added. Note that the primitives that do the acquiring
    and holding are threads.

    (Note, I already quoted this text from [Gol94] in an earlier
    response; page 255.)

    And I've tried to get you to stop treating the terms "thread of
    execution" and "kernel thread" as equivalent. They are not.

    Yes, and what I've tried to explain to you, over and over, is
    that in this context it really doesn't matter. A "kernel
    thread" *can run in kernel mode*.

    It says "thread of execution" which as I have explained is not a
    necessarily a kernel thread (or in this case a VMS process).

    This is well-known. It is also irrelevant. A processor, or
    core, or whatever, is simply not a "thread of execution"; it is
    the thing that "threads of execution" run on. You are trying to
    ascribe to a physical object the attributes of a logical entity.

    It could be an interrupt or a Dpc, or code triggered as a consequence
    of actions of any of those which has no connection to whichever
    random kernel thread was running on a processor at that time.

    All of those "threads of execution" could acquire a spinlock at their >associated IPL but only one of those is due to the current kernel thread.

    Oh, I see your confusion. The thing you seem to be missing is
    that all of those "threads of execution" _are running in the
    kernel in the context of some software thread_. Those software
    threads are the entities that own a spinlock when it is acquired.
    In the case of Windows, VMS, etc, usually that's some "kernel
    thread", while it is is running in the kernel, but I would not
    be surprised if they had other kinds of dedicated software
    threads they used only in the kernel for special purposes.

    Even if that was a kernel thread that raised its IPL and performed
    some kernel action, when it tries to lower its IPL < dispatch,
    that can trigger the Dispatcher which may invoke the Scheduler
    and ThreadSwitcher. As that transition into Dispatcher might trigger
    a thread switch, it conceptually[1] dumps its thread register context
    in case a switch is needed.

    And all of those consequences which were not directly invoked by the
    current kernel thread but were triggered by it could acquire spinlocks.
    Those triggered "threads of execution" were set up by other
    "threads of execution", some of which may be kernel threads,
    but are unrelated to the currently executing kernel thread.

    On entry to the Dispatcher, the register set, the thread of
    execution, conceptually[1] is no longer connected to whichever
    kernel thread was running on the processor. The private processor
    state information maintains a reference to its current kernel
    thread whose context will be restored when it exits the Dispatcher.

    [1] I say conceptually because with careful layering of the code for
    entering and leaving the Dispatcher/Scheduler/ThreadSwitcher the OS
    can defer saving some of the kernel thread's registers into the kernel
    thread context until the actual thread switch occurs, as long as it
    does not modify those registers. x64 FPU or SIMD registers might be examples.

    I'm sorry, I think you're confused here in a similar manner to
    how you were confused about the way system calls work.

    Rus12 pg 179: "The mechanism the kernel uses to achieve multiprocessor
    mutual exclusion is called a spinlock."

    Yes, I already pointed out that this is the usual case. But you
    are still ignoring the uniprocessor case: spinlocks must work
    there, as well.

    But more relevantly, this does not imply that _processors_ "own"
    spinlocks in the way you have asserted. That is done by
    threads. A thread does not magically "turn into" a processor
    when it acquires a spinlock; it remains a thread, albeit one
    that is now holding a spinlock that it wasn't before.

    I've pointed you at the sections of those books which repeatedly
    refer to them as "processor spinlocks".

    No, you really didn't. You did point to things in the book that
    appeared to agree with you, but you cherry-picked those and
    ignored the larger context, where it invariably meant something
    else that _disagreed_ with your interpretation. Furthermore,
    you ignored anything that didn't already fit your preconception
    of how these things fit together.

    Rus12 on pg 181 even shows a WNT kernel
    debugger command in "EXPERIMENT: Viewing Global Queued Spinlocks" which
    *displays a list of queued spinlocks held by each processor*.
    That list is stored in the Processor Control Region PCR,
    a per-processor status block WNT maintains.

    If that is not enough to convince you, nothing will.

    Here is another example of cherry-picking out of context. The
    reference here is to queuing spin locks, where Windows tries to
    avoid scalability collapse due to cache-coherency overhead when
    threads running on different processors race on acquiring a
    heavily contended (spin)lock. The context is obvious when
    one considers the surrounding text in addition to that sentence
    fragment: they are clearly talking about concurrent threads
    running on different processors racing in parallel.

    As an aside, I already pointed to a paper about the topic of
    non-scalable locks in the context of Linux (Boyd-Wickizer et
    al): http://pdos.csail.mit.edu/papers/linux:lock.pdf
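
    For readers unfamiliar with what "queued" means here, a minimal
    MCS-style queued lock sketch (illustrative only, not the Windows
    implementation): each waiter enqueues a node of its own and spins
    on a flag in that node, so contended waiting hits a local cache
    line instead of the shared lock word, which is what avoids the
    scalability collapse described in that paper.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool                locked;  /* true while this waiter spins */
    };

    typedef _Atomic(struct mcs_node *) mcs_lock;  /* tail of waiter queue */

    void mcs_acquire(mcs_lock *lock, struct mcs_node *me)
    {
        atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);

        struct mcs_node *prev =
            atomic_exchange_explicit(lock, me, memory_order_acq_rel);
        if (prev == NULL)
            return;                               /* lock was free        */
        atomic_store_explicit(&prev->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;                                     /* spin on our own node */
    }

    void mcs_release(mcs_lock *lock, struct mcs_node *me)
    {
        struct mcs_node *next =
            atomic_load_explicit(&me->next, memory_order_acquire);
        if (next == NULL) {
            struct mcs_node *expected = me;
            if (atomic_compare_exchange_strong_explicit(
                    lock, &expected, NULL,
                    memory_order_acq_rel, memory_order_acquire))
                return;                           /* no one was waiting   */
            while ((next = atomic_load_explicit(&me->next,
                                                memory_order_acquire)) == NULL)
                ;                                 /* waiter mid-enqueue   */
        }
        atomic_store_explicit(&next->locked, false, memory_order_release);
    }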

    I've tried my best to explain it and pointed you to the text that
    supports this but you remain unconvinced and this is just going
    around in circles
    and using up both of our time. As your interpretation works for you and
    mine works for me, we'll just have to agree to disagree.

    Probably for the best.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)