• Re: Interrupt enable down-count

    From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 19:23:03 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/29/2025 6:29 AM, Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

Complex...
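
For concreteness, the adaptive down-count described above might look
roughly like this in C (a sketch only; the names, the cap, and the
doubling policy are hypothetical, not the actual Qupls4 logic):

    #include <stdbool.h>

    #define MIN_DEFER 10     /* minimum down-count, in clocks (per the post) */
    #define MAX_DEFER 1000   /* hypothetical cap: beyond this, a stuck IRQ   */

    static unsigned base_defer = MIN_DEFER; /* escalates while thrashing */
    static unsigned defer_cnt;              /* current down-count        */

    /* Called once per clock in which the front end advances. */
    bool irq_may_be_accepted(bool irq_pending, bool isr_was_flushed,
                             bool ei_executed)
    {
        if (isr_was_flushed) {            /* looping on ISR fetch: back  */
            if (base_defer < MAX_DEFER)   /* off for longer next time    */
                base_defer *= 2;
            defer_cnt = base_defer;       /* restart the deferral window */
        }
        if (ei_executed) {
            base_defer = MIN_DEFER;       /* EI resets to the minimum    */
            defer_cnt = 0;
        }
        if (defer_cnt > 0) {
            defer_cnt--;
            return false;                 /* still deferring acceptance  */
        }
        return irq_pending;
    }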

    A simple alternative that I have seen is to have an instruction that
    enables interrupts and jumps to somewhere, probably either the
    interrupted code or the dispatcher that might do a full context switch.
    The ISR would issue this instruction when it has saved everything that
    is necessary to handle the interrupt and thus could be interrupted
again. This minimizes the time interrupts are locked out without the
    need for an arbitrary timer, etc.

    Another alternative is to allow ISRs to be interrupted by ISRs of higher priority. All you need here is a clean and precise definition of priority
    and when said priority gets associated with any given interrupt.

    My 66000 goes so far as to never need to disable interrupts because all interrupts of the same or lower priority are automatically disabled by
    the priority of the current ISR/running-thread. That is, one arrives
    at the ISR with interrupts enabled and in a reentrant state with the
    priority given by the I/O MMU when device sent ISR message to MSI-X
    queue.

    If/when an ISR needs to be sure it is not interrupted, it can change
    priority in 1 instruction to "highest" and have the system not allow
    the I/O MMU to associate said "exclusive" priority with any device
    interrupt. When ISR returns, priority reverts to priority at the time
    the interrupt was taken. {No need to back down on priority} This only
    requires that there are enough priorities to spare one exclusively to
    the system.
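
The acceptance rule this implies is just a priority compare; as a
hedged C sketch of the idea (illustrative names only, not the actual
My 66000 mechanism):

    enum { NPRIO = 64 };            /* 64 priority levels             */
    enum { PRIO_EXCLUSIVE = 63 };   /* reserved: the I/O MMU never    */
                                    /* hands this level to any device */
    static unsigned cur_prio;       /* priority of the running        */
                                    /* thread or ISR                  */

    /* An interrupt is taken only if strictly higher priority than
       whatever is running; no DI/EI instructions are involved.      */
    int interrupt_taken(unsigned irq_prio)
    {
        if (irq_prio <= cur_prio)
            return 0;                 /* same or lower: stays queued  */
        unsigned saved = cur_prio;    /* hardware-saved on entry      */
        cur_prio = irq_prio;          /* ISR runs at IRQ's priority,  */
                                      /* interrupts still enabled     */
        /* ... ISR body; it may set cur_prio = PRIO_EXCLUSIVE in one
           instruction to become uninterruptible ... */
        cur_prio = saved;             /* return-from-interrupt        */
        return 1;                     /* reverts; no back-down step   */
    }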

    EricP has argued that 8-I/O priority levels are enough. I argue that
    64 priority levels are enough for {Guest OS, Host OS, HyperVisor}
    to each have their own somewhat-coordinated structure of priorities.
    AND further I argue that given one is designing a 64-bit machine,
that 64 priority levels are de rigueur.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 15:42:13 2025
    From Newsgroup: comp.arch

    On 2025-11-29 2:05 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
    delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred
    because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be
    committed because the IRQs got disabled in the meantime. If the CPU were
    allowed to accept another IRQ right away, it could get stuck in a loop
    flushing the pipeline and reloading with the ISR routine code instead of
    progressing through the code where IRQs were disabled.

The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instructions in the pipe
    are allowed to retire (apace) and new instructions are inserted from
    the interrupt service point.

    That is how Qupls is working too. The issue is what happens when the instruction in the pipe before the ISR disables the interrupt. Then the
    ISR instructions need to be flushed.

    As long as the instructions "IN" the pipe
can deliver their results to their registers, and update µArchitectural state they "own", there is no reason to flush--AND--no corresponding
    reason to delay "taking" the interrupt.

    That is the usual case for Qupls too when there is an interrupt.

At the µArchitectural level, you, the designer, see both the front
    and the end of the pipeline, you can change what goes in the front
    and allow what was already in the pipe to come out the back. This
    requires dragging a small amount of information down the pipe, much
    like multi-threaded CPUs.

    Yes, the IRQ info is being dragged down the pipe.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU
    would advance in 40 instruction burps. Alternating between fetching ISR
    instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is looping
    around fetching ISR instructions. The down-count would be reset to the
    minimum again once an interrupt enable instruction is executed.

Complex...

    Make the problem "go away". You will be happier in the end.

    The interrupt mask is set at fetch time to disable lower priority
    interrupts. I suppose disabling of interrupts by the OS could simply be ignored. The interrupt could only be taken if it is a higher priority
    than the current level.

    I had thought the OS might have good reason to disable interrupts. But
    maybe I am making things too complex.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Nov 29 16:10:45 2025
    From Newsgroup: comp.arch

    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

Complex...


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the
    pipeline to drain the old stream before accepting the interrupt and
redirecting Fetch to its handler. That way if there are any interrupt
enable or disable instructions, or branch mispredicts, or pending exceptions in-flight, they are all allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and
    expensive real quick.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 22:17:36 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 2:05 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
    delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred
    because interrupts got disabled by an instruction in the pipeline. I
guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop
flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instructions in the pipe
    are allowed to retire (apace) and new instructions are inserted from
    the interrupt service point.

    That is how Qupls is working too. The issue is what happens when the instruction in the pipe before the ISR disables the interrupt. Then the
    ISR instructions need to be flushed.

    As a general rule of thumb:: an instruction is not "performed" until
    after it retires. {when you cannot undo its deeds}

    Consider the case where you redirect the front of the pipe to an ISR and
    an instruction already in the pipe raises an exception. Here, what I do
    {and have done in the past} is to not retire instructions after the
    exception, so the ISR is not delayed and IP ends up pointing at the
    excepting instruction.

    Since you started ISR before you retired DI, you can treat DI as an
    exception. {DI after ISR control transfer}. If, on the other hand,
    you perform DI at the front of the pipe, you don't "accept" the ISR
    until EI.

    As long as the instructions "IN" the pipe
can deliver their results to their registers, and update µArchitectural state they "own", there is no reason to flush--AND--no corresponding
    reason to delay "taking" the interrupt.

    That is the usual case for Qupls too when there is an interrupt.

At the µArchitectural level, you, the designer, see both the front
    and the end of the pipeline, you can change what goes in the front
    and allow what was already in the pipe to come out the back. This
    requires dragging a small amount of information down the pipe, much
    like multi-threaded CPUs.

    Yes, the IRQ info is being dragged down the pipe.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU
    would advance in 40 instruction burps. Alternating between fetching ISR
    instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is looping
    around fetching ISR instructions. The down-count would be reset to the
    minimum again once an interrupt enable instruction is executed.

Complex...

    Make the problem "go away". You will be happier in the end.

    The interrupt mask is set at fetch time to disable lower priority interrupts. I suppose disabling of interrupts by the OS could simply be ignored. The interrupt could only be taken if it is a higher priority
    than the current level.

    I had thought the OS might have good reason to disable interrupts. But
    maybe I am making things too complex.


The OS DOES have good reasons to DI "every once in a while"; IIRC from my conversations with EricP, these are short sequences the OS needs
    to be ATOMIC across all OS threads--and almost always without the
    possibility that the ATOMIC event fails {which can happen in user code}.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 22:26:21 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

Complex...


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    Yes, exactly::

    Consider a GBOoO processor that performs a LD R9,[deviceCR].

    a) all earlier memory references have to be seen globally
   before this LD can be seen globally. {dozens of cycles}
    b) this LD has to arrive at HostBridge. {dozens of cycles}
c) HostBridge sends request down PCIe {hundreds of cycles}
    d) device responds to LD {handful of cycles}
    e) PCIe transports response to HB {hundreds of cycles}
    f) HB transfers response to requestor {dozens of cycles}
    g) CPU is allowed to re-enter OoO {handful of cycles}

    Accesses to devices need to have most of the properties of
    "Sequential Consistency" as defined by Lamport.

    Now, several LDs [DeviceCRs] can be seen globally and in order
before the first (or all) responses return, but you are going to see all
that latency in the pipeline; OoO memory requests are not one
    of them.
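
Plugging rough numbers into a) through g) above (taking "dozens" as
roughly 50, "hundreds" as roughly 300, and "handful" as roughly 5;
illustrative values only) gives:

    50 + 50 + 300 + 5 + 300 + 50 + 5  ~=  760 cycles

for a single device-register load, far beyond any 10-cycle down-count.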

I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are any interrupt
enable or disable instructions, or branch mispredicts, or pending exceptions in-flight, they are all allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 17:45:17 2025
    From Newsgroup: comp.arch

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

Complex...


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

The down-count counts down only when the front-end of the pipeline advances, so instructions are sure to be loaded.

I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are any interrupt
enable or disable instructions, or branch mispredicts, or pending
exceptions in-flight, they are all allowed to finish and the state to
settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    The base down count increases every time the IRQ is found at the commit
    stage. If the base down count is too large (stuck interrupt) then an
exception is processed; for instance, if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

Treating the DI as an exception, as mentioned in another post, would also
    work. It is a matter then of flushing the instructions between the DI
    and ISR.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 23:14:23 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

Complex...


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

The down-count counts down only when the front-end of the pipeline advances, so instructions are sure to be loaded.

I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are any interrupt enable or disable instructions, or branch mispredicts, or pending exceptions
in-flight, they are all allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    The base down count increases every time the IRQ is found at the commit stage. If the base down count is too large (stuck interrupt) then an exception is processed. For instance if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

Treating the DI as an exception, as mentioned in another post, would also work. It is a matter then of flushing the instructions between the DI
    and ISR.

Which is no different than flushing instructions after a mispredicted branch.
--- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 29 23:37:21 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    (Looking at your
    code, it also does not seem to be self-sufficient, at least the
    numerous SKIP4 statements require something else).

    If you want to assemble the resulting .S file, it's assembled once
    with

    -DSKIP4= -Dgforth_engine2=gforth_engine

    and once with

    -DSKIP4=".skip 4"

    (on Linux-GNU AMD64, the .skip assembler directive is autoconfigured
    and may be different on other platforms).

    My assumption is that the control flow is confusing gcc.

    My guess is the same.

    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

void foo(unsigned long u1)
{
    long u3;
    u1 = u1 / 10;
    u3 = u1 % 10;
    bar(u1, u3);
}

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 02:17:10 2025
    From Newsgroup: comp.arch

    On 2025-11-29 6:14 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

Complex...


You are using this timer to predict the delay for draining the pipeline.
It would only take a read of a slow IO device register to exceed it.

The down-count counts down only when the front-end of the pipeline
advances, so instructions are sure to be loaded.

I was thinking a simple and cheap way would be to use a variation of the
single-step mechanism. An interrupt request would cause Decode to emit a
special uOp with the single-step flag set and then stall, to allow the
pipeline to drain the old stream before accepting the interrupt and
redirecting Fetch to its handler. That way if there are any interrupt
enable or disable instructions, or branch mispredicts, or pending
exceptions in-flight, they are all allowed to finish and the state to
settle down.

    Pipelining interrupt delivery looks possible but gets complicated and
    expensive real quick.



    The base down count increases every time the IRQ is found at the commit
    stage. If the base down count is too large (stuck interrupt) then an
    exception is processed. For instance if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

Treating the DI as an exception, as mentioned in another post, would also
    work. It is a matter then of flushing the instructions between the DI
    and ISR.

    Which is no different than flushing instructions after a mispredicted branch.

Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot simpler.
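
A sketch of what that periodic insertion might look like in C (the
names, BOI encoding, and interval are hypothetical; not the actual
Qupls4 translator):

    #define POLL_INTERVAL 16        /* insert a poll every 16 uOps       */

    extern unsigned decode_to_uop(unsigned insn);  /* normal translation */
    extern void     emit(unsigned uop);
    #define UOP_BOI 0xB01u          /* hypothetical BOI encoding: branch */
                                    /* taken only if an IRQ is pending   */
    static unsigned since_poll;

    void translate(unsigned insn)
    {
        if (++since_poll >= POLL_INTERVAL) {
            emit(UOP_BOI);          /* polling branch to the dispatcher  */
            since_poll = 0;
        }
        emit(decode_to_uop(insn));
    }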


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 10:10:00 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:

Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot simpler.

    What is the expected delay until an interrupt is delivered?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 06:29:55 2025
    From Newsgroup: comp.arch

    On 2025-11-30 5:10 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot
    simpler.

    What is the expected delay until an interrupt is delivered?

I set the timing to 16 clocks, which is about 64 (or more) instructions.
Did not want to go much over 1% of the number of instructions executed.
    Not every instruction inserts a poll, so sometimes a poll is lacking.
    IDK how well it will work. Making it an instruction means it might also
    be used by software.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 06:41:52 2025
    From Newsgroup: comp.arch

    On 2025-11-30 6:29 a.m., Robert Finch wrote:
    On 2025-11-30 5:10 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot >>> simpler.

    What is the expected delay until an interrupt is delivered?

I set the timing to 16 clocks, which is about 64 (or more) instructions.
Did not want to go much over 1% of the number of instructions executed.
    Not every instruction inserts a poll, so sometimes a poll is lacking.
    IDK how well it will work. Making it an instruction means it might also
    be used by software.

    Might be able to modify the branch predictor to predict the interrupt.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 14:14:16 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

void foo(unsigned long u1)
{
    long u3;
    u1 = u1 / 10;
    u3 = u1 % 10;
    bar(u1, u3);
}

    Assigning to u1 changed the meaning, as Andrew Pinski noted; so the
    jury is still out on what the actual problem is.

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .

    and a revised one at
    <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122919>

    (The announced attachment is not there yet.)

    The latter case is interesting, because real_ca and spc became global,
    and symbols[] is still local, and no assignment to real_ca happens
    inside foo().

    So one way the compiler could interpret this code might be that
    real_ca gets one of the labels whose address is taken in some way
    unknown to the compiler; the it has to preserve all the code reachable
    through the labels.

    Another way to interpret this code would be that symbols is not used,
    so it is dead and can be optimized away. Consequently, none of the
    addresses of any of the labels is ever taken, and the labels are not
    used by direct jumps, either, so all the code reachable only by
    jumping to the labels is unreachable and can be optimized away.

    Apparently gcc takes the latter attitude if there are <=100 labels in
    symbols, but maybe something like the former attitude if there are
more than 100 labels in symbols. This may appear strange, but gcc generally
    tends to produce good code in relatively short time for Gforth (while
    clang generates horribly slow code and takes extremely long in doing
    so), and my guess is that having such a cutoff on doing the usual
    analysis has something to do with gcc's superior performance.
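
For readers who have not seen the construct under discussion: Gforth's
engine dispatches through an array of gcc label addresses (the
labels-as-values extension), roughly like this minimal sketch (not
Gforth's actual code):

    #include <stdio.h>

    /* symbols[] holds label addresses; if the compiler proves it dead,
       everything reachable only through it can be removed too.         */
    int run(const int *prog)
    {
        static void *symbols[] = { &&op_inc, &&op_print, &&op_halt };
        int acc = 0;
        goto *symbols[*prog++];                /* dispatch first insn   */
    op_inc:   acc++;               goto *symbols[*prog++];
    op_print: printf("%d\n", acc); goto *symbols[*prog++];
    op_halt:  return acc;
    }

    int main(void)
    {
        const int prog[] = { 0, 0, 1, 2 };     /* inc, inc, print, halt */
        return run(prog) != 2;                 /* prints 2, exits 0     */
    }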

    I guess that if you treat symbols like in the original code (i.e.,
    return it in one case), you can reduce the labels more without the
    compiler optimizing everything away. I don't dare to predict when the
    compiler will stop generating the inefficient variant. Maybe it has
    to do with the cutoff.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 15:47:03 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

    void foo(unsigned long u1)
    {
    long u3;
    u1 = u1 / 10;
    u3 = u1 % 10;
    bar(u1,u3);
    }

    Assigning to u1 changed the meaning, as Andrew Pinski noted;

    An example which could be tested at run-time to verify correct
    operation was not provided, so I had to do without.

    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    cvise uses a user-supplied "interestingness script" which returns
    0 if the feature in question is there, or non-zero if it is
    not there. For relatively simple cases like an ICE, it
    can have two steps: a) check that compilation fails, and b)
check that the error message is output.

    Looking for a missed optimization is more difficult, especially
    in the absence of a run-time test. It is then necessary to

    a) check the source code that the interesting code is still there

    b) compile the code (exiting if this fails)

c) verify that the generated assembly still does the same

    a) and c) are very easy to get wrong, and there were numerous
    false reductions where cvise came up with something that the
    scripts didn't catch.


    so the
    jury is still out on what the actual problem is.

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .

    and a revised one at
    <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122919>

    (The announced attachment is not there yet.)

    The latter case is interesting, because real_ca and spc became global,
    and symbols[] is still local, and no assignment to real_ca happens
    inside foo().

    That is what cvise does. It sometimes reduces code more than a
    human would.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 15:18:21 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier? If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    Processor pipelines are not the basics of what a CS graduate is doing.
    They are an implementation detail in computer engineering.

    Which affect the performance of the software created by the
    software engineer (CS graduate).

    By a constant factor; and the software creator does not need to know
    that the CPU that executes instructions at 2 CPI (486) instead of at
    10 CPI (VAX-11/780) is pipelined; and these days both the 486 and the
    VAX are irrelevant to software creators.

    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

movabs $0xcccccccccccccccd,%rax      movabs $0xcccccccccccccccd,%rsi
sub    $0x8,%r13                     mov    %r8,%rax
mul    %r8                           mov    %r8,%rcx
mov    %rdx,%rax                     mul    %rsi
shr    $0x3,%rax                     shr    $0x3,%rdx
lea    (%rax,%rax,4),%rdx            lea    (%rdx,%rdx,4),%rax
add    %rdx,%rdx                     add    %rax,%rax
sub    %rdx,%r8                      sub    %rax,%r8
mov    %r8,0x8(%r13)                 mov    %rcx,%rax
mov    %rax,%r8                      mul    %rsi
                                     shr    $0x3,%rdx
                                     mov    %rdx,%r9

    The major difference is that in the left context, u3 is stored into
    memory (at 0x8(%r13)), while in the right context, it stays in a
register. In the left context, gcc managed to base its computation of u1%10 on the result of u1/10; in the right context, gcc first computes u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.
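
(What both columns implement is the standard reciprocal-multiplication
rewrite of unsigned division by 10; in C, using GCC's unsigned __int128
for the multiply-high:)

    /* u1/10 via multiply-high with ceil(2^67 / 10), then the remainder
       as u1 - 10*q (the lea/add/sub in the asm above).                 */
    unsigned long divmod10(unsigned long u1, unsigned long *rem)
    {
        unsigned long q = (unsigned long)
            (((unsigned __int128)u1 * 0xcccccccccccccccdUL) >> 64) >> 3;
        *rem = u1 - q * 10;
        return q;
    }

The left-hand column shares the single mul between quotient and
remainder; the right-hand column performs it twice.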

    Sort of emphasizes that programmers need to understand the
    underlying hardware.

I am the programmer of the code shown above. In what way would better knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    What were u1, u3 and u4 declared as?

    unsigned long (on that platform).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 16:39:41 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the
    surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    The result of compiling this with

    gcc -I./../arch/amd64 -I. -Wall -g -O2 -fomit-frame-pointer -pthread -DHAVE_CONFIG_H -DFORCE_LL -DFORCE_REG -DDEFAULTPATH='".:/usr/local/lib/gforth/site-forth:/usr/local/lib/gforth/0.7.9_20251119:/usr/local/share/gforth/0.7.9_20251119:/usr/share/gforth/site-forth:/usr/local/share/gforth/site-forth"' -c -fno-gcse -fcaller-saves -fno-defer-pop -fno-inline -fwrapv -fno-strict-aliasing -fno-cse-follow-jumps -fno-reorder-blocks -fno-reorder-blocks-and-partition -fno-toplevel-reorder -falign-labels=1 -falign-loops=1 -falign-jumps=1 -fno-delete-null-pointer-checks -fcf-protection=none -fno-tree-vectorize -fno-lto -pthread -DENGINE=2 -fPIC -DPIC -o libengine-fast2-ll-reg-red.S -S engine-fast-red.i

    can be found at

    http://www.complang.tuwien.ac.at/anton/tmp/libengine-fast2-ll-reg-red.S

    Now the multiplier is permanently allocated to %r11, so searching for
    it won't help. However, if you search for "mulq", you will find the
    code generated for the three instances of the VM instruction. The
    first is optimized well, the second exhibits two mulqs and two shrqs,
    the third exhibits just one mulq, but two shrqs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 18:59:15 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    Do you have an example which tests the codepath taken for the
    offending piece of code, so it is possible to further reduce this
    case automatically? The example is still quite big (>13000 lines).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 30 19:33:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded
    code.
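
The canonical illustration is the store-buffering litmus test; in C11
atomics (a textbook example, not tied to any poster's code), the
outcome r1 == r2 == 0 is permitted on weakly ordered hardware unless
the fences are present:

    #include <stdatomic.h>
    #include <threads.h>

    atomic_int x, y;
    int r1, r2;

    int t1(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* the Fence */
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return 0;
    }

    int t2(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* the Fence */
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return 0;
    }

    int main(void)
    {
        thrd_t a, b;
        thrd_create(&a, t1, 0);
        thrd_create(&b, t2, 0);
        thrd_join(a, 0);
        thrd_join(b, 0);
        return !(r1 | r2);  /* with the fences, at least one load sees 1 */
    }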

Processor pipelines are not the basics of what a CS graduate is doing. They are an implementation detail in computer engineering.

    Which affect the performance of the software created by the
    software engineer (CS graduate).

    By a constant factor; and the software creator does not need to know
    that the CPU that executes instructions at 2 CPI (486) instead of at
    10 CPI (VAX-11/780) is pipelined; and these days both the 486 and the
    VAX are irrelevant to software creators.

    I do not believe that the word "the" in front of x86 or VAX is proper.

    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

movabs $0xcccccccccccccccd,%rax      movabs $0xcccccccccccccccd,%rsi
sub    $0x8,%r13                     mov    %r8,%rax
mul    %r8                           mov    %r8,%rcx
mov    %rdx,%rax                     mul    %rsi
shr    $0x3,%rax                     shr    $0x3,%rdx
lea    (%rax,%rax,4),%rdx            lea    (%rdx,%rdx,4),%rax
add    %rdx,%rdx                     add    %rax,%rax
sub    %rdx,%r8                      sub    %rax,%r8
mov    %r8,0x8(%r13)                 mov    %rcx,%rax
mov    %rax,%r8                      mul    %rsi
                                     shr    $0x3,%rdx
                                     mov    %rdx,%r9

The major difference is that in the left context, u3 is stored into
memory (at 0x8(%r13)), while in the right context, it stays in a
register. In the left context, gcc managed to base its computation of
u1%10 on the result of u1/10; in the right context, gcc first computes
u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.

    Sort of emphasizes that programmers need to understand the
    underlying hardware.

I am the programmer of the code shown above. In what way would better knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    Reading and thinking about the asm-code and running the various code
    sequences enough times that you can measure which is better and which
    is worse. That is the engineering part of software Engineering.

    What were u1, u3 and u4 declared as?

    unsigned long (on that platform).

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Sun Nov 30 22:38:39 2025
    From Newsgroup: comp.arch

    On 2025-11-30 21:33, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded
    code.

    That is an aspect of processor architecture that is relevant to some programmers, but not to the large number of programmers who use
    languages or operating systems with built-in multi-threading and safe inter-thread communication primitives and services for input/output.

    I am the programmer of the code shown above. In what way would better
knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    Reading and thinking about the asm-code and running the various code sequences enough times that you can measure which is better and which
    is worse. That is the engineering part of software Engineering.

    That is a very niche part of software (performance) engineering. Speed
    of execution is only one of many "goodness" dimensions of a piece of SW, others including correctness, reliability, security, portability, maintainability, and so on. All dimensions need and depend on systematic engineering, although some dimensions cannot be quantified as easily as execution speed.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 22:11:26 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the
    surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    Do you have an example which tests the codepath taken for the
    offending piece of code,

    Not easily.

    so it is possible to further reduce this
    case automatically? The example is still quite big (>13000 lines).

    Most of which is coming from including stdlib.h etc. The actual code
    of the gforth_engine function in that example is 264 lines, many of
    which are empty or line number indicators.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 22:17:19 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    Good to know, but does not answer the question.

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
That is how and when they need to insert Fences in their multi-threaded code.

    And the relevance of pipelined processors for that issue is what?

    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude. If you implement per-CPU caches and multiple memory controllers as shoddily
    as possible while providing features for programs to slow themselves
down heavily in order to get memory-ordering guarantees, then you get
    a weak memory model; slightly less shoddy, and you get a "strong" memory
    model. Processor pipelines have no relevance here.

    And, as Niklas Holsti observed, dealing with memory-ordering
    shenanigans is something that a few specialists do; no need for others
    to know about the memory model, except that common CPUs unfortunately
    do not implement sequential consistency.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 00:12:15 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    Good to know, but does not answer the question.

    {Without contradicting that Wallace got on the correct track first}
    Wallace gets the credit that should rightly go to Dadda.

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

How execution order disturbs things like program order and memory order. That is how and when they need to insert Fences in their multi-threaded code.

    And the relevance of pipelined processors for that issue is what?

    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    If you implement per-CPU caches and multiple memory controllers as shoddily
    as possible while providing features for programs to slow themselves
down heavily in order to get memory-ordering guarantees, then you get
    a weak memory model; slightly less shoddy, and you get a "strong" memory model. Processor pipelines have no relevance here.

    It is the pipelines themselves (along with the SuperComputer attitude)
    that gives rise to the weak memory models.

    And, as Niklas Holsti observed, dealing with memory-ordering
    shenanigans is something that a few specialists do; no need for others
    to know about the memory model, except that common CPUs unfortunately
    do not implement sequential consistency.

    Because of the SuperComputer attitude ! {Performance first}

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Dec 1 07:56:37 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the
    slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).
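
As a minimal illustration of the point (a C11 sketch, not from the
original post; the names are invented): under a weak model the
release/acquire pair below costs real barrier work even when no second
thread ever touches the data; on SC hardware the same annotations could
retire as plain, already-ordered loads and stores.

    #include <stdatomic.h>

    atomic_int data;   /* invented shared payload */
    atomic_int ready;  /* invented publish flag   */

    void producer(void)
    {
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        /* on weak hardware this release emits a real barrier even if
           no consumer exists; on SC hardware it could be a noop      */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        if (atomic_load_explicit(&ready, memory_order_acquire))
            return atomic_load_explicit(&data, memory_order_relaxed);
        return -1;  /* not published yet */
    }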

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with   without
    trapb  trapb
    9.56   11.6     AlphaPC164LX 600MHz 21164A
    19.7   20.0     Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    Maybe they could free their hands by designing for a
    sequential-consistency interface, just like designing for a simple sequential-execution model without EPIC features freed their hands to
    design microarchitectural features that allowed ordinary code to
    utilize wider and wider OoO cores profitably.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Dec 1 13:23:22 2025
    From Newsgroup: comp.arch

    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?

More so, where does it simplify over ARMv8.1-A, assuming that the programmer
does not try to be too smart, never uses LL/SC, and always uses
8.1-style synchronization instructions with Acquire+Release flags set?

    IMHO, the only simple thing about sequential consistency is simple
    description. Other than that, it simplifies very little. It does not
    magically make lockless multithreaded programming bearable to
    non-genius coders.
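
For concreteness, the style Michael S describes might look like this (a
C11 sketch, not from the post; a compiler can map the RMW below to a
single ARMv8.1 LSE atomic with acquire semantics):

    #include <stdatomic.h>

    static atomic_flag lock_word = ATOMIC_FLAG_INIT;

    static void lock(void)
    {
        /* one swap-with-acquire per attempt; no explicit fences */
        while (atomic_flag_test_and_set_explicit(&lock_word,
                                                 memory_order_acquire))
            ;  /* spin */
    }

    static void unlock(void)
    {
        atomic_flag_clear_explicit(&lock_word, memory_order_release);
    }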


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Dec 1 14:07:34 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.
    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with   without
    trapb  trapb
    9.56   11.6     AlphaPC164LX 600MHz 21164A
    19.7   20.0     Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    The 21264 Hardware Reference Manual says TRAPB (general exception barrier)
    and EXCB (floating point control register barrier) are both NOP's
    internally, are tossed at decode, and don't even take up an
    instruction slot.

    The purpose of the EXCB is to synchronize pipeline access to the
    floating point control and status register with FP operations.
    In the worst case this stalls until the pipeline drains.

    I wonder how much logic it really saved allowing imprecise exceptions
    in the InO 21064 and 21164? Conversely, how much did it cost to deal
    with problems caused by leaving these interlocks off?

    The cores have multiple, parallel pipelines for int, lsq, fadd and fmul. Without exception interlocks, each pipeline only obeys the scoreboard
    rules for when to writeback its result register: WAW and WAR.
    That allows a younger, faster instruction to finish and write its register before an older, slower instruction. If that older instruction then throws
    an exception and does not write its register then we can see the out of
    order register writes.

    For register file writes to be precise in the presence of exceptions
requires each instruction to look ahead at the state of all older
    instructions *in all pipelines*.
    Each uOp can be Unresolved, Resolved_Normal, or Resolved_Exception.
    A writeback can occur if there are no WAW or WAR dependencies,
    and all older uOps are Resolved_Normal.

    Just off the top of my head, in addition to the normal scoreboard,
    a FIFO buffer with a priority selector could be used to look ahead
    at all older uOps and check their status, and allow or stall uOp
    writebacks and ensure registers always appear precise.
    Which really doesn't look that expensive.

    Is there something I missed, or would that FIFO suffice?
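
A sketch of that check (mine, in C; sizes and names are invented): walk
the in-order FIFO of older uOps and allow writeback only when every one
of them has resolved without an exception.

    #define FIFO_SIZE 64   /* invented depth */

    enum status { UNRESOLVED, RESOLVED_NORMAL, RESOLVED_EXCEPTION };

    struct uop { enum status st; /* dest reg, pipe id, ... */ };

    /* 'fifo' holds in-flight uOps oldest-first from 'head'; 'me' is
       the slot of the uOp that wants to write its result register.  */
    static int may_write_back(const struct uop fifo[FIFO_SIZE],
                              int head, int me)
    {
        for (int i = head; i != me; i = (i + 1) % FIFO_SIZE)
            if (fifo[i].st != RESOLVED_NORMAL)
                return 0;  /* stall: an older uOp is unresolved
                              or has faulted                     */
        return 1;          /* precise; WAW/WAR are still handled
                              by the ordinary scoreboard         */
    }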



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 22:50:15 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well,

    Depends on your definition of SC and "performs well", but see below::

    probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    In the case of My 66000, there is a slightly weak memory model
    (Causal consistency) for accesses to DRAM, and there is Sequential
    consistency for ATOMIC stuff and device control registers, and then
    there is strongly ordered for configuration space access, and the
    programmer does not have to do "jack" to get these orderings--
    its all programmed in the PTEs.

    {{There is even a way to make DRAM accesses SC should you want.}}

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with   without
    trapb  trapb
    9.56   11.6     AlphaPC164LX 600MHz 21164A   <- moderate slowdown
    19.7   20.0     Compaq XP1000 500MHz 21264   <- slowdown has disappeared

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    Maybe they could free their hands by designing for a
    sequential-consistency interface, just like designing for a simple sequential-execution model without EPIC features freed their hands to
    design microarchitectural features that allowed ordinary code to
    utilize wider and wider OoO cores profitably.

    That is not the property I was getting at--the property I was getting at
    is that the language model for synchronization can only use 1 memory
    location {TS, TTS, CAS, DCAS, LL, SC} and this fundamentally limits the
    amount of work one can do in a single event, and also fundamentally limits
    what one can "say" about a concurrent data structure.

    Given a certain amount of interference--the fewer ATOMIC things one has
    to do the lower the chance of interference, and the greater the chance
    of success. So, if one could move an element of a CDS from one location
    to another in one ATOMIC event rather than 2 (or 3) then the exponent
    of synchronization overhead goes down, and then one can make statements
    like "and no outside observer can see the CDS without that element present"--which cannot be stated with current models.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 23:03:24 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.
    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and imprecise exceptions, if you compile with trapb, you get slowness and precise exceptions. I then measured SPEC 95 compiled without and with trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264 there was hardly any difference; I believe that trapb is a noop on the 21264. Here's the SPECfp_base95 numbers:

    with   without
    trapb  trapb
    9.56   11.6     AlphaPC164LX 600MHz 21164A
    19.7   20.0     Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    The 21264 Hardware Reference Manual says TRAPB (general exception barrier) and EXCB (floating point control register barrier) are both NOP's
    internally, are tossed at decode, and don't even take up an
    instruction slot.

    The purpose of the EXCB is to synchronize pipeline access to the
    floating point control and status register with FP operations.
    In the worst case this stalls until the pipeline drains.

    I wonder how much logic it really saved allowing imprecise exceptions
    in the InO 21064 and 21164?

    Having done something similar in Mc 88100, I can state that the amount
of logic saved is too small to justify such naïvety.

    Conversely, how much did it cost to deal
    with problems caused by leaving these interlocks off?

    Way toooooo much. The SW delay to get all those things right cost more
    time than HW designers could have possibly saved leaving them out.

    The cores have multiple, parallel pipelines for int, lsq, fadd and fmul. Without exception interlocks, each pipeline only obeys the scoreboard
    rules for when to writeback its result register: WAW and WAR.
    That allows a younger, faster instruction to finish and write its register before an older, slower instruction. If that older instruction then throws
    an exception and does not write its register then we can see the out of
    order register writes.

    For register file writes to be precise in the presence of exceptions
requires each instruction to look ahead at the state of all older
    instructions *in all pipelines*.

    Or you use dead stages in the pipelines so instructions arrive at
RF write ports no earlier than their compatriots. You still have to
    look across all the delay slots for forwarding opportunities.

    Each uOp can be Unresolved, Resolved_Normal, or Resolved_Exception.
    A writeback can occur if there are no WAW or WAR dependencies,
    and all older uOps are Resolved_Normal.

    That is the scoreboard model. The Reservation station has a simpler
model by providing a unique register for each instruction (or µOp).

    Just off the top of my head, in addition to the normal scoreboard,
    a FIFO buffer with a priority selector could be used to look ahead
    at all older uOps and check their status,

    Such a block of logic is called a ReOrder Buffer.

    Given an architectural register file with 16-32 entries, and
    given a reorder buffer of 96+ entries--if you integrate both
    ARF and RoB into a single structure you call it a physical
    register file. A PRF is just a RoB that is big enough never
    to have to migrate registers to the ARF.

    and allow or stall uOp
    writebacks and ensure registers always appear precise.
    Which really doesn't look that expensive.

    Is there something I missed, or would that FIFO suffice?

    If the FiFo is big enough, it works just fine; if you scrimp on
    the FiFo, you will want to play games with orderings to make it
    faster.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Dec 2 07:10:16 2025
    From Newsgroup: comp.arch

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is it does not go backwards to
get the address fetched again from the TLB, meaning no check is made for protection or translation of the address.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    This should only be an issue if an unaligned access crosses a memory
    page boundary.

The instruction causes an alignment fault if a page-boundary crossing is detected.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Dec 2 18:50:12 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    The issue with this is it does not go backwards to
get the address fetched again from the TLB, meaning no check is made for protection or translation of the address.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    Unaligned access on a page boundary is extremely slow on the Core 2
    Duo (IIRC 160 cycles for a store). So don't be shy:-)

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 2 19:55:43 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is it does not go backwards to
get the address fetched again from the TLB, meaning no check is made for protection or translation of the address.

You can determine if an access is misaligned "enough" to warrant two
    trips down the pipe.
    a) crosses cache width
    b) crosses page boundary

Case b ALWAYS needs 2 trips; so the mechanism HAS to be there.
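
In C terms the two tests are just carry-out checks on the low address
bits (a sketch, assuming 64-byte lines and 4 KiB pages):

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_SIZE 64u
    #define PAGE_SIZE 4096u

    static inline bool crosses_line(uint64_t addr, unsigned size)
    {
        return (addr & (LINE_SIZE - 1)) + size > LINE_SIZE;   /* case a */
    }

    static inline bool crosses_page(uint64_t addr, unsigned size)
    {
        /* case b: needs a second TLB lookup, hence a second trip */
        return (addr & (PAGE_SIZE - 1)) + size > PAGE_SIZE;
    }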

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

An AGEN-like adder has 11 gates of delay; you can determine misalignment
in 4 gates of delay.

    This should only be an issue if an unaligned access crosses a memory
    page boundary.

    Here you need to access the TLB twice.

    The instruction causes an alignment fault if a page cross boundary is detected.

    probably not as wise as you think.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Dec 2 21:20:33 2025
    From Newsgroup: comp.arch

    On 2025-12-02 2:55 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is it does not go backwards to
get the address fetched again from the TLB, meaning no check is made for
    protection or translation of the address.

You can determine if an access is misaligned "enough" to warrant two
    trips down the pipe.
    a) crosses cache width
    b) crosses page boundary

Case b ALWAYS needs 2 trips; so the mechanism HAS to be there.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

An AGEN-like adder has 11 gates of delay; you can determine misalignment
in 4 gates of delay.

    I was thinking in terms of clock cycles. The recalc of the address could
be triggered by resetting bits in the reorder buffer, which causes the instruction to be re-dispatched. I am not sure how many clocks, but
    likely a minimum of four or five. Memory access is sequential, so it
    will stall other accesses too.

    I have a tendency not to think about the gate delays too much, until
    they appear on the timing path. The lookup tables can absorb a good
chunk of gate delay.


    This should only be an issue if an unaligned access crosses a memory
    page boundary.

    Here you need to access the TLB twice.

    The instruction causes an alignment fault if a page cross boundary is
    detected.

    probably not as wise as you think.

    I coded it so it makes two trips to the TLB now for page boundaries (in theory). I got to thinking that maybe the page size could be made huge
    to avoid page crossings.

    I may need to put more logic in to ensure the same load store queue slot
    is used. I think it should work since things are sequential.

    My toy is broken. It is taking too long to synthesize. Qupls is so
    complex now. I may pick something simpler.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Thu Dec 4 16:54:56 2025
    From Newsgroup: comp.arch

    In article <20251201132322.000051a5@yahoo.com>,
    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [snip]
    If hardware designers put their mind to it, they could make sequential
    consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?

More so, where does it simplify over ARMv8.1-A, assuming that the programmer
does not try to be too smart, never uses LL/SC, and always uses
8.1-style synchronization instructions with Acquire+Release flags set?

IMHO, the only simple thing about sequential consistency is simple
description. Other than that, it simplifies very little. It does not
magically make lockless multithreaded programming bearable to
    non-genius coders.

    Compiler writers have hidden behind the hardware complexity to make
    writing source code that is thread-safe much harder than it should be.
    If you have to support placing hardware barriers, then the languages
    can get away with needing lots of <atomic> qualifiers everywhere, even
    on systems which don't need barriers, making the code more complex. And language purists still love to sneer at volatile in C-like languages as "providing no guarantees, and so is essentially useless"--when volatile providing no guarantees is a language and compiler choice, not something written in stone. A bunch of useful algorithms could be written with
    merely "volatile" like semantics, but for some reason, people like the line-noise-like junk of C++ atomics, where rather than thinking in terms
    of the algorithm, everyone needs to think in terms of release and acquire. (Which are weakly-ordering concepts).

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 4 18:37:54 2025
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <20251201132322.000051a5@yahoo.com>,
    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [snip]
    If hardware designers put their mind to it, they could make sequential
    consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?

More so, where does it simplify over ARMv8.1-A, assuming that the programmer
does not try to be too smart, never uses LL/SC, and always uses
8.1-style synchronization instructions with Acquire+Release flags set?

IMHO, the only simple thing about sequential consistency is simple
description. Other than that, it simplifies very little. It does not
magically make lockless multithreaded programming bearable to
    non-genius coders.

    Compiler writers have hidden behind the hardware complexity to make
    writing source code that is thread-safe much harder than it should be.

    Blaming the wrong people.

    If you have to support placing hardware barriers, then the languages
    can get away with needing lots of <atomic> qualifiers everywhere, even
    on systems which don't need barriers, making the code more complex. And

    Thread-safe, by definition, is (IS) harder.

    language purists still love to sneer at volatile in C-like languages as "providing no guarantees, and so is essentially useless"--when volatile providing no guarantees is a language and compiler choice, not something written in stone.

The problem with volatile is that all it means is that every time a volatile
variable is touched, the code has to have a corresponding LD or ST. The HW
ends up knowing nothing about the value's volatility and ends up in no
position to help.
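
Concretely (a sketch; the register address is invented), volatile only
obliges the compiler to re-issue the access each time:

    #include <stdint.h>

    #define READY 0x1u

    /* invented memory-mapped status register */
    static volatile uint32_t * const status =
        (volatile uint32_t *)0x4000A000u;

    void wait_ready(void)
    {
        /* each iteration must emit a fresh LD of *status; without
           volatile the compiler could hoist the load and spin forever */
        while ((*status & READY) == 0)
            ;
    }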

    A bunch of useful algorithms could be written with
    merely "volatile" like semantics, but for some reason, people like the line-noise-like junk of C++ atomics, where rather than thinking in terms
    of the algorithm, everyone needs to think in terms of release and acquire. (Which are weakly-ordering concepts).

    As far as ATOMICs go:: until you can code a single ATOMIC event that moves
    an element of a concurrent data structure from one place to another in a
    single event, you are thinking too SMALL (4-pointers in 4 different cache lines).

In addition, the code should NOT have to test for success/failure, but
    be defined in such a way that if you get here success is known and if
    you get there, failure is known.

    Kent

    Mitch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 11:10:22 2025
    From Newsgroup: comp.arch

    On 04/12/2025 19:37, MitchAlsup wrote:

    kegs@provalid.com (Kent Dickey) posted:


    Thread-safe, by definition, is (IS) harder.

    language purists still love to sneer at volatile in C-like languages as
    "providing no guarantees, and so is essentially useless"--when volatile
    providing no guarantees is a language and compiler choice, not something
    written in stone.

The problem with volatile is that all it means is that every time a volatile
variable is touched, the code has to have a corresponding LD or ST. The HW
ends up knowing nothing about the value's volatility and ends up in no
position to help.


    "volatile" /does/ provide guarantees - it just doesn't provide enough guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever. But
    you need volatile semantics for atomics and fences as well - there's no
    point in enforcing an order at the hardware level if the accesses can be re-ordered at the software level!

    "volatile" on its own is therefore not sufficient for atomics on big
    modern processors. But it /is/ sufficient for some uses, such as
    accessing hardware registers, or for small atomic loads and stores on
    single processor systems (which are far and away the biggest market, as embedded microcontrollers).

    As I see it, the biggest problem with "volatile" in C is
    misunderstandings and misuse of all sorts. At least, that's what I see
    in my field of embedded development.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Dec 5 14:37:57 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),
    and I don't think that C with just volatile gives you such guarantees.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 18:29:48 2025
    From Newsgroup: comp.arch

    On 05/12/2025 15:37, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".

    It says a good deal about the ordering at the C level - but nothing
    about it at the memory level.

    I know very little about the MMU setups on "big" systems like the x86-64 world. But in the embedded microcontroller world, it is very common for
    areas of the memory map to have sequential consistency even if other
    areas can be re-ordered, cached, or otherwise jumbled around. Thus for memory-mapped peripheral areas, memory accesses are kept strictly in
    order and "volatile" is all you need.

    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core machines.

    Sure. Of course multi-core systems will not have that hardware
    guarantee, at least not on main memory, for performance reasons. So
    there you need something more than just C "volatile" to force specific orderings. But volatile semantics will still be needed in many cases.
    Thus "volatile" is not sufficient, but it is still necessary. Usually,
    of course, all necessary "volatile" qualifiers are included in OS or
    library macros or functions for anything that needs them for locks or inter-process communication and the like. (In Linux, you have the
    READ_ONCE and WRITE_ONCE macros, which are just wrappers forcing
    volatile accesses.)


    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),
    and I don't think that C with just volatile gives you such guarantees.


    Correct.

    Getting this wrong is one of the problems I have seen with volatile
    usage in embedded systems. I've seen people assuming that declaring "x"
    as "volatile" means that "x++;" is an atomic operation, or that volatile
    alone lets you share 64-bit data between threads on a 32-bit processor.

    Used correctly, it /can/ be enough for shared data between pre-emptive
    threads or a main loop and interrupts on a single core system. But
    sometimes you need to do more (for microcontrollers, that usually means disabling interrupts for a short period).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 5 17:57:48 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that >affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    and I don't think that C with just volatile gives you such guarantees.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 20:10:11 2025
    From Newsgroup: comp.arch

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor. Even with a
    single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.
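
In C11 terms (a sketch of the first two items on that list):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_long counter;

    void hit(void)
    {
        atomic_fetch_add(&counter, 1);  /* atomic increment */
    }

    bool try_claim(atomic_long *slot, long expected, long mine)
    {
        /* compare-and-swap: succeeds only if *slot still holds
           'expected'; on failure 'expected' gets the observed value */
        return atomic_compare_exchange_strong(slot, &expected, mine);
    }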


    and I don't think that C with just volatile gives you such guarantees.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 5 20:54:00 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type_t oldp, type_t oldq,
              type_t *p,   type_t *q,
              type_t newp, type_t newq )
{
    type_t t = esmLOCKload( *p );
    type_t r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.
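
(For callers, either routine is typically wrapped in a retry loop; a
sketch using the DCAS above, assuming type_t is an integer handle:)

    /* atomically replace *p while bumping the ABA counter in *q */
    void set_with_aba(type_t *p, type_t *q, type_t newval)
    {
        type_t oldp, oldq;
        do {
            oldp = *p;
            oldq = *q;
        } while( !DCAS( oldp, oldq, p, q, newval, oldq + 1 ) );
    }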

    Even with a single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 5 14:55:36 2025
    From Newsgroup: comp.arch

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type_t oldp, type_t oldq,
              type_t *p,   type_t *q,
              type_t newp, type_t newq )
{
    type_t t = esmLOCKload( *p );
    type_t r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

    [...]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 5 15:03:53 2025
    From Newsgroup: comp.arch

    On 12/5/2025 11:10 AM, David Brown wrote:
    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

It's strange with double-word compare-and-swap (DWCAS), where the words
are contiguous. I have seen compilers say it's not lock-free even
on x86: for a 32-bit system we have cmpxchg8b, for a 64-bit system
cmpxchg16b. But the compiler reports not lock-free. Strange.

    using cmpxchg instead of xadd:
    https://forum.pellesc.de/index.php?topic=7167.0

    trying to tell me that a DWCAS is not lock free: https://forum.pellesc.de/index.php?topic=7311.msg27764#msg27764

    This should be lock-free on an x86, even x64:

struct ct_proxy_dwcas
{
    struct ct_proxy_node* node;
    intptr_t count;
};

    some of my older code:

AC_SYS_APIEXPORT
int AC_CDECL
np_ac_i686_atomic_dwcas_fence
(   void*,          /* destination (8-byte aligned)  */
    void*,          /* comparand; updated on failure */
    const void* );  /* exchange value                */


np_ac_i686_atomic_dwcas_fence PROC
    push esi
    push ebx
    mov esi, [esp + 16]            ; comparand
    mov eax, [esi]                 ; EDX:EAX = expected value
    mov edx, [esi + 4]
    mov esi, [esp + 20]            ; exchange value
    mov ebx, [esi]                 ; ECX:EBX = new value
    mov ecx, [esi + 4]
    mov esi, [esp + 12]            ; destination
    lock cmpxchg8b qword ptr [esi] ; if [esi] == EDX:EAX, store ECX:EBX
    jne np_ac_i686_atomic_dwcas_fence_fail
    xor eax, eax                   ; success: return 0
    pop ebx
    pop esi
    ret

np_ac_i686_atomic_dwcas_fence_fail:
    mov esi, [esp + 16]
    mov [esi + 0], eax             ; write observed value back to comparand
    mov [esi + 4], edx
    mov eax, 1                     ; failure: return 1
    pop ebx
    pop esi
    ret
np_ac_i686_atomic_dwcas_fence ENDP


    Even with a
    single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees.

    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 00:40:11 2025
    From Newsgroup: comp.arch

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
to go RISC-V style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Dec 6 07:26:24 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical
    register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to
    potential read ports), you may prefer a different representation of 0
    in the uops.
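
A sketch of that renamer variant (mine; the structure is invented):
architectural r0 maps to one physical register that is wired to zero
and never handed out by the free list.

    #define PZERO 0  /* physical register hard-wired to zero */

    struct renamer {
        int map[32];  /* architectural -> physical */
        /* free list etc. omitted */
    };

    static int rename_src(const struct renamer *r, int areg)
    {
        return (areg == 0) ? PZERO : r->map[areg];
    }

    static int rename_dst(struct renamer *r, int areg, int fresh_preg)
    {
        if (areg == 0)
            return PZERO;  /* a write to r0 is discarded; PZERO is
                              never allocated, so it stays zero    */
        r->map[areg] = fresh_preg;
        return fresh_preg;
    }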

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 05:13:01 2025
    From Newsgroup: comp.arch

    On 2025-12-06 2:26 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
    in the uops.

    - anton

    Thanks,

    It should have occurred to me to do this at the decode stage. Constants
    are decoded and passed along for all register fields in decode. There
    are only four decoders fortunately.

    Switching the ISA back to having r0 as zero all the time.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Dec 6 14:42:13 2025
    From Newsgroup: comp.arch

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


The functions below rely on more than that - to make them work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't require additional hardware.


    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.


    Of course. It is interesting to speculate about possible features of an architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a
    single core system you can have pre-emptive multi-threading, or at least
    interrupt routines that may need to cooperate with other tasks on data.


and I don't think that C with just volatile gives you such guarantees.
    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Dec 6 17:16:11 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
required as a hardware atomic operation.

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes. Combined? Too inflexible.

    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.
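
A reference count is the textbook case for the add/sub pair; a minimal
C11 sketch, illustrative only:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Classic refcount: relaxed add on acquiring a reference; acq_rel on
       the final decrement so the object teardown is properly ordered. */
    static atomic_long refs = 1;

    void obj_get(void) { atomic_fetch_add_explicit(&refs, 1, memory_order_relaxed); }

    bool obj_put(void)  /* true when the caller must free the object */
    {
        return atomic_fetch_sub_explicit(&refs, 1, memory_order_acq_rel) == 1;
    }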
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:22:55 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.

    However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

A bit hard to tell, because of 2 things::
a) I carry around the thread priority, and when interference occurs,
   the higher priority thread wins--on ties, the already-accessing thread wins.
b) live-lock is resolved (or not) by the caller of these routines, not
   by these routines themselves.

    [...]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:29:53 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be degenerate.
    That is, R0 is not needed at all.
    ADD   R9,R7,R0        // is a MOV instruction
    AND   R9,R7,R0        // is a CLR instruction

    So, you don't have to treat R0 in bypassing, but as Operand processing.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of places. Otherwise r0 can be used as an ordinary register. Load / store instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    AGEN Rbase ==R0 implies Rbase = IP
    AGEN Rindex==R0 implies Rindex = 0
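
In C terms, that AGEN convention amounts to something like the sketch
below (a hedged illustration; the function and array names are made up):

    /* Sketch of address generation with the conventions above: base
       register 0 substitutes the instruction pointer, index register 0
       substitutes zero. Illustrative only. */
    typedef unsigned long addr_t;

    addr_t agen(addr_t ip, const addr_t reg[32],
                int rbase, int rindex, int scale, long disp)
    {
        addr_t base  = (rbase  == 0) ? ip : reg[rbase];
        addr_t index = (rindex == 0) ? 0  : reg[rindex];
        return base + (index << scale) + disp;
    }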

    I hit this trying to decide where to bypass another register code to represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
So, R0 gets forwarded just as often (or lack thereof) as any joe-random register.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:31:43 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
    in the uops.

    Another way to implement R0 is to have an AND gate after the Operand
flip-flop, and if <whatever> was captured is R0, then AND with 0,
otherwise AND with 1.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:44:30 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage >>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >>>> on hardware with weaker memory ordering than sequential consistency". >>>> If hardware guaranteed sequential consistency, volatile would provide >>>> guarantees that are as good on multi-core machines as on single-core >>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


The functions below rely on more than that - to make them work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot
    bigger than the size of a single register, not that the above instructions
    make writing ATOMIC events easier.

There is no bus!

    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

So, there is no way to write Test-and-Set!! You get Test-and-Test-and-Set
for free.

    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.
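
Put together, a try-acquire written against these primitives might look
roughly like this; a sketch only, reusing the esm intrinsics as they
appear in the examples in this thread (the Lock type is invented here):

    /* The esmLOCKload both begins the ATOMIC event and performs the
       "test"; the esmLOCKstore commits only if no interference was
       detected, otherwise control transfers to the event's control point. */
    typedef struct { long word; } Lock;

    BOOLEAN TryAcquire( Lock *l )
    {
        long v = esmLOCKload( l->word );   /* begin event, monitor the line */
        if( v == 0 )
        {
            esmLOCKstore( l->word, 1 );    /* commit: lock acquired */
            return TRUE;
        }
        return FALSE;                      /* held by someone else; no store */
    }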

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't require additional hardware.

I am using the "Miss Buffer" as the point of monitoring for interference.
a) it already has to monitor "other hits" from outside accesses, to deal
   with the coherence mechanism.
b) the esm additions to the Miss Buffer are on the order of 2%.
c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a
single core system you can have pre-emptive multi-threading, or at least
interrupt routines that may need to cooperate with other tasks on data.


and I don't think that C with just volatile gives you such guarantees.
    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 18:07:50 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes. Combined? Too inflexible.

BOOLEAN InsertElement( Element *el, Element *to )
{
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        el->next = tn;
        el->prev = to;
        to->next = el;
        esmLOCKstore( tn->prev, el );
        return TRUE;
    }
    return FALSE;
}

BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Dec 6 19:04:09 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:


    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.


    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

BOOLEAN InsertElement( Element *el, Element *to )
{
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        el->next = tn;
        el->prev = to;
        to->next = el;
        esmLOCKstore( tn->prev, el );
        return TRUE;
    }
    return FALSE;
}

BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

Most extant SMP processors provide a compare and swap operation, which
is widely supported by the common compilers that support the C and C++
threading functionality.
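
For reference, the portable C11 spelling of that compare-and-swap; a
minimal lock-free stack push, illustrative rather than taken from any
particular codebase:

    #include <stdatomic.h>

    struct node { struct node *next; };

    /* Treiber-stack push: the compiler lowers the CAS loop to CMPXCHG,
       LDXR/STXR, LR/SC, ... whatever the target provides. A failed CAS
       refreshes 'old' with the current head. */
    static void push(_Atomic(struct node *) *head, struct node *n)
    {
        struct node *old = atomic_load_explicit(head, memory_order_relaxed);
        do {
            n->next = old;
        } while (!atomic_compare_exchange_weak_explicit(
                     head, &old, n,
                     memory_order_release, memory_order_relaxed));
    }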
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Dec 6 21:36:27 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

Most extant SMP processors provide a compare and swap operation, which
is widely supported by the common compilers that support the C and C++
threading functionality.

    Interestingly, Linux restartable sequences allow for acquisition of
    a lock with no membarrier or atomic instruction on the fast path,
    at the cost of a syscall on the slow path (no free lunch...)

    But you also need assembler to do it.

An example is at https://gitlab.ethz.ch/extra_projects/cpu-local-lock
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 21:44:17 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:


    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.


    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

BOOLEAN InsertElement( Element *el, Element *to )
{
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        el->next = tn;
        el->prev = to;
        to->next = el;
        esmLOCKstore( tn->prev, el );
        return TRUE;
    }
    return FALSE;
}

BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than what already exists?!?

Most extant SMP processors provide a compare and swap operation, which
is widely supported by the common compilers that support the C and C++
threading functionality.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 18:33:55 2025
    From Newsgroup: comp.arch

    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be degenerate. That is, R0 is not needed at all.
    ADD   R9,R7,R0        // is a MOV instruction
    AND   R9,R7,R0        // is a CLR instruction

We don't want no degenerate instructions.

    So, you don't have to treat R0 in bypassing, but as Operand processing.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store
    instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    AGEN Rbase ==R0 implies Rbase = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
    Rbase = r0 bypasses to 0
    Rindex = r0 bypasses to 0
    Rbase = r31 bypasses to IP
    Bypassing r0 for both base and index allows absolute addressing mode.
    Otherwise r0, r31 are general-purpose regs.

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
So, R0 gets forwarded just as often (or lack thereof) as any joe-random register.

    Qupls has IP offset constant loading.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 18:55:17 2025
    From Newsgroup: comp.arch

    On 2025-12-06 6:33 p.m., Robert Finch wrote:
    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be
    degenerate.
    That is, R0 is not needed at all.
    ADD   R9,R7,R0        // is a MOV instruction
    AND   R9,R7,R0        // is a CLR instruction

We don't want no degenerate instructions.

    So, you don't have to treat R0 in bypassing, but as Operand processing.
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store
    instructions cannot use r0 as a GPR then, but it works for the PowerPC.

AGEN Rbase ==R0 implies Rbase  = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
Rbase  = r0 bypasses to 0
Rindex = r0 bypasses to 0
Rbase  = r31 bypasses to IP
Bypassing r0 for both base and index allows absolute addressing mode.
Otherwise r0, r31 are general-purpose regs.

    I hit this trying to decide where to bypass another register code to
represent the instruction pointer. In that case I think it may be better
to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
So, R0 gets forwarded just as often (or lack thereof) as any joe-random
    register.

    Qupls has IP offset constant loading.



No sooner had I updated the spec than I added two more opcodes to
perform loads and stores using IP relative addressing. That way there is
no need to use r31, leaving 31 registers completely general purpose. I
want to cast some aspects of the ISA in stone, or it will never get
anywhere.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 7 03:29:05 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-12-06 6:33 p.m., Robert Finch wrote:
    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
count. Removing the bypassing of r0 from the register file shaved 1000
LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be
    degenerate.
    That is, R0 is not needed at all.
    ADD   R9,R7,R0        // is a MOV instruction
    AND   R9,R7,R0        // is a CLR instruction

We don't want no degenerate instructions.

So, you don't have to treat R0 in bypassing, but as Operand processing.
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of
places. Otherwise r0 can be used as an ordinary register. Load / store
instructions cannot use r0 as a GPR then, but it works for the PowerPC.
AGEN Rbase ==R0 implies Rbase  = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
Rbase  = r0 bypasses to 0
Rindex = r0 bypasses to 0
Rbase  = r31 bypasses to IP
Bypassing r0 for both base and index allows absolute addressing mode.
Otherwise r0, r31 are general-purpose regs.

    I hit this trying to decide where to bypass another register code to
represent the instruction pointer. In that case I think it may be better
to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
So, R0 gets forwarded just as often (or lack thereof) as any joe-random
register.

    Qupls has IP offset constant loading.



No sooner had I updated the spec than I added two more opcodes to
perform loads and stores using IP relative addressing. That way there is
no need to use r31, leaving 31 registers completely general purpose. I
want to cast some aspects of the ISA in stone, or it will never get
anywhere.

    Cast some elements in plaster--this will hold for a few years until
    you find the bigger mistakes, then demolish the plaster and fix the
    parts that don't work so well.

    After 6 years of essential stability, I did a major update to My 66000
    ISA last month. The new ISA is ASCII compatible with the last, but not
    at the binary level, which solves several problems and saves another
    2%-4% in code footprint.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 7 09:30:50 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

This would likely be hidden in a header, and need only be
written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for
microarchitecture specific features, if the need for performance
gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation, which
is widely supported by the common compilers that support the C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Dec 7 16:05:32 2025
    From Newsgroup: comp.arch

    On Sun, 7 Dec 2025 09:30:50 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers
    will be unlikely to generate them, thus applications that desired
    the generation of such an instruction would need to create a
    compiler extension (like gcc __builtin functions) or inline
    assembler which would then make the program that uses the
    capability both compiler specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for
microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.


    By SAP HANA, I assume.
    Not sure for how long it was true. It sounds very unlikely that it is
    still true.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation,
which is widely supported by the common compilers that support the
    C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.

    TSX is close to dead.

ARM's TME was announced almost 5 years ago. AFAIK, there were no
implementations. Recently ARM said that FEAT_TME is obsoleted. It sounds
like the whole thing is dead, but there is a small chance that I am
misinterpreting.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Dec 7 16:13:06 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:



Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than what already exists?!?

Long experience. Back in the early 80's we had fancy instructions for
searching linked lists (up to 100 digit or byte keys; comparisons for
equal, ne, lt, gt, lte, gte, and any-bit-equal). They took special
language support to use, which meant they weren't usable from COBOL
without extensions. We also had Lock, Unlock and condition variable
instructions (with a small microkernel to handle the contention cases,
trapping on acquisition failure, release [when another thread was
pending], and event signal). Perhaps ahead of its time, as most of the
common languages (COBOL and Fortran) had no syntactical support for
them. We used them in the OS language (SPRITE), but they never got
traction in applications (and then the entire computer line was
discontinued in 1991).

    That's not to suggest that your innovations aren't potentially useful
    or an interesting take on multithreaded instruction primitives;
    just that idealism and the real world are often incompatible :-)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Dec 7 16:28:41 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for
microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    The ARM spec has been published. I'm not aware of any implementations
    of it to date, and the spec had been available to architecture partners
    for several years prior to 2022.

Intel's TSX support seems to be restricted to a subset of Xeon processors,
and it's not clear how well it's supported by non-Intel compilers.

    AMD has never released their Advanced Synchronization Facility in any
    processor to date.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 7 16:55:26 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 7 Dec 2025 09:30:50 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers
    will be unlikely to generate them, thus applications that desired
    the generation of such an instruction would need to create a
    compiler extension (like gcc __builtin functions) or inline
    assembler which would then make the program that uses the
    capability both compiler specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
    in this respecct). And people have been doing this, even for
    microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.


    By SAP HANA, I assume.
    Not sure for how long it was true. It sounds very unlikely that it is
    still true.

    https://www.redhat.com/en/blog/red-hat-enterprise-linux-performance-results-5th-gen-intel-xeon-scalable-processors
    from 2024 has benchmarks with TSX for SAP/HANA, and the processors
    (5th generation Xeon) at least pretend to have TSX.

    https://community.sap.com/t5/technology-blog-posts-by-sap/seamless-scaling-of-sap-hana-on-intel-xeon-processors-from-micro-to-mega/ba-p/13968648
    (almost a year old) writes

    "Intel's Transactional Synchronization Extensions (TSX), also
    implemented into the SAP HANA database, further enhances this
    scalability and offers a significant performance boost for critical
    HANA database operations."

    which does not read "required", but certainly sounds like it is an
    advantage.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation,
which is widely supported by the common compilers that support the
    C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.

    TSX is close to dead.

    For general-purpose computers, it seems the security implications
    killed it. An SAP server is a different matter; if you don't trust
    the software you are running there, you have other issues.


ARM's TME was announced almost 5 years ago. AFAIK, there were no
implementations. Recently ARM said that FEAT_TME is obsoleted. It sounds
like the whole thing is dead, but there is a small chance that I am
misinterpreting.

    Maybe restartable sequences are the way to go for lock-free
    critical sections. Not sure if everybody is aware of these. A good introduction can be found at https://lwn.net/Articles/883104/ .
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Dec 7 12:19:34 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    scott@slp53.sl.home (Scott Lurndal) posted:
    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
required as a hardware atomic operation.
    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.

Atomically moving an object from one doubly linked list to another,
like when a thread wakes up and moves from the waiting list to the
ready list.

One iteration of balancing a binary tree (AVL, red-black).

Plus the data structs above might straddle cache lines, so however many
objects there are, there could be twice that many lines being updated
at once.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Dec 7 17:48:50 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?

More so, where does it simplify over ARMv8.1-A, assuming that the
programmer does not try to be too smart and never uses LL/SC and always
uses 8.1-style synchronization instructions with Acquire+Release flags
set?

IMHO, the only simple thing about sequential consistency is its simple
description. Other than that, it simplifies very little. It does not
magically make lockless multithreaded programming bearable to
non-genius coders.

    Is single-core multi-threaded programming bearable to non-genius
    programmers? I think so. Sequential consistency plus atomic sequences
    (where the single-core program disables interrupts to start an atomic
    sequence and enables them to end an atomic sequence) gives the same
    programming model.
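
Concretely, this is what sequential consistency buys in the classic
store/load (Dekker-style) litmus test; a small C11 illustration, where
these operations default to memory_order_seq_cst:

    #include <stdatomic.h>

    /* Under sequential consistency at least one of the two loads must
       observe 1, so both threads returning 0 is impossible. Under
       relaxed or even acquire/release ordering, both may read 0. */
    _Atomic int x = 0, y = 0;

    int thread1(void) { atomic_store(&x, 1); return atomic_load(&y); }
    int thread2(void) { atomic_store(&y, 1); return atomic_load(&x); }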

    Concerning synchronization instructions and memory barriers of
    architectures with weaker memory models, their main problem is that
    they are implemented slowly, because the idea is to make only the
    weaker memory model go fast, and then suffer what you must if you need
    more guarantees. Already the guarantee makes them slow, not just the
    actual synchronization case. This makes the memory model hard to use,
    because you want to minimize the use of these instructions. And
    that's where the need for genius-level coding comes in.

    As for the size of the description, IMO this reflects on the
simplicity of programming. ARM's memory model was advertised here as:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. If it is
    simple to program, why does it need 32 pages of description?

    Concerning non-genius coders and coders that are not experts in memory
ordering models, the current setup seems to be designed to have a few
    people who program system software that does such things, and
    everybody else should just use this software (whether it's system
    calls or libraries). That's ok if the need to communicate between
    threads is rare, but not so great if it is frequent (especially the
    system-call variant). And if the need to communicate between threads
    is rare, it's also good enough if the hardware features for that need
    are slow. So maybe this whole setup is good enough.

    OTOH, maybe there are applications that could potentially use multiple
    threads that are currently using sequential programs or context
    switching within a hardware thread (green threads and the like)
    because the communication between the threads is too slow and making
    it faster is too hard to program. In that case the underutilization
    of many of the multi-core CPUs that we have may be due to this
    phenomenon. If so, the argument that it's too expensive in hardware
    resources to implement sequential consistency in hardware well does
    not hold: Is it more expensive than implementing an 8-core CPU where 6 or 7 cores are usually not utilized?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 14:51:01 2025
    From Newsgroup: comp.arch

    On 12/5/2025 3:03 PM, Chris M. Thomasson wrote:
    On 12/5/2025 11:10 AM, David Brown wrote:
    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

It's strange with double-word compare and swap (DWCAS), where the words
are contiguous. Well, I have seen compilers say it's not lock-free even
on x86. For a 32 bit system we have cmpxchg8b; for a 64 bit system,
cmpxchg16b. But the compiler reports not lock free. Strange.

    using cmpxchg instead of xadd: https://forum.pellesc.de/index.php?topic=7167.0

    trying to tell me that a DWCAS is not lock free: https://forum.pellesc.de/index.php?topic=7311.msg27764#msg27764

    This should be lock-free on an x86, even x64:

struct ct_proxy_dwcas
{
    struct ct_proxy_node* node;
    intptr_t count;
};

Ideally, struct ct_proxy_dwcas should be aligned on an L2 cache line and
padded up to the size of a cache line.
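
One way (of several) to spell such a DWCAS today is GCC/Clang's __atomic
builtins; a sketch assuming x86-64 built with -mcx16 so the compiler can
emit LOCK CMPXCHG16B (and may still decline to report it as lock-free,
which is exactly the oddity described above). The dwcas wrapper is
invented for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    struct ct_proxy_node;               /* opaque here */

    struct ct_proxy_dwcas {
        struct ct_proxy_node *node;
        intptr_t count;
    } __attribute__((aligned(16)));     /* 16-byte alignment for cmpxchg16b */

    static bool dwcas(struct ct_proxy_dwcas *dst,
                      struct ct_proxy_dwcas *cmp,   /* updated on failure */
                      struct ct_proxy_dwcas xchg)
    {
        return __atomic_compare_exchange(dst, cmp, &xchg,
                                         false,     /* strong variant */
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }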




    some of my older code:

    AC_SYS_APIEXPORT
    int AC_CDECL
    np_ac_i686_atomic_dwcas_fence
    ( void*,
  void*,
  const void* );


np_ac_i686_atomic_dwcas_fence PROC
  push esi
  push ebx
  mov esi, [esp + 16]
  mov eax, [esi]
  mov edx, [esi + 4]
  mov esi, [esp + 20]
  mov ebx, [esi]
  mov ecx, [esi + 4]
  mov esi, [esp + 12]
  lock cmpxchg8b qword ptr [esi]
  jne np_ac_i686_atomic_dwcas_fence_fail
  xor eax, eax
  pop ebx
  pop esi
  ret

np_ac_i686_atomic_dwcas_fence_fail:
  mov esi, [esp + 16]
  mov [esi + 0], eax
  mov [esi + 4], edx
  mov eax, 1
  pop ebx
  pop esi
  ret
np_ac_i686_atomic_dwcas_fence ENDP


Even with a single core system you can have pre-emptive
multi-threading, or at least interrupt routines that may need to cooperate
    with other tasks on data.


and I don't think that C with just volatile gives you such guarantees.
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 15:09:15 2025
    From Newsgroup: comp.arch

    On 12/6/2025 9:22 AM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.

You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.

However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

A bit hard to tell, because of 2 things::
a) I carry around the thread priority, and when interference occurs,
   the higher priority thread wins--on ties, the already-accessing thread wins.
b) live-lock is resolved (or not) by the caller of these routines, not
   by these routines themselves.

Hummm... Iirc, I was able to cause damage to a strong CAS. It was around
20 years ago. A thread was running strong CAS in a tight loop. I counted
success vs failure, then allowed some other threads to alter the
target word with random data. The failure rate for the CAS increased.
Actually, I think cmpxchg, cmpxchg8b, cmpxchg16b, and the strange one on
Itanium. Cannot remember it right now. cmp8xchg16? Or some shit.

    Well, they would hit a bus lock if they failed too many times. I think
    Scott knows about it.
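
A minimal C11 reconstruction of that experiment, assuming <threads.h>
and <stdatomic.h> are available (thread and iteration counts are
arbitrary):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <threads.h>

static atomic_ulong target;
static atomic_bool  stop;

/* Other thread: scribble random data on the target word. */
static int scribbler( void *arg )
{
    (void)arg;
    while( !atomic_load( &stop ) )
        atomic_store( &target, (unsigned long)rand() );
    return 0;
}

int main( void )
{
    thrd_t t;
    unsigned long ok = 0, fail = 0;

    thrd_create( &t, scribbler, NULL );
    for( long i = 0; i < 10000000; i++ )
    {
        unsigned long expect = atomic_load( &target );
        /* strong CAS in a tight loop, counting success vs failure */
        if( atomic_compare_exchange_strong( &target, &expect, expect + 1 ) )
            ok++;
        else
            fail++;
    }
    atomic_store( &stop, true );
    thrd_join( t, NULL );
    printf( "success %lu, failure %lu\n", ok, fail );
    return 0;
}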
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 15:17:04 2025
    From Newsgroup: comp.arch

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>> affects the hardware.-a So volatile writes are ordered at the C level, >>>>>> but that says nothing about how they might progress through storage >>>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >>>>> on hardware with weaker memory ordering than sequential consistency". >>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>> guarantees that are as good on multi-core machines as on single-core >>>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.


The functions below rely on more than that - to make that work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?
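
A hypothetical x86-64 sketch of exactly that (GCC/Clang inline asm; the
misaligned cast is formally UB in C, which is part of the point):

#include <stdint.h>

static _Alignas(64) unsigned char buf[128];

/* LOCKed xadd; returns the previous value of *p. */
static inline uint32_t locked_xadd( volatile uint32_t *p, uint32_t v )
{
    __asm__ __volatile__( "lock xaddl %0, %1"
                          : "+r"(v), "+m"(*p)
                          :
                          : "memory" );
    return v;
}

int main( void )
{
    /* bytes 62..65 straddle the line boundary at offset 64, so the
       LOCKed RMW cannot be satisfied by a cache-line lock and the
       core falls back to a bus lock (a "split lock"; recent cores
       can be configured to fault on this instead). */
    volatile uint32_t *split = (volatile uint32_t *)(buf + 62);
    locked_xadd( split, 1 );
    return 0;
}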

    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 16:08:03 2025
    From Newsgroup: comp.arch

    On 12/6/2025 1:36 PM, Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    Interestingly, Linux restartable sequences allow for acquisition of
    a lock with no membarrier or atomic instruction on the fast path,
    at the cost of a syscall on the slow path (no free lunch...)

    But you also need assembler to do it.

    An example is, for example, at https://gitlab.ethz.ch/extra_projects/cpu-local-lock


I need to read more about them, but they kind of remind me of an
asymmetric mutex, or rwmutex: ones that use a remote membar on the slow
path. Iirc, FlushProcessWriteBuffers on Windows, and synchronize_rcu or
membarrier on Linux.
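
A rough Linux sketch of that asymmetric idea, assuming the membarrier()
syscall is available (the helper names and the fast/slow split are
hypothetical):

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier_sys( int cmd, unsigned flags )
{
    return (int)syscall( __NR_membarrier, cmd, flags, 0 );
}

/* Once at startup. */
static void asym_init( void )
{
    membarrier_sys( MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0 );
}

/* Fast side: pays only a compiler barrier per operation; ordering
   against the slow side is established by the slow side's syscall. */
static inline void asym_fence_fast( void )
{
    __asm__ __volatile__( "" ::: "memory" );
}

/* Slow side (rare): force a full memory barrier on every running
   thread of this process -- the "remote membar". */
static void asym_fence_slow( void )
{
    membarrier_sys( MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0 );
}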
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 16:36:59 2025
    From Newsgroup: comp.arch

    On 12/6/2025 10:07 AM, MitchAlsup wrote:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type oldp, type_t oldq,
              type *p,   type_t *q,
              type newp, type newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

In my 40 years of SMP OS/HV work, I don't recall a
situation where 'MoveElement' would be useful or
required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

BOOLEAN InsertElement( Element *el, Element *to )
{
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        el->next = tn;
        el->prev = to;
        to->next = el;
        esmLOCKstore( tn->prev, el );
        return TRUE;
    }
    return FALSE;
}

BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

Atomic add/sub are useful. The other atomic math operations (min, max,
etc) may be useful in certain cases as well.
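
For instance, insertion at the head of a singly linked list needs only
the one CAS (a C11 sketch, not from the thread):

#include <stdatomic.h>

typedef struct Node { struct Node *next; } Node;

/* Treiber-style push: publish n as the new head; on interference
   the CAS fails, h is reloaded, and we retry. */
void push( _Atomic(Node *) *head, Node *n )
{
    Node *h = atomic_load( head );
    do {
        n->next = h;
    } while( !atomic_compare_exchange_weak( head, &h, n ) );
}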

Have you ever read about KCSS (k-compare single-swap)?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143
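
For readers who haven't: KCSS atomically checks k locations against
expected values and swaps just one of them. A lock-based *specification*
of the semantics (the whole point of the paper is achieving this without
such a lock):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t kcss_spec_lock = PTHREAD_MUTEX_INITIALIZER;

/* Succeeds, and swaps *addr[0] to newv, only if all k locations
   still hold their expected values. */
bool KCSS( size_t k, void **addr[], void *expect[], void *newv )
{
    bool ok = true;
    pthread_mutex_lock( &kcss_spec_lock );
    for( size_t i = 0; i < k; i++ )
        if( *addr[i] != expect[i] )
        {
            ok = false;
            break;
        }
    if( ok )
        *addr[0] = newv;
    pthread_mutex_unlock( &kcss_spec_lock );
    return ok;
}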
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Dec 8 10:07:25 2025
    From Newsgroup: comp.arch

    On 06/12/2025 18:44, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


The functions below rely on more than that - to make that work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.


    That's what I assumed.

    Certainly there are situations where it can be helpful to have longer
    atomic reads and writes. I am not so sure about allowing 8 KB atomic accesses, especially in a system with multiple cores - that sounds like letting user programs DoS everything else on the system.

    These is no bus!

    I think there's a typo or some missing words there?


    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

    So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set for free.

    If I understand you correctly here, you basically have a "load-reserve / store-conditional" sequence as commonly found in RISC architectures, but
    you have the associated loop built into the hardware? I can see that potentially improving efficiency, but I also find it very difficult to
    read or write C code that has hidden loops. And I worry about how it
    would all work if another thread on the same core or a different core
    was running similar code in the middle of these sequences. It also
    reduces the flexibility - in some use-cases, you want to have software
    limits on the number of attempts of a lr/sc loop to detect serious synchronisation problems.
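
The software-limit pattern being described, sketched with an ordinary
CAS loop (C11; MAX_TRIES is an arbitrary hypothetical bound):

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_TRIES 1000

/* Returns false after MAX_TRIES failed attempts so the caller can
   escalate (log, back off, fall back to a blocking lock, ...). */
bool bounded_increment( _Atomic unsigned *ctr )
{
    unsigned v = atomic_load( ctr );
    for( int i = 0; i < MAX_TRIES; i++ )
        if( atomic_compare_exchange_weak( ctr, &v, v + 1 ) )
            return true;
    return false;
}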


    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't
    require additional hardware.

    I am using the "Miss Buffer" as the point of monitoring for interference.
    a) it already has to monitor "other hits" from outside accesses to deal
    with the coherence mechanism.
b) the esm additions to the Miss Buffer are on the order of 2%

    c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

BOOLEAN DCAS( type oldp, type_t oldq,
              type *p,   type_t *q,
              type newp, type newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an
    architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

Even with a single core system you can have pre-emptive multi-threading,
or at least interrupt routines that may need to cooperate with other
tasks on data.

and I don't think that C with just volatile gives you such guarantees.
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Dec 8 10:12:19 2025
    From Newsgroup: comp.arch

    On 08/12/2025 00:17, Chris M. Thomasson wrote:
    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide
    enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware.-a So volatile writes are ordered at the C >>>>>>> level,
    but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.

The functions below rely on more than that - to make that work, as far
as I can see, you need the first "esmLOCKload" to lock the bus and
also lock the core from any kind of interrupt or other pre-emption,
lasting until the esmLOCKstore instruction. Or am I missing something
here?

Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?


    Yes, I meant "lock the bus" - but I might have been overcautious.
    However, it seems there is a hidden hardware loop here - the
    esmLOCKstore instruction can fail and and the processor jumps back to
    the first esmLOCKload instruction. With that, you don't need to block
    other code from running or accessing the bus.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Dec 8 07:25:42 2025
    From Newsgroup: comp.arch

    <snip>
BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

I am assuming the esmLOCKstore() just unlocks what was previously locked
and the stores have already happened by that time.

    It would seem that esmINTERFERENCE() would indicate that everybody with
    access out to the coherence point has agreed to the locked area? Does
    that require that all devices respect the esmINTERFERENCE()?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 8 04:32:39 2025
    From Newsgroup: comp.arch

    On 12/8/2025 1:12 AM, David Brown wrote:
    On 08/12/2025 00:17, Chris M. Thomasson wrote:
    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>> that
    affects the hardware.-a So volatile writes are ordered at the C >>>>>>>> level,
    but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
    You describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would
    provide
    guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.

The functions below rely on more than that - to make that work, as far
as I can see, you need the first "esmLOCKload" to lock the bus and
also lock the core from any kind of interrupt or other pre-emption,
lasting until the esmLOCKstore instruction. Or am I missing
something here?

    Lock the BUS? Only when shit hits the fan. What about locking the
    cache line? Actually, I think we can "force" an x86/x64 to lock the
    bus if we do a LOCK'ed RMW on memory that straddles cache lines?


    Yes, I meant "lock the bus" - but I might have been overcautious.
    However, it seems there is a hidden hardware loop here - the
    esmLOCKstore instruction can fail and and the processor jumps back to
    the first esmLOCKload instruction.-a With that, you don't need to block other code from running or accessing the bus.



Humm.. For some damn reason it reminds me of a multi lock thing I did a
while back. Called it the multex. Consisted of a table of locks. A
thread would take the addresses it wanted to lock, hash them into the
table, remove duplicates, sort them, and take them all without any
fear of deadlock.

    (read all when you get some free time to burn...) https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

    It kind of seems like it might want to work with Mitch's scheme in a
    loose sense?
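
A guess at what that looks like in C with pthreads (table size, hash,
and names are all hypothetical; each tbl[] slot must be
pthread_mutex_init'ed once at startup, omitted here):

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TBL 256
static pthread_mutex_t tbl[TBL];

static int cmp_size_t( const void *a, const void *b )
{
    size_t x = *(const size_t *)a, y = *(const size_t *)b;
    return (x > y) - (x < y);
}

/* Hash n addresses into lock-table slots, sort, and deduplicate;
   returns the number of distinct slots left in slot[]. */
size_t multex_plan( void *addr[], size_t n, size_t slot[] )
{
    size_t m = 0;
    for( size_t i = 0; i < n; i++ )
        slot[i] = ((uintptr_t)addr[i] >> 4) % TBL;
    qsort( slot, n, sizeof slot[0], cmp_size_t );
    for( size_t i = 0; i < n; i++ )
        if( i == 0 || slot[i] != slot[m-1] )
            slot[m++] = slot[i];
    return m;
}

/* Taking the sorted, deduplicated slots in ascending order means two
   threads can never hold locks in opposite orders: no deadlock. */
void multex_lock( const size_t slot[], size_t m )
{
    for( size_t i = 0; i < m; i++ )
        pthread_mutex_lock( &tbl[slot[i]] );
}

void multex_unlock( const size_t slot[], size_t m )
{
    while( m-- )
        pthread_mutex_unlock( &tbl[slot[m]] );
}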
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Dec 8 08:23:59 2025
    From Newsgroup: comp.arch

    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>
BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally
    sufficient.

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.


I am assuming the esmLOCKstore() just unlocks what was previously locked
and the stores have already happened by that time.

    There is no "locking" in the sense of preventing any accesses.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Dec 8 17:14:11 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across
    buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same
    address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.

Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitor covers an
implementation-defined range surrounding the target address, and the
store will fail if any other agent has modified any byte within the
exclusive range.
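
For reference, the AArch64 shape of that pattern, as a hypothetical
GCC/Clang inline-asm sketch (an atomic 64-bit add via LDXR/STXR):

#include <stdint.h>

/* Returns the previous value of *p; the stxr status register is
   nonzero if the exclusive monitor was lost, so we retry. */
static uint64_t atomic_add64( volatile uint64_t *p, uint64_t v )
{
    uint64_t old, tmp;
    uint32_t fail;
    __asm__ __volatile__(
        "1: ldxr  %0, [%3]       \n"   /* load-exclusive           */
        "   add   %1, %0, %4     \n"
        "   stxr  %w2, %1, [%3]  \n"   /* store-exclusive          */
        "   cbnz  %w2, 1b        \n"   /* monitor lost: try again  */
        : "=&r"(old), "=&r"(tmp), "=&r"(fail)
        : "r"(p), "r"(v)
        : "memory" );
    return old;
}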

    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:06:34 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>> affects the hardware.-a So volatile writes are ordered at the C level, >>>>>> but that says nothing about how they might progress through storage >>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>
    You describe in many words and not really to the point what can be >>>>> explained concisely as: "volatile says nothing about memory ordering >>>>> on hardware with weaker memory ordering than sequential consistency". >>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>> guarantees that are as good on multi-core machines as on single-core >>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>> atomic operations beyond load and store (even on single-core systems), >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.

The functions below rely on more than that - to make that work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Dec 8 20:15:13 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >> >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >> >>>>>> affects the hardware.-a So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >> >>>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >> >>>>> on hardware with weaker memory ordering than sequential consistency". >> >>>>> If hardware guaranteed sequential consistency, volatile would provide >> >>>>> guarantees that are as good on multi-core machines as on single-core >> >>>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >> >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.

The functions below rely on more than that - to make that work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    What if two processors have intersecting (but not fully overlapping)
    sets of those 8 cache lines?

    Can you guarantee forward progress?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:20:27 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 06/12/2025 18:44, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


The functions below rely on more than that - to make that work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.


    That's what I assumed.

    Certainly there are situations where it can be helpful to have longer
    atomic reads and writes. I am not so sure about allowing 8 KB atomic accesses, especially in a system with multiple cores - that sounds like letting user programs DoS everything else on the system.

    These is no bus!

    I think there's a typo or some missing words there?

There is a fabric-based interconnect to transport data-transfer requests
around the system, where everyone connected to the transport can send
a new request, receive a response, and receive a SNOOP simultaneously.

There is NO single point on the fabric one can GRAB to prevent other
sections of the fabric from doing their prescribed transport duties.

    There is a memory ordering protocol in L3/DRAM-controller that prevents
    more than one "SNOOP per cache line" from being "in progress" at the
    same time.


    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

    So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set for free.

    If I understand you correctly here, you basically have a "load-reserve / store-conditional" sequence as commonly found in RISC architectures, but
    you have the associated loop built into the hardware?

    In effect, yes. I have a multi-{LoadLocked StoreConditional} scheme
    as found in other RISC architectures with several small/big changes::
    a) you get up to 8 LLs
    b) the last SC causes the rest of the system to see all the memory
    changes at the same time (or nobody sees any changes).
    c) The ATOMIC sequence cannot persist across an exception or interrupt.
    d) only participating memory lines have the ATOMIC property.

    And yes, control transfer is built-into the architecture.

    I can see that potentially improving efficiency, but I also find it very difficult to
    read or write C code that has hidden loops. And I worry about how it
    would all work if another thread on the same core or a different core
    was running similar code in the middle of these sequences. It also
    reduces the flexibility - in some use-cases, you want to have software limits on the number of attempts of a lr/sc loop to detect serious synchronisation problems.

    In this case, said SW would use the Branch-on-interference instruction.


    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't >> require additional hardware.

    I am using the "Miss Buffer" as the point of monitoring for interference. a) it already has to monitor "other hits" from outside accesses to deal
    with the coherence mechanism.
    b) that esm additions to Miss Buffer are on the order of 2%

    c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

BOOLEAN DCAS( type oldp, type_t oldq,
              type *p,   type_t *q,
              type newp, type newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an >> architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

Even with a single core system you can have pre-emptive multi-threading,
or at least interrupt routines that may need to cooperate with other
tasks on data.

and I don't think that C with just volatile gives you such guarantees.
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:30:34 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    <snip>
BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

esmLOCKload sets up monitors (in Miss Buffers) that detect SNOOPs to
the participating cache lines.

    esmINTERFERENCE sets up a block of code that either executes in its
    entirety or fails in its entirety--and transfers control.

    In "certain circumstances" the code inside the esmINTERFERENCE block
    are allowed to NaK SNOOPs to those lines. So, if interference happens
    this late, you can effectively tell requestor "Yes, I have that cache
    line, No you cannot have it right now".

If the requestor gets a NaK and was attempting an ATOMIC event,
the event fails. If the requestor was NOT attempting one, it resubmits
the request. In both cases, the thread causing the interference is the
one delayed, while the one performing the event has a higher probability
of success.

I am assuming the esmLOCKstore() just unlocks what was previously locked
and the stores have already happened by that time.

    Yes, it is the terminal sentinel.

    It would seem that esmINTERFERENCE() would indicate that everybody with access out to the coherence point has agreed to the locked area? Does
    that require that all devices respect the esmINTERFERENCE()?

    I can see you are getting at something subtle, here. I cannot quite grasp
    what it might be.

    Can you ask the above again but use different words ?!?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:35:01 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.

Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitor covers an
implementation-defined range surrounding the target address, and the
store will fail if any other agent has modified any byte within the
exclusive range.

    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.

Over in the Miss Buffer there are (at least) 8 miss buffers. Each miss
buffer has to monitor inbound messages for requests (SNOOPs) to its
entry.

So, each MB entry has a bit to tell if it is participating in an event.
esmINTERFERENCE is a way to sample all participating MB entries
simultaneously; and in addition, esmINTERFERENCE is part of what enables
the NaKing of SNOOP requests.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 21:58:00 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    ERROR "unexpected byte sequence starting at index 736: '\xC2'" while decoding:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware.|e-a So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >> >>>>>> queues, caches, inter-processor communication buses, or whatever. >> >>>>>
    You describe in many words and not really to the point what can be >> >>>>> explained concisely as: "volatile says nothing about memory ordering >> >>>>> on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core >> >>>>> machines.

    However, for concurrent manipulations of data structures, one wants >> >>>>> atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >> >>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32|e-a|e-a DWs|e-a|e-a as a single ATOMIC instruction.
    MM|e-a|e-a|e-a|e-a|e-a can MOV|e-a|e-a up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting >> > until the esmLOCKstore instruction.|e-a Or am I missing something here? >>
    Lock the BUS? Only when shit hits the fan. What about locking the cache >> line? Actually, I think we can "force" an x86/x64 to lock the bus if we >> do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    What if two processors have intersecting (but not fully overlapping)
    sets of those 8 cache lines?

    Can you guarantee forward progress?

    Yes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 8 16:31:08 2025
    From Newsgroup: comp.arch

    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.

Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitor covers an
implementation-defined range surrounding the target address, and the
store will fail if any other agent has modified any byte within the
exclusive range.

Any mutation of the reservation granule?




    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Dec 9 09:13:54 2025
    From Newsgroup: comp.arch

    On 08/12/2025 17:23, Stephen Fuld wrote:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes
    as a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across buffers and bus bridges. It would have to go to the memory
    coherence point. Otherwise, some other device using a bridge could
    update the same address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.


    Yes, that is correct (as far as I understand it now). The critical part
    is the hidden hardware loop that was not mentioned or indicated in the original code.

    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)
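
A minimal C11 sketch of the two, assuming mtx_init() has been called
once at startup: the same increment done by prevention (a lock) and by
detection-plus-retry (a CAS loop):

#include <stdatomic.h>
#include <threads.h>

static mtx_t    lock;                   /* prevention              */
static unsigned counter_locked;

void inc_locked( void )
{
    mtx_lock( &lock );                  /* nothing can interleave  */
    counter_locked++;
    mtx_unlock( &lock );
}

static _Atomic unsigned counter_cas;    /* detection + retry       */

void inc_cas( void )
{
    unsigned v = atomic_load( &counter_cas );
    /* on conflict the CAS fails, v is reloaded, and we retry */
    while( !atomic_compare_exchange_weak( &counter_cas, &v, v + 1 ) )
        ;
}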

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)


    I am assuming the esmLockStore() just unlocks what was previously
    locked and the stores have already happened by that time.

    There is no "locking" in the sense of preventing any accesses.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 19:15:48 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 08/12/2025 17:23, Stephen Fuld wrote:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes
    as a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across buffers and bus bridges. It would have to go to the memory
    coherence point. Otherwise, some other device using a bridge could
    update the same address range in the middle of an update.

    ---------------------------------
I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.


    Yes, that is correct (as far as I understand it now). The critical part
    is the hidden hardware loop that was not mentioned or indicated in the original code.
    ---------------------------------

    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

Consider a server-scale esm implementation. In such an implementation, esm
is enhanced with a system* arbiter.

After any successful ATOMIC event esm reverts to "Optimistic" mode. In
optimistic mode, esm races through the code as fast as possible, expecting
no interference. When interference is detected, the event fails and a HW
counter is incremented. The failure diverts control to the ATOMIC control
point. We still have the property that all participating memory locations
become visible at the same instant.

    At this point the core is in "careful" mode, core becomes sequentially consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails
    and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be
    performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    At this point the core is in "Slow and Methodological" mode. Now, after
    all participating cache lines have been touched, all the physical pointers
    are bundled into a message and sent to the system arbiter. System arbiter examines each cache line address and if no-other-core has a reservation
    on ANY of them, then system arbiter installs said reservations, and
    returns "success". At this point, core is allowed to NaK interfering
    accesses. This event WILL SUCCEED. After the event is complete, the
    termination of the event at the core, takes the same bundle of addresses
    and sends it back to system arbiter; who removes them from reservation.

    Optimistic mode takes no more cycles than if the memory references were
    not ATOMIC.

    I should also note:: none of this state is preserved across interrupts
    or exceptions. So, an interrupt or exception causes the event to fail
    prior to control transfer. Interrupts do not care about this control
    transfer. Exception control transfer in My 66000 packs everything the
    exception handler needs in registers, so having IP point at ATOMIC
    control point with the registers setup for page fault does not cause
    exception handler any issues whatsoever.

    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound   [Address]
    OutBound  [Address]

    operates like::

    try_again:
        InBound   [Address]
        BIN       try_again
        OutBound  [Address]

And why clutter up asm with extraneous labels and require extra instructions?
--- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Dec 9 20:51:26 2025
    From Newsgroup: comp.arch

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the
    situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a
    hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.
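
As a hedged sketch of what such renamed intrinsics might look like at a
call site (the signatures here are invented for illustration; the real
esm intrinsics may differ):

#include <stdint.h>

uint64_t load_and_set_retry_point(volatile uint64_t *p);    /* was esmLOCKload()  */
void     store_or_retry(volatile uint64_t *p, uint64_t v);  /* was esmLOCKstore() */

void atomic_increment(volatile uint64_t *p)
{
    /* the load both reads the value and establishes the HW retry point */
    uint64_t v = load_and_set_retry_point(p);
    /* the store either commits the event, or HW re-runs from the load  */
    store_or_retry(p, v + 1);
}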


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 21:28:47 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)

The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its
    entirety before the next instruction appears to start executing.

1st:: one cannot single step through an ATOMIC event; if you enter an
ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
    has executed.

2nd:: the only way to debug an event is to have a buffer of SW locations
that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially
consistent manner (architecturally)--and can be examined outside the
event; whereas the participating lines are either all written
instantaneously or not modified at all.

So, here we have non-participating STs having been written and older
participating STs have not.
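
A hedged C sketch of that debugging technique; the esm intrinsic
signatures and the trace-buffer layout are assumptions for illustration:

#include <stdint.h>

extern uint64_t esmLOCKload(volatile uint64_t *p);              /* participating load  */
extern void     esmLOCKstore(volatile uint64_t *p, uint64_t v); /* participating store */

uint64_t trace[64];   /* ordinary memory: non-participating */
unsigned ti;

void traced_update(volatile uint64_t *p)
{
    uint64_t v = esmLOCKload(p);   /* participating: all-or-nothing      */
    trace[ti++ & 63] = v;          /* non-participating ST: survives a
                                      failed event, examinable outside   */
    esmLOCKstore(p, v + 1);        /* participating: commits the event   */
}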

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.

4th:: one cannot test esm with a random code generator, since the
probability that the random code generator creates a legal esm event is
exceedingly low.
--- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Dec 9 13:55:12 2025
    From Newsgroup: comp.arch

    On 12/9/2025 11:15 AM, MitchAlsup wrote:

    snip


    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

Consider a server-scale esm implementation. In such an implementation, esm
    is enhanced with a system* arbiter.

    After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting
no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.
    At this point the core is in "careful" mode,

    I am missing some understanding here, about this "counter". This
    paragraph seems to indicate that after one failure, the core goes into "careful" mode, but if that were true, you wouldn't need a "counter",
    just a mode bit. So assuming it is a counter and you need "n" failures
    in a row to go into careful mode, is "n" hardwired or settable by
    software? What are the tradeoffs for smaller or larger values of "n"?

    core becomes sequentially
    consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails
    and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    Again, after a single failure in careful mode or n failures? If n, is
    it the same value of n as for the transition from optimistic to careful
    mode? Same questions as before about who sets the value and is it
    software changeable?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 22:52:31 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/9/2025 11:15 AM, MitchAlsup wrote:

    snip


    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

Consider a server-scale esm implementation. In such an implementation, esm is enhanced with a system* arbiter.

After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.
    At this point the core is in "careful" mode,

    I am missing some understanding here, about this "counter". This
    paragraph seems to indicate that after one failure, the core goes into "careful" mode, but if that were true, you wouldn't need a "counter",
    just a mode bit. So assuming it is a counter and you need "n" failures
    in a row to go into careful mode, is "n" hardwired or settable by
    software? What are the tradeoffs for smaller or larger values of "n"?

2-bits; 3-states--not part of saved thread state.

    core becomes sequentially
    consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    Again, after a single failure in careful mode or n failures? If n, is
    it the same value of n as for the transition from optimistic to careful mode? Same questions as before about who sets the value and is it
    software changeable?

    3-state counter::

    00 -> Optimistic
    01 -> Careful
10 -> Slow and methodical

    success -> counter = 00;
    failure -> counter++;
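
In C terms, a minimal model of that counter (illustrative only; the real
thing is hardware state, not software):

enum esm_mode { OPTIMISTIC, CAREFUL, SLOW_AND_METHODICAL };

static enum esm_mode mode = OPTIMISTIC;   /* counter = 00 */

static void esm_event_done(int success)
{
    if (success)
        mode = OPTIMISTIC;                /* success -> counter = 00 */
    else if (mode != SLOW_AND_METHODICAL)
        mode++;                           /* failure -> counter++    */
}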
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Dec 10 10:07:19 2025
    From Newsgroup: comp.arch

    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)

The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter
    nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
    "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
    have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

    IME, most instructions on most processors are indivisible, but most
    processors have some instructions that are not. For example, processors
    can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

    But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming,
    pipelining, speculative execution, dependency tracking, and all the rest
    of it.

1st:: one cannot single step through an ATOMIC event; if you enter an
ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the
    device.

2nd:: the only way to debug an event is to have a buffer of SW locations
that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially
consistent manner (architecturally)--and can be examined outside the
event; whereas the participating lines are either all written
instantaneously or not modified at all.

    So, here we have non-participating STs having been written and older participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics). My main concern was
    the disconnect between how the code was written and what it actually does.

    4th:: one cannot test esm with a random code generator, since the probability that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is
    always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Dec 10 08:51:16 2025
    From Newsgroup: comp.arch

    On 12/10/2025 1:07 AM, David Brown wrote:
    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)

The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)

How exactly do you inform the programmer that:

        InBound   [Address]
        OutBound  [Address]

operates like::

try_again:
        InBound   [Address]
        BIN       try_again
        OutBound  [Address]

And why clutter up asm with extraneous labels and require extra
instructions.

The most obvious answer is that in any code that uses these features,
good comments are essential so that readers can see what is happening.

Another method would be to use better names for the intrinsics, as seen
at the C (or other HLL) level. (Assembly instruction names don't matter
nearly as much.)

So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
"load_and_set_retry_point()" and "store_or_retry()". Feel free to think
of better names, but that would at least give the reader a clue that
there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its
    entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

IME, most instructions on most processors are indivisible, but most processors have some instructions that are not. For example, processors can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming, pipelining, speculative execution, dependency tracking, and all the rest
    of it.

1st:: one cannot single step through an ATOMIC event; if you enter an
ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the device.

2nd:: the only way to debug an event is to have a buffer of SW locations
that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially
consistent manner (architecturally)--and can be examined outside the
event; whereas the participating lines are either all written
instantaneously or not modified at all.

    So, here we have non-participating STs having been written and older
    participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and
    interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,

Yes, but ISTM there is a hardware limit on the number of retries - it
is two retries, as the third try (second retry) is guaranteed to
succeed, albeit at a higher cost (in time and interference with other
threads/processes) compared to the earlier tries.


    or add SW tracking of retry counts for metrics).

Again, ISTM that you could do some software tracking by using non-
participating stores within the locked area to save information outside
the locked area. I haven't thought through the cost/benefit of this,
how much to save, etc.

    But I am not sure that the "escalation" to a more "intrusive" mechanism
    upon a single failure is optimal. Perhaps it would be better to retry
    once or twice using the current mechanism. I don't have a good feeling
    for what is optimal here, and to what extent the optimal choice would be workload dependent.


    My main concern was
    the disconnect between how the code was written and what it actually does.

4th:: one cannot test esm with a random code generator, since the
probability that the random code generator creates a legal esm event is
exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is always very difficult.

    Yup!


    You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    :-)
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 10 20:10:43 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)

The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)
    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
at the C (or other HLL) level. (Assembly instruction names don't matter
nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
    "load_and_set_retry_point()" and "store_or_retry()". Feel free to think >> of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within esm. Where vonNeumann means: that every instruction is executed in its entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
    have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

    An ATOMIC event is a series of instructions that appear to be performed
    all at once--as if the whole series was "indivisible".

    IME, most instructions on most processors are indivisible, but most processors have some instructions that are not. For example, processors
    can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

    Go in the other direction, where a series of instructions HAS TO APPEAR
    as if executed instantaneously.

    But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming, pipelining, speculative execution, dependency tracking, and all the rest
    of it.

None of those things is ARCHITECTURAL--esm is an architectural window into
how to program ATOMIC events such that no future generation of the ISA has
to continuously add more synchronization instructions. One can program
every known industrial and academic synchronization primitive in esm
without ever adding new synchronization instructions.

1st:: one cannot single step through an ATOMIC event; if you enter an
ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the device.

No, it is the nature of executing a series of instructions as if instantaneously.

2nd:: the only way to debug an event is to have a buffer of SW locations
that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially
consistent manner (architecturally)--and can be examined outside the
event; whereas the participating lines are either all written
instantaneously or not modified at all.

    So, here we have non-participating STs having been written and older participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).

esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute
before/between any real instructions.

    My main concern was
    the disconnect between how the code was written and what it actually does.

    There is a 26 page specification the programmer needs to read and understand. This includes things we have not talked about--such as::
    a) terminating an event without writing anything
    b) proactively minimizing future interference
    c) modifications to cache coherence model
    at the architectural level.

The architectural specification allows for various scales of µArchitecture
to independently choose how to implement esm and provide the architectural
features at the SW level. For example, the kinds of esm activities for a
1-wide In-Order µController are vastly different than those suitable for a
server-scale rack of processor ensembles. What we want is one SW model that covers
    the whole gamut.

    4th:: one cannot test esm with a random code generator, since the probability
    that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Right at Christmas time !! {Ask me how I know}.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 11 20:26:09 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).

esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute
before/between any real instructions.

    My main concern was
    the disconnect between how the code was written and what it actually does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and you don't want optimisers re-arranging things too much.

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    There is a 26 page specification the programmer needs to read and understand.
    This includes things we have not talked about--such as::
    a) terminating an event without writing anything
    b) proactively minimizing future interference
    c) modifications to cache coherence model
    at the architectural level.

    Fair enough. This is not a minor or simple feature!

No, it is a design that allows the ISA to remain static while all sorts of synchronization stuff gets written, tested, and tuned.


The architectural specification allows for various scales of µArchitecture to independently choose how to implement esm and provide the architectural features at the SW level. For example, the kinds of esm activities for a 1-wide In-Order µController are vastly different than those suitable for a server-scale rack of processor ensembles. What we want is one SW model that covers the whole gamut.

    4th:: one cannot test esm with a random code generator, since the probability
    that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is
    always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Right at Christmas time !! {Ask me how I know}.

    We can gather round the fire, and Grampa can settle in his rocking chair
    to tell us war stories from the olden days :-)

    A good story is always nice, so go for it!

Year:: 1997, time:: 7 days before Christmas. Situation:: the customer is
having (and has had) strange bugs that happen about once a week.
The customer is unhappy; we have had a senior engineer on site for
4 months without forward progress. We were told "You don't come home
until the problem is fixed".

System:: 2 (or more) of our cache coherent motherboards, connected
with a proven cache coherent bus.

On the flight from Austin to Manchester, England, I decided that what
we had was a physics experiment. So, when we arrived, I had their SW
guy code up a routine that, as soon as it got a time slice, would
signal that it no longer needed time, while we hooked up the logic
analyzer to our motherboards and to their bus. When the SW was ready
(about 30 minutes) we tried the case--instantly, the time between
showings of the bug went from once a week to milliseconds. We spent the
afternoon taking logic analyzer traces, and went to dinner.

The next day, we went through the traces with a fine-tooth comb and
found a smoking gun--so we ran more experiments, and this same smoking
gun was found in each trace. After a couple of hours, we found that
their proven coherent bus was allowing 1 single cycle where our bus
could be seen in an inconsistent state, and it was only a dozen
cycles downstream that the crash transpired.

It turned out that their bus was only coherent when the attached bus
was slower than 4 cycles in responding to a "random coherent message",
whereas our bus was timed at 2 cycles for this response.

So, we took apart their FPGA which ran the bus, found out how to
delay one signal, and reprogrammed it--ONLY to run into another message
that was off by 1 or 2 cycles. This one took a whole day to find and
program around.

    We both made it home for Christmas, and in some part saved the company...

    (We once had a system where there was a bug that not only triggered only
    at the customer's site, but did so only on the 30th of September. It
    took years before we made the connection to the date and found the bug.)


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Dec 11 20:47:12 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

Any example? This would definitely go against what I would consider
to be reasonable for an assembler. gas certainly does not do so.

    What _would_ be useful on occasion would be an assembler which
    could do register assignment, for example for a small function.
    It would be OK if this were to issue an error if there were too
    many variables for assignment.

    Does anybody know of such a beast?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Dec 11 23:51:26 2025
    From Newsgroup: comp.arch

    On Thu, 11 Dec 2025 20:26:09 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    We both made it home for Christmas, and in some part saved the
    company...


Not for long, though... Wasn't it dead anyway within 6-7 months?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:00:53 2025
    From Newsgroup: comp.arch

    On 12/10/2025 1:07 AM, David Brown wrote:
    [...]
Testing and debugging any kind of locking or atomic access solution is always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

Murphy's Law. Actually, have you ever messed around with Relacy Race
Detector? It's pretty interesting.


    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:03:29 2025
    From Newsgroup: comp.arch

    On 12/11/2025 3:02 PM, Chris M. Thomasson wrote:
    On 12/11/2025 1:05 AM, David Brown wrote:
    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,
or add SW tracking of retry counts for metrics).

esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute
before/between any real instructions.

My main concern was
    the disconnect between how the code was written and what it actually
    does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and
    you don't want optimisers re-arranging things too much.

Right. Way back before C/C++11 I would code all of my sensitive
lock/wait-free code in assembly.
    [...]




    Actually, I would turn off link-time optimization back then.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:02:40 2025
    From Newsgroup: comp.arch

    On 12/11/2025 1:05 AM, David Brown wrote:
    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,
or add SW tracking of retry counts for metrics).

esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute
before/between any real instructions.

My main concern was
    the disconnect between how the code was written and what it actually
    does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and you don't want optimisers re-arranging things too much.

    Right. Way back before C/C++ 11 I would code all of my sensitive lock/wait-free code in assembly.

    [...]



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 12 08:59:12 2025
    From Newsgroup: comp.arch

    On 11/12/2025 22:51, Michael S wrote:
    On Thu, 11 Dec 2025 20:26:09 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    We both made it home for Christmas, and in some part saved the
    company...


    Not for long so... Was not it dead anyway in the 6-7 months?


    This is why stories end with "they all lived happily ever after", and
    why sequel movies are almost always terrible! I liked the first story
    better.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:37:03 2025
    From Newsgroup: comp.arch

    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>>> affects the hardware.-a So volatile writes are ordered at the C level, >>>>>>>> but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
    You describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32-a-a DWs-a-a as a single ATOMIC instruction. >>>> MM-a-a-a-a-a can MOV-a-a up to 8192 bytes as a single ATOMIC instruction. >>>>

The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

Pretty flexible wrt implementing those exotic things back in the day,
experimental algos that need DCAS, KCSS, etc... A heck of a lot of
things can be accomplished with DWCAS, aka cmpxchg8b on a 32-bit system
or cmpxchg16b on a 64-bit system.

People would bend over backwards to get a DCAS, or NCAS. It would be
infested with strange indirection a la "descriptors", and involved a
shitload of atomic RMWs. CAS, DWCAS, XCHG and XADD can get a lot done.
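
A hedged sketch of a 64-bit DWCAS via GCC/Clang builtins; compile with
-mcx16 on x86-64 so cmpxchg16b can be emitted, and note the struct
layout and names are illustrative only:

#include <stdint.h>
#include <string.h>

typedef struct { uint64_t ptr; uint64_t tag; } ref_t;  /* 16 bytes; tag fights ABA */

/* target must be 16-byte aligned; returns nonzero on success */
static int dwcas(volatile unsigned __int128 *target, ref_t expected, ref_t desired)
{
    unsigned __int128 e, d;
    memcpy(&e, &expected, sizeof e);
    memcpy(&d, &desired, sizeof d);
    return __sync_bool_compare_and_swap(target, e, d);  /* lock cmpxchg16b */
}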
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:39:16 2025
    From Newsgroup: comp.arch

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>>> that
    affects the hardware.-a So volatile writes are ordered at the C >>>>>>>>> level,
    but that says nothing about how they might progress through >>>>>>>>> storage
    queues, caches, inter-processor communication buses, or whatever. >>>>>>>>
    You describe in many words and not really to the point what can be >>>>>>>> explained concisely as: "volatile says nothing about memory
    ordering
    on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would >>>>>>>> provide
    guarantees that are as good on multi-core machines as on single- >>>>>>>> core
    machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32-a-a DWs-a-a as a single ATOMIC instruction. >>>>> MM-a-a-a-a-a can MOV-a-a up to 8192 bytes as a single ATOMIC instruction. >>>>>

    The functions below rely on more than that - to make the work, as
    far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting >>>> until the esmLOCKstore instruction.-a Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    Pretty flexible wrt implementing those exotic things back in the day, experimental algos that need DCAS, KCSS, ect... A heck of a lot of
    things can be accomplished with DWCAS, aka cmpxchg8b on a 32 bit system.
    or cmpxchg16b on a 64-bit system.

    People would bend over backwards to get a DCAS, or NCAS. It would be infested with strange indirection ala d"escriptors", and involved a shit load of atomic RMW's. CAS, DWCAS, XCHG and XADD can get a lot done.

    Have you ever read about KCSS?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:47:50 2025
    From Newsgroup: comp.arch

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:
    [...]
People would bend over backwards to get a DCAS, or NCAS. It would be
infested with strange indirection a la "descriptors", and involved a
shitload of atomic RMWs. CAS, DWCAS, XCHG and XADD can get a lot done.

    I am trying to convey that a lot of neat algos do not even need the
    fancy DCAS, NCAS.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 12 23:39:53 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems. >>>>>>>>> Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>>> that
    affects the hardware.-a So volatile writes are ordered at the C >>>>>>>>> level,
    but that says nothing about how they might progress through >>>>>>>>> storage
    queues, caches, inter-processor communication buses, or whatever. >>>>>>>>
    You describe in many words and not really to the point what can be >>>>>>>> explained concisely as: "volatile says nothing about memory >>>>>>>> ordering
    on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would >>>>>>>> provide
    guarantees that are as good on multi-core machines as on single- >>>>>>>> core
    machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>>> atomic operations beyond load and store (even on single-core >>>>>>>> systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32-a-a DWs-a-a as a single ATOMIC instruction. >>>>> MM-a-a-a-a-a can MOV-a-a up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as
    far as
    I can see, you need the first "esmLOCKload" to lock the bus and also >>>> lock the core from any kind of interrupt or other pre-emption, lasting >>>> until the esmLOCKstore instruction.-a Or am I missing something here? >>>
    Lock the BUS? Only when shit hits the fan. What about locking the cache >>> line? Actually, I think we can "force" an x86/x64 to lock the bus if we >>> do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    Pretty flexible wrt implementing those exotic things back in the day, experimental algos that need DCAS, KCSS, ect... A heck of a lot of
    things can be accomplished with DWCAS, aka cmpxchg8b on a 32 bit system. or cmpxchg16b on a 64-bit system.

    People would bend over backwards to get a DCAS, or NCAS. It would be infested with strange indirection ala d"escriptors", and involved a shit load of atomic RMW's. CAS, DWCAS, XCHG and XADD can get a lot done.

    Have you ever read about KCSS?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143

    While I was not directly exposed to KCSS, I was exposed to the underlying
    need for multi-location Compare and Swap requirements, and provided a means
    to implement same in both ASF and ESM. {All of us (synchronization people)
    were so exposed. And a lot of academic ideas came out of those trends, too.}

In my case, I simply wanted a way "out" of inventing a new synchronization
primitive every ISA generation. What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 15:52:45 2025
    From Newsgroup: comp.arch

    On 12/6/2025 11:04 AM, Scott Lurndal wrote:
    [...]
    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++ threading functionality.

    Right. However, a DWCAS is important as well... Well, for me... This
    only works on contiguous words.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 15:56:52 2025
    From Newsgroup: comp.arch

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.

Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined
range surrounding the target address and the store will fail if any other
agent has modified any byte within the exclusive range.

Any mutation to the reservation granule?

    I forgot if a load from the reservation granule would cause a LL/SC to
    fail. I know a store would. False sharing in poorly written programs
    would cause it to occur. LL/SC experiencing live lock. This was back in
    my PPC days.
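
A common C mitigation for that false-sharing problem is to give each
contended variable its own cache line / reservation granule; a minimal
sketch, assuming a 64-byte granule:

#include <stdalign.h>
#include <stdatomic.h>

/* One granule per counter, so stores to a neighbor cannot
   keep killing this core's LL/SC reservation. */
struct padded {
    alignas(64) _Atomic long value;  /* struct is padded out to 64 bytes */
};
struct padded counters[8];           /* each element starts its own line */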

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Dec 13 09:31:05 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
to the cache coherence model (NaK) that indicates "Yes I have the line you referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress? My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.
    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 19:03:07 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.

Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined
range surrounding the target address and the store will fail if any other
agent has modified any byte within the exclusive range.

Any mutation to the reservation granule?

    I forgot if a load from the reservation granule would cause a LL/SC to
    fail. I know a store would. False sharing in poorly written programs
    would cause it to occur. LL/SC experiencing live lock. This was back in
    my PPC days.

A LD to the granule would cause loss of write permission, causing a long
delay to perform the SC and greatly increasing the probability of interference.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 19:12:28 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
to the cache coherence model (NaK) that indicates "Yes I have the line you referenced, but, no you can't have it right now" in order to strengthen the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress?

    The allowance of a NaK is only available under somewhat special
    circumstances::
    a) in Careful mode:: when core can see that all STs have write permission
    and data is present, NaKs allow the Modification part to run to
    completion.
    b) In Slow and Methodical mode:: core can NaK any access to any of its
    cache lines--preventing interference.

    My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.

    Yes, the "other guy" takes the hit not the guy who has made more forward progress. If B was an innocent accessor of the data, it retires its request--this generally takes 100-odd cycles, allowing A to complete
    the event by the time the innocent request shows up again.

    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.

    Instead of some contrived back-off policy--at the failure point one can
    read the WHY register. 0 indicates success; negative indicates spurious, positive indicates how far down the line of requestors YOU happen to be.
    So, if you are going after a unit of work, you march down the queue WHY
    units and then YOU are guaranteed that YOU are the only one after that
    unit of work.
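
A hedged C sketch of that WHY-register idiom; esmWHY(), the queue-walking
helper, and the claim function are hypothetical names standing in for the
real interface:

struct unit;                        /* opaque unit of work                       */
extern long esmWHY(void);           /* 0 success; <0 spurious; >0 queue position */
extern struct unit *step_down(struct unit *u, long n);
extern int try_claim_atomic(struct unit *u);   /* the ATOMIC event itself */

struct unit *claim_work(struct unit *head)
{
    for (;;) {
        struct unit *u = head;
        (void)try_claim_atomic(u);  /* run the ATOMIC event              */
        long why = esmWHY();        /* read at the control/failure point */
        if (why == 0)
            return u;               /* event succeeded                   */
        if (why > 0)                /* march down the queue WHY units:   */
            head = step_down(head, why);  /* you are then the only one
                                             after that unit of work     */
        /* why < 0: spurious failure -- simply retry */
    }
}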


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Dec 13 11:49:46 2025
    From Newsgroup: comp.arch

    On 12/13/2025 11:03 AM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing
    executes as
    a unit. It would seem to me that the address range(s) needing to be >>>>>> locked would have to be supplied throughout the system, including
    across
    buffers and bus bridges. It would have to go to the memory coherence >>>>>> point. Otherwise, some other device using a bridge could update the >>>>>> same
    address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.-a The >>>>> ESM doesn't *prevent* interference, but it *detect* interference.-a Thus >>>>> nothing is required of other cores, no locks, etc.-a If they write to a >>>>> "protected" location, the write is allowed, but the core in the ESM is >>>>> notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor" >>>> (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm).-a The ARMv8 monitors an implementation defined >>>> range surrounding the target address and the store will fail if any other >>>> agent has modified any byte within the exclusive range.

    Any mutation to the reservation granule?

    I forget whether a load from the reservation granule would cause an
    LL/SC to fail; I know a store would. False sharing in poorly written
    programs would cause it to occur, with the LL/SC experiencing
    livelock. This was back in my PPC days.

    A LD to the granule would cause loss of write permission, causing a
    long delay to perform the SC and greatly increasing the probability
    of interference.

    So, you need to create a rule: if you program for my system, you MUST
    make sure that everything is properly aligned and padded. Been there,
    done that. Now, think of nefarious agents... I was able to damage a
    simple strong-CAS loop with other threads mutating the cache line on
    purpose, as a stress test... CAS would start hitting higher and
    higher failure rates, and finally hit the bus to ensure some sort of
    forward progress.
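
    A portable C++ sketch of that kind of stress test (my reconstruction,
    not the original code): a "nefarious" thread hammers a neighbouring
    field in the same cache line while the main thread runs a CAS loop.
    On LL/SC-based hardware the weak CAS can fail spuriously under that
    traffic; on x86 you will mostly see throughput loss instead:

    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    struct SharedLine {                  // deliberately NOT padded apart
        std::atomic<uint64_t> target{0};
        std::atomic<uint64_t> noise{0};  // likely same cache line as target
    };

    int main() {
        SharedLine line;
        std::atomic<bool> stop{false};
        uint64_t retries = 0;

        std::thread nefarious([&] {      // mutates the line on purpose
            while (!stop.load(std::memory_order_relaxed))
                line.noise.fetch_add(1, std::memory_order_relaxed);
        });

        for (int i = 0; i < 1000000; ++i) {
            uint64_t old = line.target.load(std::memory_order_relaxed);
            // weak CAS may fail spuriously on LL/SC machines when the
            // line is disturbed; count every retry as interference
            while (!line.target.compare_exchange_weak(old, old + 1))
                ++retries;
        }
        stop.store(true, std::memory_order_relaxed);
        nefarious.join();
        std::printf("CAS retries under interference: %llu\n",
                    (unsigned long long)retries);
        return 0;
    }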
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Dec 13 11:46:17 2025
    From Newsgroup: comp.arch

    On 12/13/2025 11:12 AM, MitchAlsup wrote:

    <snip>

    Instead of some contrived back-off policy--at the failure point one
    can read the WHY register: 0 indicates success; negative indicates a
    spurious failure; positive indicates how far down the line of
    requestors YOU happen to be. So, if you are going after a unit of
    work, you march down the queue WHY units, and then YOU are guaranteed
    that YOU are the only one after that unit of work.

    Step one: make sure that a failure means another thread made
    progress. Strong CAS does this. Don't let it spuriously fail where
    nothing makes progress... ;^o

    Oh my, we got a load on the reservation granule: abort all LL/SC in
    progress wrt that granule. Of course this assumes that the user who
    created the program gets things right. For LL/SC on the PPC it
    definitely helps when things are aligned and padded out to a
    reservation granule, not just an L2 cache line. That helps mitigate
    false sharing causing livelock.

    Even with weak CAS, which is akin to LL/SC: how sensitive is that
    reservation granule? Can a simple load cause a failure?
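
    The align-and-pad rule above might look like this in C++; the
    128-byte granule size is an assumption--substitute whatever size
    your target documents:

    #include <atomic>
    #include <cstddef>

    constexpr std::size_t kGranule = 128;  // assumed reservation-granule size

    struct alignas(kGranule) PaddedCounter {
        std::atomic<long> value{0};
        // keep any neighbour out of this granule
        char pad[kGranule - sizeof(std::atomic<long>)];
    };

    static_assert(sizeof(PaddedCounter) == kGranule, "padding is wrong");

    // Each element owns a whole granule, so an LL/SC (or CAS) on one
    // cannot be disturbed by ordinary stores to another.
    PaddedCounter counters[4];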
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 21:58:07 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/13/2025 11:12 AM, MitchAlsup wrote:

    <snip>

    Step one: make sure that a failure means another thread made
    progress. Strong CAS does this. Don't let it spuriously fail where
    nothing makes progress... ;^o

    Absolutely!

    WHY is only valid in "slow and methodical" mode, which has strong
    guarantees of forward progress--at least 1 thread is making forward
    progress in S&M.

    Spurious has to do with things like "system arbiter buffer overflow"
    and is not related to exceptions or interrupts.

    Oh my, we got a load on the reservation granule: abort all LL/SC in
    progress wrt that granule. Of course this assumes that the user who
    created the program gets things right.

    This is why I created NaK in the cache coherence protocol--to strengthen
    the guarantee of forward progress.

    For LL/SC on the PPC it definitely helps when things are aligned and
    padded out to a reservation granule, not just an L2 cache line. That
    helps mitigate false sharing causing livelock.

    Even with weak CAS, which is akin to LL/SC: how sensitive is that
    reservation granule? Can a simple load cause a failure?

    An innocent LD gets NaKed, causing the innocent thread to waste time
    while allowing the ATOMIC event to make forward progress.

    In my case the reservation granule is a cache line {which is the same
    across the memory hierarchy--but still allows for an
    implementation-defined size}.

    For example:: HBM can deliver 1024 bits (soon 2048 bits) in a single
    beat, so, for main_memory == HBM, it makes sense to align the size of
    the LLC line to the width of HBM (1024 bits = 128 bytes). Once in the
    LLC, you can parcel it out any way your system prescribes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 22:03:16 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/13/2025 11:03 AM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing
    executes as
    a unit. It would seem to me that the address range(s) needing to be >>>>>> locked would have to be supplied throughout the system, including >>>>>> across
    buffers and bus bridges. It would have to go to the memory coherence >>>>>> point. Otherwise, some other device using a bridge could update the >>>>>> same
    address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.-a The >>>>> ESM doesn't *prevent* interference, but it *detect* interference.-a Thus
    nothing is required of other cores, no locks, etc.-a If they write to a >>>>> "protected" location, the write is allowed, but the core in the ESM is >>>>> notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor" >>>> (the basis of the Store-Exclusive/Load-Exclusive instructions, which >>>> mirror the LL/SC paradigm).-a The ARMv8 monitors an implementation defined
    range surrounding the target address and the store will fail if any other
    agent has modified any byte within the exclusive range.

    Any mutation the reservation granule?

    I forgot if a load from the reservation granule would cause a LL/SC to
    fail. I know a store would. False sharing in poorly written programs
    would cause it to occur. LL/SC experiencing live lock. This was back in
    my PPC days.

    A LD to the granule would cause loss of write permission, causing a
    long delay to perform the SC and greatly increasing the probability
    of interference.

    So, you need to create a rule: if you program for my system, you MUST
    make sure that everything is properly aligned and padded. Been there,
    done that. Now, think of nefarious agents... I was able to damage a
    simple strong-CAS loop with other threads mutating the cache line on
    purpose, as a stress test... CAS would start hitting higher and
    higher failure rates, and finally hit the bus to ensure some sort of
    forward progress.

    This is why NaKing the interference works better. The interfering
    agent takes the timing hit, while the ATOMIC event has a higher
    probability of success.

    Also note: esm is not subject to the ABA problem at all--because any
    interrupt or exception causes the event to terminate prior to the
    control transfer.

    And this is ALSO why there is no thread state associated with esm--
    excepting the 16-bit WHY value, which is only set if/when there are
    no {E,I} control transfers.
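
    For contrast, here is the classic ABA hazard on a CAS-based
    (Treiber-style) stack, sketched in C++ and intentionally broken:
    between the load and the CAS, another thread can pop A, pop B, and
    push A back, and the CAS still succeeds on the stale snapshot:

    #include <atomic>

    struct Node { Node* next; };

    std::atomic<Node*> top{nullptr};

    Node* buggy_pop() {
        Node* t = top.load();
        while (t != nullptr) {
            Node* next = t->next;  // snapshot can go stale right here
            // Succeeds whenever top still equals t -- even after an
            // A-B-A recycling, when t->next no longer means what we read.
            if (top.compare_exchange_weak(t, next))
                return t;
        }
        return nullptr;
    }

    Under esm (or an LL/SC that kills the event on any write to the
    line), the intervening pops would terminate the event, so the stale
    snapshot can never be committed.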
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Dec 14 05:13:46 2025
    From Newsgroup: comp.arch

    I am just noticing that the actual physical register name is not
    needed until lookup at the reservation stations. In Qupls4 it can be
    a few clock cycles before the register lookup is done. So, an
    incorrect one could be supplied at the rename stage; it only has to
    be good enough to work out dependencies. Would a sequence-number-based
    register name work (rather than reading a FIFO)? Then it is a matter
    of correcting it later.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Dec 15 12:30:09 2025
    From Newsgroup: comp.arch

    Sure. Of course multi-core systems will not have that hardware guarantee,
    at least not on main memory, for performance reasons.

    SGI's big MIPS supercomputers did, tho. So maybe they could again at
    some point in the future.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Dec 16 19:47:56 2025
    From Newsgroup: comp.arch

    On Wed, 12 Nov 2025 11:47:34 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Tue, 11 Nov 2025 21:34:08 -0600
    BGB <cr88192@gmail.com> wrote:


    Going to/from 128-bit integer adds a few "there be dragons here"
    issues regarding performance.


    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And it helps to
    preserve your sanity.
    There is also a psychological factor at play - your users expect
    division and square root to be slower than other primitive FP
    operations, so they are not disappointed. Possibly they are even
    pleasantly surprised when they find out that the difference in
    throughput between division and multiplication is smaller than the
    factor of 20-30 that they were accustomed to for 'double' on their
    20-year-old Intel and AMD CPUs.


    Today I tested the speed of the gcc implementation of Decimal128
    (BID-encoded, of course) on an Intel Core i7-14700.
    Average time in nsec:
    op       Add  Sub  Mul  Div
    P-Core    33   33   86   76
    E-Core    46   48  121  108

    Counter-intuitively, division is faster than multiplication.
    And both appear much slower than necessary.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Dec 16 17:51:28 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    Today I tested the speed of the gcc implementation of Decimal128
    (BID-encoded, of course) on an Intel Core i7-14700.
    Average time in nsec:
    op       Add  Sub  Mul  Div
    P-Core    33   33   86   76
    E-Core    46   48  121  108

    Counter-intuitively, division is faster than multiplication.
    And both appear much slower than necessary.

    Interesting. Could you provide the benchmark used?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 16 20:43:49 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    I am just noticing that the actual physical register name is not
    needed until lookup at the reservation stations. In Qupls4 it can be
    a few clock cycles before the register lookup is done. So, an
    incorrect one could be supplied at the rename stage; it only has to
    be good enough to work out dependencies. Would a sequence-number-based
    register name work (rather than reading a FIFO)? Then it is a matter
    of correcting it later.


    The renamer gives you 2 pieces of information::
    a) where
    b) when
    There are various implementations that define where and when
    differently, but the important thing is that you KNOW that they
    represent where and when.

    Where is the physical register name (location), which can be in the
    register file(s), the data path, the reorder buffer(s), or the
    instruction stations.

    When has but a few states:: states prior to when a dependent
    instruction can be launched and capture this result, a couple of
    states when the instruction can be launched ..., states after the
    result has landed in the RoB, and a state indicating the value is in
    the register file.

    Where and when interact.

    Given where and when one can organize the instruction queueing, data
    path forwarding, and the pipelining of operands and results.
    -------------------------------------------------------
    Me, personally, I like a physical register file
    a) which is logically indexed for reads,
    b) performs renaming and operand access simultaneously,
    c) which is physically indexed for writes,
    d) is "repaired" with a history table of valid bits.
    e) so, mispredict repair has 0 cycles of latency.
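
    A toy rendering of the where/when split in C++ (my naming, not
    My 66000 or Qupls4 state):

    #include <cstdint>

    // 'when' is a small state machine tracking how close the value is
    // to being readable; 'where' names the location that will hold it.
    enum class When : uint8_t {
        NotReady,     // producer not yet far enough along to forward
        Forwardable,  // result is on the data path: capture it now
        InROB,        // result has landed in the reorder buffer
        InRegFile     // value is at rest in the register file
    };

    struct RenameEntry {
        uint16_t where;  // physical register / RoB slot / station index
        When     when;   // tells a consumer whether to read, snoop, or wait
    };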
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Dec 17 12:02:12 2025
    From Newsgroup: comp.arch

    On Tue, 16 Dec 2025 17:51:28 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    Today I tested the speed of the gcc implementation of Decimal128
    (BID-encoded, of course) on an Intel Core i7-14700.
    Average time in nsec:
    op       Add  Sub  Mul  Div
    P-Core    33   33   86   76
    E-Core    46   48  121  108

    Counter-intuitively, division is faster than multiplication.
    And both appear much slower than necessary.

    Interesting. Could you provide the benchmark used?

    // tb.cpp
    #include <cstdio>
    #include <cstdlib>
    #include <cstdint>
    #include <time.h>
    #include <random>
    #include <algorithm>
    #include <vector>       // was missing; needed for std::vector

    extern "C" {
    void uut(void*, const void*, const void*);
    };

    static inline
    uint64_t umulh(uint64_t a, uint64_t b) {
      return uint64_t(((unsigned __int128)a * b) >> 64);
    }

    int main(int, char**)
    {
      const int N_PAIRS = 1000000;
      const int N_ITER  = 17;
      typedef unsigned __int128 u128;
      std::vector<u128> src(N_PAIRS*2);
      std::mt19937_64 prng(1);
      const unsigned EXP_BIAS  = 6143;
      const unsigned EXP_SHIFT = 113;
      for (int i = 0; i < N_PAIRS*2; ++i) {
        // generate pseudo-random number in range [1e33:1e34-1]
        const uint64_t RNG_LO  = (long long)1e17;
        const uint64_t RNG_HI  = (long long)9e16;
        const uint64_t BASE_HI = (long long)1e16;
        uint64_t lo = umulh(prng(), RNG_LO);           // [0:1e17-1]
        uint64_t hi = umulh(prng(), RNG_HI) + BASE_HI; // [1e16:1e17-1]
        u128 val = (u128)hi*RNG_LO + lo;
        unsigned exp = EXP_BIAS + umulh(prng(), 50) - 25;
        const u128 exp_val = (u128)exp << EXP_SHIFT;
        src[i] = val | exp_val;
      }

      std::vector<long long> dt(N_ITER);
      for (int it = 0; it < N_ITER; ++it) {
        struct timespec t0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        u128 dummy1 = 0;
        const u128* pSrc = src.data();
        for (int i = 0; i < N_PAIRS; ++i) {
          u128 rat;
          uut(&rat, &pSrc[i*2+0], &pSrc[i*2+1]);
          dummy1 ^= rat;
        }
        if (dummy1 == 42)
          printf("Blue Moon\n");
        struct timespec t1;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        dt[it] = (t1.tv_sec - t0.tv_sec)*(long long)(1e9)
               + (long long)t1.tv_nsec - (long long)t0.tv_nsec;
      }
      // find median
      std::nth_element(&dt[0], &dt[N_ITER/2], &dt[N_ITER]);
      long long dt_med = dt[N_ITER/2];
      printf("%.1f nsec\n", (double)dt_med / N_PAIRS);

      return 0;
    }
    // end tb.cpp


    // gcc_dec128add.c
    #include <string.h>

    void uut(void* pRes, const void* pA, const void* pB)
    {
      _Decimal128 a, b, res;
      memcpy(&a, pA, sizeof(a));
      memcpy(&b, pB, sizeof(b));
      res = a + b;
      memcpy(pRes, &res, sizeof(res));
    }
    // end gcc_dec128add.c

    // gcc_dec128sub.c
    #include <string.h>

    void uut(void* pRes, const void* pA, const void* pB)
    {
      _Decimal128 a, b, res;
      memcpy(&a, pA, sizeof(a));
      memcpy(&b, pB, sizeof(b));
      res = a - b;
      memcpy(pRes, &res, sizeof(res));
    }
    // end gcc_dec128sub.c

    // gcc_dec128mul.c
    #include <string.h>

    void uut(void* pRes, const void* pA, const void* pB)
    {
      _Decimal128 a, b, res;
      memcpy(&a, pA, sizeof(a));
      memcpy(&b, pB, sizeof(b));
      res = a * b;
      memcpy(pRes, &res, sizeof(res));
    }
    // end gcc_dec128mul.c

    // gcc_dec128div.c
    #include <string.h>

    void uut(void* pRat, const void* pNum, const void* pDen)
    {
      _Decimal128 den, num, rat;
      memcpy(&num, pNum, sizeof(num));
      memcpy(&den, pDen, sizeof(den));
      rat = num / den;
      memcpy(pRat, &rat, sizeof(rat));
    }
    // end gcc_dec128div.c


    Build script
    COPT="-O2 -Wall -march=haswell -mtune=skylake"
    mkdir -p obj
    mkdir -p out
    g++ -c $COPT tb.cpp -o obj/tb.o
    gcc -c $COPT gcc_dec128add.c -o obj/gcc_dec128add.o
    gcc -c $COPT gcc_dec128sub.c -o obj/gcc_dec128sub.o
    gcc -c $COPT gcc_dec128mul.c -o obj/gcc_dec128mul.o
    gcc -c $COPT gcc_dec128div.c -o obj/gcc_dec128div.o
    g++ -s obj/tb.o obj/gcc_dec128add.o -o out/tst_add.exe
    g++ -s obj/tb.o obj/gcc_dec128sub.o -o out/tst_sub.exe
    g++ -s obj/tb.o obj/gcc_dec128mul.o -o out/tst_mul.exe
    g++ -s obj/tb.o obj/gcc_dec128div.o -o out/tst_div.exe





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Dec 17 13:52:10 2025
    From Newsgroup: comp.arch

    On 12/13/2025 2:03 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    [...]

    Take the following algorithm for a semaphore:

    https://www.haiku-os.org/legacy-docs/benewsletter/Issue1-26.html
    (remember that old one?)

    Okay, on x86/x64 a LOCK XADD makes for a loopless impl. If we are on
    another system and that LOCK XADD is some sort of LL/SC loop, well,
    that does damage to my loopless claim... ;^o

    Also, big deal!, NOTHING inside the LL/SC can/should touch the
    reservation granule of the target, wrt LL/SC!!!
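
    A C++20 rendering of the benaphore from that article (my translation,
    not Be's code); on x86 the fetch_add is exactly the loopless
    LOCK XADD being described:

    #include <atomic>
    #include <semaphore>

    class Benaphore {
        std::atomic<int> count{0};
        std::counting_semaphore<> sem{0};  // starts with no permits
    public:
        void lock() {
            // previous count > 0: someone holds (or waits for) the lock
            if (count.fetch_add(1, std::memory_order_acquire) > 0)
                sem.acquire();
        }
        void unlock() {
            // previous count > 1: at least one waiter to wake
            if (count.fetch_sub(1, std::memory_order_release) > 1)
                sem.release();
        }
    };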

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Dec 18 21:33:10 2025
    From Newsgroup: comp.arch

    Tonight's quandary: can a register file write history be used in the
    reservation stations to reduce the number of register read ports
    required on the register file? And would it be smaller and faster
    than adding register file ports?

    There are a couple of cases:
    1) The operands were ready at time of queue.
    2) The operands were not ready at time of queue. That means there must
    be an outstanding operation that at some point will update the register
    file.

    Operands that are valid at time of queue are stored in the re-order
    buffer temporarily until dispatch. So, it should be possible to update
    the operands in the reservation stations either from the instruction
    dispatch, or later when the value is written to the register file.

    An issue arises between queue and dispatch. If a value was updated in
    the register file between queue and dispatch, then it will not be
    latched by the reservation station, because the station does not have
    the instruction yet. If the reservation station already has the
    instruction waiting, then the register file write port can be used.

    Rather than having more register file read ports, I was thinking of
    having the reservation stations track (snoop) the register file write
    history using a tapped shift register. The only time that must be
    accounted for is between queue and dispatch. So, assuming that
    instructions dispatched within a certain time frame, then using write
    history could work.

    I was thinking a 64 deep shift register with several taps could be used
    to supply written values.

    It is going to be at least two clock cycles between queue and
    dispatch until operands/instructions are updated in the reservation
    stations. So, the first tap would be at shift 3. Then maybe taps at
    6, 12, 24, and 64 clocks.

    There is not usually a huge amount of time between queue and
    dispatch, unless the functional unit is busy. I think the longest
    operation is <80 clocks, so in the worst case a 128-deep shift
    register should work.

    LUTs can be turned into shift registers up to 64 bits deep, IIRC.
    With four 64-bit write ports and a 10-bit register tag, about 300
    LUTs would be required for each shift level.

    The history could be shared between several reservation stations.
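
    A rough behavioural model of the tapped history in C++ (a sketch of
    the idea above, not Qupls4 RTL):

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <optional>

    struct WriteRecord {
        uint16_t tag = 0;      // physical register tag (10 bits above)
        uint64_t value = 0;
        bool     valid = false;
    };

    class WriteHistory {
        std::array<WriteRecord, 128> shift{};  // worst case from above
    public:
        // One register-file write shifted in per clock.
        void clock(const WriteRecord& w) {
            for (std::size_t i = shift.size() - 1; i > 0; --i)
                shift[i] = shift[i - 1];
            shift[0] = w;
        }
        // A station searches only the fixed taps, not all 128 entries,
        // so only writes that landed a tap-distance ago are visible.
        std::optional<uint64_t> lookup(uint16_t tag) const {
            for (std::size_t t : {3u, 6u, 12u, 24u, 64u})
                if (shift[t].valid && shift[t].tag == tag)
                    return shift[t].value;
            return std::nullopt;
        }
    };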



    --- Synchronet 3.21a-Linux NewsLink 1.2