• Re: Interrupt enable down-count

    From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 19:23:03 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/29/2025 6:29 AM, Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

Complex...
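
For concreteness, the adaptive down-count described above might look
roughly like this in C (a sketch only; the names, the cap, and the
doubling policy are hypothetical, not the actual Qupls4 logic):

    #include <stdbool.h>

    #define MIN_DEFER 10     /* minimum down-count, in clocks (per the post) */
    #define MAX_DEFER 1000   /* hypothetical cap: beyond this, a stuck IRQ   */

    static unsigned base_defer = MIN_DEFER; /* escalates while thrashing */
    static unsigned defer_cnt;              /* current down-count        */

    /* Called once per clock in which the front end advances. */
    bool irq_may_be_accepted(bool irq_pending, bool isr_was_flushed,
                             bool ei_executed)
    {
        if (isr_was_flushed) {            /* looping on ISR fetch: back  */
            if (base_defer < MAX_DEFER)   /* off for longer next time    */
                base_defer *= 2;
            defer_cnt = base_defer;       /* restart the deferral window */
        }
        if (ei_executed) {
            base_defer = MIN_DEFER;       /* EI resets to the minimum    */
            defer_cnt = 0;
        }
        if (defer_cnt > 0) {
            defer_cnt--;
            return false;                 /* still deferring acceptance  */
        }
        return irq_pending;
    }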

    A simple alternative that I have seen is to have an instruction that
    enables interrupts and jumps to somewhere, probably either the
    interrupted code or the dispatcher that might do a full context switch.
    The ISR would issue this instruction when it has saved everything that
    is necessary to handle the interrupt and thus could be interrupted
again. This minimizes the time interrupts are locked out without the
    need for an arbitrary timer, etc.

    Another alternative is to allow ISRs to be interrupted by ISRs of higher priority. All you need here is a clean and precise definition of priority
    and when said priority gets associated with any given interrupt.

    My 66000 goes so far as to never need to disable interrupts because all interrupts of the same or lower priority are automatically disabled by
    the priority of the current ISR/running-thread. That is, one arrives
    at the ISR with interrupts enabled and in a reentrant state with the
    priority given by the I/O MMU when device sent ISR message to MSI-X
    queue.

    If/when an ISR needs to be sure it is not interrupted, it can change
    priority in 1 instruction to "highest" and have the system not allow
    the I/O MMU to associate said "exclusive" priority with any device
    interrupt. When ISR returns, priority reverts to priority at the time
    the interrupt was taken. {No need to back down on priority} This only
    requires that there are enough priorities to spare one exclusively to
    the system.
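
The acceptance rule this implies is just a priority compare; as a
hedged C sketch of the idea (illustrative names only, not the actual
My 66000 mechanism):

    enum { NPRIO = 64 };            /* 64 priority levels             */
    enum { PRIO_EXCLUSIVE = 63 };   /* reserved: the I/O MMU never    */
                                    /* hands this level to any device */
    static unsigned cur_prio;       /* priority of the running        */
                                    /* thread or ISR                  */

    /* An interrupt is taken only if strictly higher priority than
       whatever is running; no DI/EI instructions are involved.      */
    int interrupt_taken(unsigned irq_prio)
    {
        if (irq_prio <= cur_prio)
            return 0;                 /* same or lower: stays queued  */
        unsigned saved = cur_prio;    /* hardware-saved on entry      */
        cur_prio = irq_prio;          /* ISR runs at IRQ's priority,  */
                                      /* interrupts still enabled     */
        /* ... ISR body; it may set cur_prio = PRIO_EXCLUSIVE in one
           instruction to become uninterruptible ... */
        cur_prio = saved;             /* return-from-interrupt        */
        return 1;                     /* reverts; no back-down step   */
    }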

    EricP has argued that 8-I/O priority levels are enough. I argue that
    64 priority levels are enough for {Guest OS, Host OS, HyperVisor}
    to each have their own somewhat-coordinated structure of priorities.
    AND further I argue that given one is designing a 64-bit machine,
that 64 priority levels are de rigueur.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 15:42:13 2025
    From Newsgroup: comp.arch

    On 2025-11-29 2:05 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
    delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred
    because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be
    committed because the IRQs got disabled in the meantime. If the CPU were
    allowed to accept another IRQ right away, it could get stuck in a loop
    flushing the pipeline and reloading with the ISR routine code instead of
    progressing through the code where IRQs were disabled.

The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instructions in the pipe
    are allowed to retire (apace) and new instructions are inserted from
    the interrupt service point.

    That is how Qupls is working too. The issue is what happens when the instruction in the pipe before the ISR disables the interrupt. Then the
    ISR instructions need to be flushed.

    As long as the instructions "IN" the pipe
can deliver their results to their registers, and update µArchitectural state they "own", there is no reason to flush--AND--no corresponding
    reason to delay "taking" the interrupt.

    That is the usual case for Qupls too when there is an interrupt.

At the µArchitectural level, you, the designer, see both the front
    and the end of the pipeline, you can change what goes in the front
    and allow what was already in the pipe to come out the back. This
    requires dragging a small amount of information down the pipe, much
    like multi-threaded CPUs.

    Yes, the IRQ info is being dragged down the pipe.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU
    would advance in 40 instruction burps. Alternating between fetching ISR
    instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is looping
    around fetching ISR instructions. The down-count would be reset to the
    minimum again once an interrupt enable instruction is executed.

Complex...

    Make the problem "go away". You will be happier in the end.

    The interrupt mask is set at fetch time to disable lower priority
    interrupts. I suppose disabling of interrupts by the OS could simply be ignored. The interrupt could only be taken if it is a higher priority
    than the current level.

    I had thought the OS might have good reason to disable interrupts. But
    maybe I am making things too complex.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Nov 29 16:10:45 2025
    From Newsgroup: comp.arch

    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

Complex...


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the
    pipeline to drain the old stream before accepting the interrupt and
redirecting Fetch to its handler. That way if there are any interrupt
enable or disable instructions, or branch mispredicts, or pending exceptions in-flight, they are all allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and
    expensive real quick.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 22:17:36 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 2:05 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
    delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred
    because interrupts got disabled by an instruction in the pipeline. I
guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop
flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instructions in the pipe
    are allowed to retire (apace) and new instructions are inserted from
    the interrupt service point.

    That is how Qupls is working too. The issue is what happens when the instruction in the pipe before the ISR disables the interrupt. Then the
    ISR instructions need to be flushed.

    As a general rule of thumb:: an instruction is not "performed" until
    after it retires. {when you cannot undo its deeds}

    Consider the case where you redirect the front of the pipe to an ISR and
    an instruction already in the pipe raises an exception. Here, what I do
    {and have done in the past} is to not retire instructions after the
    exception, so the ISR is not delayed and IP ends up pointing at the
    excepting instruction.

    Since you started ISR before you retired DI, you can treat DI as an
    exception. {DI after ISR control transfer}. If, on the other hand,
    you perform DI at the front of the pipe, you don't "accept" the ISR
    until EI.

    As long as the instructions "IN" the pipe
can deliver their results to their registers, and update µArchitectural state they "own", there is no reason to flush--AND--no corresponding
    reason to delay "taking" the interrupt.

    That is the usual case for Qupls too when there is an interrupt.

At the µArchitectural level, you, the designer, see both the front
    and the end of the pipeline, you can change what goes in the front
    and allow what was already in the pipe to come out the back. This
    requires dragging a small amount of information down the pipe, much
    like multi-threaded CPUs.

    Yes, the IRQ info is being dragged down the pipe.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU
    would advance in 40 instruction burps. Alternating between fetching ISR
    instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is looping
    around fetching ISR instructions. The down-count would be reset to the
    minimum again once an interrupt enable instruction is executed.

Complex...

    Make the problem "go away". You will be happier in the end.

    The interrupt mask is set at fetch time to disable lower priority interrupts. I suppose disabling of interrupts by the OS could simply be ignored. The interrupt could only be taken if it is a higher priority
    than the current level.

    I had thought the OS might have good reason to disable interrupts. But
    maybe I am making things too complex.


The OS DOES have good reasons to DI "every once in a while"; IIRC from my conversations with EricP, these are short sequences the OS needs
    to be ATOMIC across all OS threads--and almost always without the
    possibility that the ATOMIC event fails {which can happen in user code}.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 22:26:21 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

Complex...


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    Yes, exactly::

    Consider a GBOoO processor that performs a LD R9,[deviceCR].

    a) all earlier memory references have to be seen globally
   before this LD can be seen globally. {dozens of cycles}
    b) this LD has to arrive at HostBridge. {dozens of cycles}
c) HostBridge sends request down PCIe {hundreds of cycles}
    d) device responds to LD {handful of cycles}
    e) PCIe transports response to HB {hundreds of cycles}
    f) HB transfers response to requestor {dozens of cycles}
    g) CPU is allowed to re-enter OoO {handful of cycles}

    Accesses to devices need to have most of the properties of
    "Sequential Consistency" as defined by Lamport.

    Now, several LDs [DeviceCRs] can be seen globally and in order
before the first (or all) responses return, but you are going to see all
that latency in the pipeline; OoO memory requests are not one
    of them.
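
Plugging rough numbers into a) through g) above (taking "dozens" as
roughly 50, "hundreds" as roughly 300, and "handful" as roughly 5;
illustrative values only) gives:

    50 + 50 + 300 + 5 + 300 + 50 + 5  ~=  760 cycles

for a single device-register load, far beyond any 10-cycle down-count.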

I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are any interrupt
enable or disable instructions, or branch mispredicts, or pending exceptions in-flight, they are all allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 17:45:17 2025
    From Newsgroup: comp.arch

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

Complex...


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

The down-count counts down only when the front-end of the pipeline advances, so instructions are sure to be loaded.

I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are any interrupt
enable or disable instructions, or branch mispredicts, or pending
exceptions in-flight, they are all allowed to finish and the state to
settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    The base down count increases every time the IRQ is found at the commit
    stage. If the base down count is too large (stuck interrupt) then an
exception is processed; for instance, if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

Treating the DI as an exception, as mentioned in another post, would also
    work. It is a matter then of flushing the instructions between the DI
    and ISR.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 23:14:23 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

Complex...


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

The down-count counts down only when the front-end of the pipeline advances, so instructions are sure to be loaded.

I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are any interrupt enable or disable instructions, or branch mispredicts, or pending exceptions
in-flight, they are all allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    The base down count increases every time the IRQ is found at the commit stage. If the base down count is too large (stuck interrupt) then an exception is processed. For instance if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

Treating the DI as an exception, as mentioned in another post, would also work. It is a matter then of flushing the instructions between the DI
    and ISR.

Which is no different than flushing instructions after a mispredicted branch.
--- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 29 23:37:21 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    (Looking at your
    code, it also does not seem to be self-sufficient, at least the
    numerous SKIP4 statements require something else).

    If you want to assemble the resulting .S file, it's assembled once
    with

    -DSKIP4= -Dgforth_engine2=gforth_engine

    and once with

    -DSKIP4=".skip 4"

    (on Linux-GNU AMD64, the .skip assembler directive is autoconfigured
    and may be different on other platforms).

    My assumption is that the control flow is confusing gcc.

    My guess is the same.

    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

void foo(unsigned long u1)
{
    long u3;
    u1 = u1 / 10;
    u3 = u1 % 10;
    bar(u1, u3);
}

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 02:17:10 2025
    From Newsgroup: comp.arch

    On 2025-11-29 6:14 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

Tradeoffs...

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

Complex...


You are using this timer to predict the delay for draining the pipeline.
It would only take a read of a slow IO device register to exceed it.

The down-count counts down only when the front-end of the pipeline
advances, so instructions are sure to be loaded.

I was thinking a simple and cheap way would be to use a variation of the
single-step mechanism. An interrupt request would cause Decode to emit a
special uOp with the single-step flag set and then stall, to allow the
pipeline to drain the old stream before accepting the interrupt and
redirecting Fetch to its handler. That way if there are any interrupt
enable or disable instructions, or branch mispredicts, or pending
exceptions in-flight, they are all allowed to finish and the state to
settle down.

    Pipelining interrupt delivery looks possible but gets complicated and
    expensive real quick.



    The base down count increases every time the IRQ is found at the commit
    stage. If the base down count is too large (stuck interrupt) then an
    exception is processed. For instance if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

Treating the DI as an exception, as mentioned in another post, would also
    work. It is a matter then of flushing the instructions between the DI
    and ISR.

    Which is no different than flushing instructions after a mispredicted branch.

Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot simpler.
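
A sketch of what that periodic insertion might look like in C (the
names, BOI encoding, and interval are hypothetical; not the actual
Qupls4 translator):

    #define POLL_INTERVAL 16        /* insert a poll every 16 uOps       */

    extern unsigned decode_to_uop(unsigned insn);  /* normal translation */
    extern void     emit(unsigned uop);
    #define UOP_BOI 0xB01u          /* hypothetical BOI encoding: branch */
                                    /* taken only if an IRQ is pending   */
    static unsigned since_poll;

    void translate(unsigned insn)
    {
        if (++since_poll >= POLL_INTERVAL) {
            emit(UOP_BOI);          /* polling branch to the dispatcher  */
            since_poll = 0;
        }
        emit(decode_to_uop(insn));
    }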


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 10:10:00 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:

Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot simpler.

    What is the expected delay until an interrupt is delivered?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 06:29:55 2025
    From Newsgroup: comp.arch

    On 2025-11-30 5:10 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot
    simpler.

    What is the expected delay until an interrupt is delivered?

I set the timing to 16 clocks, which is about 64 (or more) instructions.
Did not want to go much over 1% of the number of instructions executed.
    Not every instruction inserts a poll, so sometimes a poll is lacking.
    IDK how well it will work. Making it an instruction means it might also
    be used by software.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 06:41:52 2025
    From Newsgroup: comp.arch

    On 2025-11-30 6:29 a.m., Robert Finch wrote:
    On 2025-11-30 5:10 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot >>> simpler.

    What is the expected delay until an interrupt is delivered?

I set the timing to 16 clocks, which is about 64 (or more) instructions.
Did not want to go much over 1% of the number of instructions executed.
    Not every instruction inserts a poll, so sometimes a poll is lacking.
    IDK how well it will work. Making it an instruction means it might also
    be used by software.

    Might be able to modify the branch predictor to predict the interrupt.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 14:14:16 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

void foo(unsigned long u1)
{
    long u3;
    u1 = u1 / 10;
    u3 = u1 % 10;
    bar(u1, u3);
}

    Assigning to u1 changed the meaning, as Andrew Pinski noted; so the
    jury is still out on what the actual problem is.

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .

    and a revised one at
    <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122919>

    (The announced attachment is not there yet.)

    The latter case is interesting, because real_ca and spc became global,
    and symbols[] is still local, and no assignment to real_ca happens
    inside foo().

    So one way the compiler could interpret this code might be that
    real_ca gets one of the labels whose address is taken in some way
    unknown to the compiler; the it has to preserve all the code reachable
    through the labels.

    Another way to interpret this code would be that symbols is not used,
    so it is dead and can be optimized away. Consequently, none of the
    addresses of any of the labels is ever taken, and the labels are not
    used by direct jumps, either, so all the code reachable only by
    jumping to the labels is unreachable and can be optimized away.

    Apparently gcc takes the latter attitude if there are <=100 labels in
    symbols, but maybe something like the former attitude if there are
more than 100 labels in symbols. This may appear strange, but gcc generally
    tends to produce good code in relatively short time for Gforth (while
    clang generates horribly slow code and takes extremely long in doing
    so), and my guess is that having such a cutoff on doing the usual
    analysis has something to do with gcc's superior performance.
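
For readers who have not seen the construct under discussion: Gforth's
engine dispatches through an array of gcc label addresses (the
labels-as-values extension), roughly like this minimal sketch (not
Gforth's actual code):

    #include <stdio.h>

    /* symbols[] holds label addresses; if the compiler proves it dead,
       everything reachable only through it can be removed too.         */
    int run(const int *prog)
    {
        static void *symbols[] = { &&op_inc, &&op_print, &&op_halt };
        int acc = 0;
        goto *symbols[*prog++];                /* dispatch first insn   */
    op_inc:   acc++;               goto *symbols[*prog++];
    op_print: printf("%d\n", acc); goto *symbols[*prog++];
    op_halt:  return acc;
    }

    int main(void)
    {
        const int prog[] = { 0, 0, 1, 2 };     /* inc, inc, print, halt */
        return run(prog) != 2;                 /* prints 2, exits 0     */
    }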

    I guess that if you treat symbols like in the original code (i.e.,
    return it in one case), you can reduce the labels more without the
    compiler optimizing everything away. I don't dare to predict when the
    compiler will stop generating the inefficient variant. Maybe it has
    to do with the cutoff.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 15:47:03 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

    void foo(unsigned long u1)
    {
    long u3;
    u1 = u1 / 10;
    u3 = u1 % 10;
    bar(u1,u3);
    }

    Assigning to u1 changed the meaning, as Andrew Pinski noted;

    An example which could be tested at run-time to verify correct
    operation was not provided, so I had to do without.

    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    cvise uses a user-supplied "interestingness script" which returns
    0 if the feature in question is there, or non-zero if it is
    not there. For relatively simple cases like an ICE, it
    can have two steps: a) check that compilation fails, and b)
check that the error message is output.

    Looking for a missed optimization is more difficult, especially
    in the absence of a run-time test. It is then necessary to

    a) check the source code that the interesting code is still there

    b) compile the code (exiting if this fails)

c) verify that the generated assembly still does the same

    a) and c) are very easy to get wrong, and there were numerous
    false reductions where cvise came up with something that the
    scripts didn't catch.


    so the
    jury is still out on what the actual problem is.

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .

    and a revised one at
    <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122919>

    (The announced attachment is not there yet.)

    The latter case is interesting, because real_ca and spc became global,
    and symbols[] is still local, and no assignment to real_ca happens
    inside foo().

    That is what cvise does. It sometimes reduces code more than a
    human would.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 15:18:21 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier? If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    Processor pipelines are not the basics of what a CS graduate is doing.
    They are an implementation detail in computer engineering.

    Which affect the performance of the software created by the
    software engineer (CS graduate).

    By a constant factor; and the software creator does not need to know
    that the CPU that executes instructions at 2 CPI (486) instead of at
    10 CPI (VAX-11/780) is pipelined; and these days both the 486 and the
    VAX are irrelevant to software creators.

    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

movabs $0xcccccccccccccccd,%rax      movabs $0xcccccccccccccccd,%rsi
sub    $0x8,%r13                     mov    %r8,%rax
mul    %r8                           mov    %r8,%rcx
mov    %rdx,%rax                     mul    %rsi
shr    $0x3,%rax                     shr    $0x3,%rdx
lea    (%rax,%rax,4),%rdx            lea    (%rdx,%rdx,4),%rax
add    %rdx,%rdx                     add    %rax,%rax
sub    %rdx,%r8                      sub    %rax,%r8
mov    %r8,0x8(%r13)                 mov    %rcx,%rax
mov    %rax,%r8                      mul    %rsi
                                     shr    $0x3,%rdx
                                     mov    %rdx,%r9

    The major difference is that in the left context, u3 is stored into
    memory (at 0x8(%r13)), while in the right context, it stays in a
register. In the left context, gcc managed to base its computation of u1%10 on the result of u1/10; in the right context, gcc first computes u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.
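
(What both columns implement is the standard reciprocal-multiplication
rewrite of unsigned division by 10; in C, using GCC's unsigned __int128
for the multiply-high:)

    /* u1/10 via multiply-high with ceil(2^67 / 10), then the remainder
       as u1 - 10*q (the lea/add/sub in the asm above).                 */
    unsigned long divmod10(unsigned long u1, unsigned long *rem)
    {
        unsigned long q = (unsigned long)
            (((unsigned __int128)u1 * 0xcccccccccccccccdUL) >> 64) >> 3;
        *rem = u1 - q * 10;
        return q;
    }

The left-hand column shares the single mul between quotient and
remainder; the right-hand column performs it twice.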

    Sort of emphasizes that programmers need to understand the
    underlying hardware.

I am the programmer of the code shown above. In what way would better knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    What were u1, u3 and u4 declared as?

    unsigned long (on that platform).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 16:39:41 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the
    surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    The result of compiling this with

    gcc -I./../arch/amd64 -I. -Wall -g -O2 -fomit-frame-pointer -pthread -DHAVE_CONFIG_H -DFORCE_LL -DFORCE_REG -DDEFAULTPATH='".:/usr/local/lib/gforth/site-forth:/usr/local/lib/gforth/0.7.9_20251119:/usr/local/share/gforth/0.7.9_20251119:/usr/share/gforth/site-forth:/usr/local/share/gforth/site-forth"' -c -fno-gcse -fcaller-saves -fno-defer-pop -fno-inline -fwrapv -fno-strict-aliasing -fno-cse-follow-jumps -fno-reorder-blocks -fno-reorder-blocks-and-partition -fno-toplevel-reorder -falign-labels=1 -falign-loops=1 -falign-jumps=1 -fno-delete-null-pointer-checks -fcf-protection=none -fno-tree-vectorize -fno-lto -pthread -DENGINE=2 -fPIC -DPIC -o libengine-fast2-ll-reg-red.S -S engine-fast-red.i

    can be found at

    http://www.complang.tuwien.ac.at/anton/tmp/libengine-fast2-ll-reg-red.S

    Now the multiplier is permanently allocated to %r11, so searching for
    it won't help. However, if you search for "mulq", you will find the
    code generated for the three instances of the VM instruction. The
    first is optimized well, the second exhibits two mulqs and two shrqs,
    the third exhibits just one mulq, but two shrqs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 18:59:15 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    Do you have an example which tests the codepath taken for the
    offending piece of code, so it is possible to further reduce this
    case automatically? The example is still quite big (>13000 lines).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 30 19:33:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded
    code.
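
The canonical illustration is the store-buffering litmus test; in C11
atomics (a textbook example, not tied to any poster's code), the
outcome r1 == r2 == 0 is permitted on weakly ordered hardware unless
the fences are present:

    #include <stdatomic.h>
    #include <threads.h>

    atomic_int x, y;
    int r1, r2;

    int t1(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* the Fence */
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return 0;
    }

    int t2(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* the Fence */
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return 0;
    }

    int main(void)
    {
        thrd_t a, b;
        thrd_create(&a, t1, 0);
        thrd_create(&b, t2, 0);
        thrd_join(a, 0);
        thrd_join(b, 0);
        return !(r1 | r2);  /* with the fences, at least one load sees 1 */
    }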

Processor pipelines are not the basics of what a CS graduate is doing. They are an implementation detail in computer engineering.

    Which affect the performance of the software created by the
    software engineer (CS graduate).

    By a constant factor; and the software creator does not need to know
    that the CPU that executes instructions at 2 CPI (486) instead of at
    10 CPI (VAX-11/780) is pipelined; and these days both the 486 and the
    VAX are irrelevant to software creators.

    I do not believe that the word "the" in front of x86 or VAX is proper.

    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

movabs $0xcccccccccccccccd,%rax      movabs $0xcccccccccccccccd,%rsi
sub    $0x8,%r13                     mov    %r8,%rax
mul    %r8                           mov    %r8,%rcx
mov    %rdx,%rax                     mul    %rsi
shr    $0x3,%rax                     shr    $0x3,%rdx
lea    (%rax,%rax,4),%rdx            lea    (%rdx,%rdx,4),%rax
add    %rdx,%rdx                     add    %rax,%rax
sub    %rdx,%r8                      sub    %rax,%r8
mov    %r8,0x8(%r13)                 mov    %rcx,%rax
mov    %rax,%r8                      mul    %rsi
                                     shr    $0x3,%rdx
                                     mov    %rdx,%r9

The major difference is that in the left context, u3 is stored into
memory (at 0x8(%r13)), while in the right context, it stays in a
register. In the left context, gcc managed to base its computation of
u1%10 on the result of u1/10; in the right context, gcc first computes
u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.

    Sort of emphasizes that programmers need to understand the
    underlying hardware.

I am the programmer of the code shown above. In what way would better knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    Reading and thinking about the asm-code and running the various code
    sequences enough times that you can measure which is better and which
    is worse. That is the engineering part of software Engineering.

    What were u1, u3 and u4 declared as?

    unsigned long (on that platform).

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Sun Nov 30 22:38:39 2025
    From Newsgroup: comp.arch

    On 2025-11-30 21:33, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded
    code.

    That is an aspect of processor architecture that is relevant to some programmers, but not to the large number of programmers who use
    languages or operating systems with built-in multi-threading and safe inter-thread communication primitives and services for input/output.

    I am the programmer of the code shown above. In what way would better
knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    Reading and thinking about the asm-code and running the various code sequences enough times that you can measure which is better and which
    is worse. That is the engineering part of software Engineering.

    That is a very niche part of software (performance) engineering. Speed
    of execution is only one of many "goodness" dimensions of a piece of SW, others including correctness, reliability, security, portability, maintainability, and so on. All dimensions need and depend on systematic engineering, although some dimensions cannot be quantified as easily as execution speed.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 22:11:26 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the
    surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    Do you have an example which tests the codepath taken for the
    offending piece of code,

    Not easily.

    so it is possible to further reduce this
    case automatically? The example is still quite big (>13000 lines).

    Most of which is coming from including stdlib.h etc. The actual code
    of the gforth_engine function in that example is 264 lines, many of
    which are empty or line number indicators.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 22:17:19 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    Good to know, but does not answer the question.

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
That is how and when they need to insert Fences in their multi-threaded code.

    And the relevance of pipelined processors for that issue is what?

    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude. If you implement per-CPU caches and multiple memory controllers as shoddily
    as possible while providing features for programs to slow themselves
down heavily in order to get memory-ordering guarantees, then you get
    a weak memory model; slightly less shoddy, and you get a "strong" memory
    model. Processor pipelines have no relevance here.

    And, as Niklas Holsti observed, dealing with memory-ordering
    shenanigans is something that a few specialists do; no need for others
    to know about the memory model, except that common CPUs unfortunately
    do not implement sequential consistency.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 00:12:15 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    Good to know, but does not answer the question.

    {Without contradicting that Wallace got on the correct track first}
    Wallace gets the credit that should rightly go to Dadda.

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

How execution order disturbs things like program order and memory order. That is how and when they need to insert Fences in their multi-threaded code.

    And the relevance of pipelined processors for that issue is what?

    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    If you implement per-CPU caches and multiple memory controllers as shoddily
    as possible while providing features for programs to slow themselves
down heavily in order to get memory-ordering guarantees, then you get
    a weak memory model; slightly less shoddy, and you get a "strong" memory model. Processor pipelines have no relevance here.

    It is the pipelines themselves (along with the SuperComputer attitude)
    that gives rise to the weak memory models.

    And, as Niklas Holsti observed, dealing with memory-ordering
    shenanigans is something that a few specialists do; no need for others
    to know about the memory model, except that common CPUs unfortunately
    do not implement sequential consistency.

    Because of the SuperComputer attitude ! {Performance first}

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Dec 1 07:56:37 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the
    slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).
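
As a minimal illustration of the point (a C11 sketch, not from the
original post; the names are invented): under a weak model the
release/acquire pair below costs real barrier work even when no second
thread ever touches the data; on SC hardware the same annotations could
retire as plain, already-ordered loads and stores.

    #include <stdatomic.h>

    atomic_int data;   /* invented shared payload */
    atomic_int ready;  /* invented publish flag   */

    void producer(void)
    {
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        /* on weak hardware this release emits a real barrier even if
           no consumer exists; on SC hardware it could be a noop      */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        if (atomic_load_explicit(&ready, memory_order_acquire))
            return atomic_load_explicit(&data, memory_order_relaxed);
        return -1;  /* not published yet */
    }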

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with   without
    trapb  trapb
    9.56   11.6     AlphaPC164LX 600MHz 21164A
    19.7   20.0     Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    Maybe they could free their hands by designing for a
    sequential-consistency interface, just like designing for a simple sequential-execution model without EPIC features freed their hands to
    design microarchitectural features that allowed ordinary code to
    utilize wider and wider OoO cores profitably.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Dec 1 13:23:22 2025
    From Newsgroup: comp.arch

    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?

More so, where does it simplify over ARMv8.1-A, assuming that the programmer
does not try to be too smart, never uses LL/SC, and always uses
8.1-style synchronization instructions with Acquire+Release flags set?

    IMHO, the only simple thing about sequential consistency is simple
    description. Other than that, it simplifies very little. It does not
    magically make lockless multithreaded programming bearable to
    non-genius coders.
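
For concreteness, the style Michael S describes might look like this (a
C11 sketch, not from the post; a compiler can map the RMW below to a
single ARMv8.1 LSE atomic with acquire semantics):

    #include <stdatomic.h>

    static atomic_flag lock_word = ATOMIC_FLAG_INIT;

    static void lock(void)
    {
        /* one swap-with-acquire per attempt; no explicit fences */
        while (atomic_flag_test_and_set_explicit(&lock_word,
                                                 memory_order_acquire))
            ;  /* spin */
    }

    static void unlock(void)
    {
        atomic_flag_clear_explicit(&lock_word, memory_order_release);
    }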


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Dec 1 14:07:34 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.
    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with   without
    trapb  trapb
    9.56   11.6     AlphaPC164LX 600MHz 21164A
    19.7   20.0     Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    The 21264 Hardware Reference Manual says TRAPB (general exception barrier)
    and EXCB (floating point control register barrier) are both NOP's
    internally, are tossed at decode, and don't even take up an
    instruction slot.

    The purpose of the EXCB is to synchronize pipeline access to the
    floating point control and status register with FP operations.
    In the worst case this stalls until the pipeline drains.

    I wonder how much logic it really saved allowing imprecise exceptions
    in the InO 21064 and 21164? Conversely, how much did it cost to deal
    with problems caused by leaving these interlocks off?

    The cores have multiple, parallel pipelines for int, lsq, fadd and fmul. Without exception interlocks, each pipeline only obeys the scoreboard
    rules for when to writeback its result register: WAW and WAR.
    That allows a younger, faster instruction to finish and write its register before an older, slower instruction. If that older instruction then throws
    an exception and does not write its register then we can see the out of
    order register writes.

    For register file writes to be precise in the presence of exceptions
requires each instruction to look ahead at the state of all older
    instructions *in all pipelines*.
    Each uOp can be Unresolved, Resolved_Normal, or Resolved_Exception.
    A writeback can occur if there are no WAW or WAR dependencies,
    and all older uOps are Resolved_Normal.

    Just off the top of my head, in addition to the normal scoreboard,
    a FIFO buffer with a priority selector could be used to look ahead
    at all older uOps and check their status, and allow or stall uOp
    writebacks and ensure registers always appear precise.
    Which really doesn't look that expensive.

    Is there something I missed, or would that FIFO suffice?
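
A sketch of that check (mine, in C; sizes and names are invented): walk
the in-order FIFO of older uOps and allow writeback only when every one
of them has resolved without an exception.

    #define FIFO_SIZE 64   /* invented depth */

    enum status { UNRESOLVED, RESOLVED_NORMAL, RESOLVED_EXCEPTION };

    struct uop { enum status st; /* dest reg, pipe id, ... */ };

    /* 'fifo' holds in-flight uOps oldest-first from 'head'; 'me' is
       the slot of the uOp that wants to write its result register.  */
    static int may_write_back(const struct uop fifo[FIFO_SIZE],
                              int head, int me)
    {
        for (int i = head; i != me; i = (i + 1) % FIFO_SIZE)
            if (fifo[i].st != RESOLVED_NORMAL)
                return 0;  /* stall: an older uOp is unresolved
                              or has faulted                     */
        return 1;          /* precise; WAW/WAR are still handled
                              by the ordinary scoreboard         */
    }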



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 22:50:15 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well,

    Depends on your definition of SC and "performs well", but see below::

    probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    In the case of My 66000, there is a slightly weak memory model
    (Causal consistency) for accesses to DRAM, and there is Sequential
    consistency for ATOMIC stuff and device control registers, and then
    there is strongly ordered for configuration space access, and the
    programmer does not have to do "jack" to get these orderings--
    its all programmed in the PTEs.

    {{There is even a way to make DRAM accesses SC should you want.}}

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with   without
    trapb  trapb
    9.56   11.6     AlphaPC164LX 600MHz 21164A   <- moderate slowdown
    19.7   20.0     Compaq XP1000 500MHz 21264   <- slowdown has disappeared

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    Maybe they could free their hands by designing for a
    sequential-consistency interface, just like designing for a simple sequential-execution model without EPIC features freed their hands to
    design microarchitectural features that allowed ordinary code to
    utilize wider and wider OoO cores profitably.

    That is not the property I was getting at--the property I was getting at
    is that the language model for synchronization can only use 1 memory
    location {TS, TTS, CAS, DCAS, LL, SC} and this fundamentally limits the
    amount of work one can do in a single event, and also fundamentally limits
    what one can "say" about a concurrent data structure.

    Given a certain amount of interference--the fewer ATOMIC things one has
    to do the lower the chance of interference, and the greater the chance
    of success. So, if one could move an element of a CDS from one location
    to another in one ATOMIC event rather than 2 (or 3) then the exponent
    of synchronization overhead goes down, and then one can make statements
    like "and no outside observer can see the CDS without that element present"--which cannot be stated with current models.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 23:03:24 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.
    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and imprecise exceptions, if you compile with trapb, you get slowness and precise exceptions. I then measured SPEC 95 compiled without and with trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264 there was hardly any difference; I believe that trapb is a noop on the 21264. Here's the SPECfp_base95 numbers:

    with   without
    trapb  trapb
    9.56   11.6     AlphaPC164LX 600MHz 21164A
    19.7   20.0     Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    The 21264 Hardware Reference Manual says TRAPB (general exception barrier) and EXCB (floating point control register barrier) are both NOP's
    internally, are tossed at decode, and don't even take up an
    instruction slot.

    The purpose of the EXCB is to synchronize pipeline access to the
    floating point control and status register with FP operations.
    In the worst case this stalls until the pipeline drains.

    I wonder how much logic it really saved allowing imprecise exceptions
    in the InO 21064 and 21164?

    Having done something similar in Mc 88100, I can state that the amount
of logic saved is too small to justify such naïvety.

    Conversely, how much did it cost to deal
    with problems caused by leaving these interlocks off?

    Way toooooo much. The SW delay to get all those things right cost more
    time than HW designers could have possibly saved leaving them out.

    The cores have multiple, parallel pipelines for int, lsq, fadd and fmul. Without exception interlocks, each pipeline only obeys the scoreboard
    rules for when to writeback its result register: WAW and WAR.
    That allows a younger, faster instruction to finish and write its register before an older, slower instruction. If that older instruction then throws
    an exception and does not write its register then we can see the out of
    order register writes.

    For register file writes to be precise in the presence of exceptions
requires each instruction to look ahead at the state of all older
    instructions *in all pipelines*.

    Or you use dead stages in the pipelines so instructions arrive at
RF write ports no earlier than their compatriots. You still have to
    look across all the delay slots for forwarding opportunities.

    Each uOp can be Unresolved, Resolved_Normal, or Resolved_Exception.
    A writeback can occur if there are no WAW or WAR dependencies,
    and all older uOps are Resolved_Normal.

    That is the scoreboard model. The Reservation station has a simpler
model by providing a unique register for each instruction (or µOp).

    Just off the top of my head, in addition to the normal scoreboard,
    a FIFO buffer with a priority selector could be used to look ahead
    at all older uOps and check their status,

    Such a block of logic is called a ReOrder Buffer.

    Given an architectural register file with 16-32 entries, and
    given a reorder buffer of 96+ entries--if you integrate both
    ARF and RoB into a single structure you call it a physical
    register file. A PRF is just a RoB that is big enough never
    to have to migrate registers to the ARF.

    and allow or stall uOp
    writebacks and ensure registers always appear precise.
    Which really doesn't look that expensive.

    Is there something I missed, or would that FIFO suffice?

    If the FiFo is big enough, it works just fine; if you scrimp on
    the FiFo, you will want to play games with orderings to make it
    faster.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Dec 2 07:10:16 2025
    From Newsgroup: comp.arch

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is it does not go backwards to
get the address fetched again from the TLB, meaning no check is made for protection or translation of the address.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    This should only be an issue if an unaligned access crosses a memory
    page boundary.

The instruction causes an alignment fault if a page-boundary crossing is detected.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Dec 2 18:50:12 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    The issue with this is it does not go backwards to
get the address fetched again from the TLB, meaning no check is made for protection or translation of the address.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    Unaligned access on a page boundary is extremely slow on the Core 2
    Duo (IIRC 160 cycles for a store). So don't be shy:-)

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 2 19:55:43 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is it does not go backwards to
get the address fetched again from the TLB, meaning no check is made for protection or translation of the address.

You can determine if an access is misaligned "enough" to warrant two
    trips down the pipe.
    a) crosses cache width
    b) crosses page boundary

Case b ALWAYS needs 2 trips; so the mechanism HAS to be there.
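
In C terms the two tests are just carry-out checks on the low address
bits (a sketch, assuming 64-byte lines and 4 KiB pages):

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_SIZE 64u
    #define PAGE_SIZE 4096u

    static inline bool crosses_line(uint64_t addr, unsigned size)
    {
        return (addr & (LINE_SIZE - 1)) + size > LINE_SIZE;   /* case a */
    }

    static inline bool crosses_page(uint64_t addr, unsigned size)
    {
        /* case b: needs a second TLB lookup, hence a second trip */
        return (addr & (PAGE_SIZE - 1)) + size > PAGE_SIZE;
    }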

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

An AGEN-like adder has 11 gates of delay; you can determine misalignment
in 4 gates of delay.

    This should only be an issue if an unaligned access crosses a memory
    page boundary.

    Here you need to access the TLB twice.

    The instruction causes an alignment fault if a page cross boundary is detected.

    probably not as wise as you think.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Dec 2 21:20:33 2025
    From Newsgroup: comp.arch

    On 2025-12-02 2:55 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is it does not go backwards to
get the address fetched again from the TLB, meaning no check is made for
    protection or translation of the address.

You can determine if an access is misaligned "enough" to warrant two
    trips down the pipe.
    a) crosses cache width
    b) crosses page boundary

Case b ALWAYS needs 2 trips; so the mechanism HAS to be there.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

An AGEN-like adder has 11 gates of delay; you can determine misalignment
in 4 gates of delay.

    I was thinking in terms of clock cycles. The recalc of the address could
be triggered by resetting bits in the reorder buffer, which causes the instruction to be re-dispatched. I am not sure how many clocks, but
    likely a minimum of four or five. Memory access is sequential, so it
    will stall other accesses too.

    I have a tendency not to think about the gate delays too much, until
    they appear on the timing path. The lookup tables can absorb a good
chunk of gate delay.


    This should only be an issue if an unaligned access crosses a memory
    page boundary.

    Here you need to access the TLB twice.

    The instruction causes an alignment fault if a page cross boundary is
    detected.

    probably not as wise as you think.

    I coded it so it makes two trips to the TLB now for page boundaries (in theory). I got to thinking that maybe the page size could be made huge
    to avoid page crossings.

    I may need to put more logic in to ensure the same load store queue slot
    is used. I think it should work since things are sequential.

    My toy is broken. It is taking too long to synthesize. Qupls is so
    complex now. I may pick something simpler.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Thu Dec 4 16:54:56 2025
    From Newsgroup: comp.arch

    In article <20251201132322.000051a5@yahoo.com>,
    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [snip]
    If hardware designers put their mind to it, they could make sequential
    consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?

More so, where does it simplify over ARMv8.1-A, assuming that the programmer
does not try to be too smart, never uses LL/SC, and always uses
8.1-style synchronization instructions with Acquire+Release flags set?

IMHO, the only simple thing about sequential consistency is simple
description. Other than that, it simplifies very little. It does not
magically make lockless multithreaded programming bearable to
    non-genius coders.

    Compiler writers have hidden behind the hardware complexity to make
    writing source code that is thread-safe much harder than it should be.
    If you have to support placing hardware barriers, then the languages
    can get away with needing lots of <atomic> qualifiers everywhere, even
    on systems which don't need barriers, making the code more complex. And language purists still love to sneer at volatile in C-like languages as "providing no guarantees, and so is essentially useless"--when volatile providing no guarantees is a language and compiler choice, not something written in stone. A bunch of useful algorithms could be written with
    merely "volatile" like semantics, but for some reason, people like the line-noise-like junk of C++ atomics, where rather than thinking in terms
    of the algorithm, everyone needs to think in terms of release and acquire. (Which are weakly-ordering concepts).

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 4 18:37:54 2025
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <20251201132322.000051a5@yahoo.com>,
    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [snip]
    If hardware designers put their mind to it, they could make sequential
    consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?

More so, where does it simplify over ARMv8.1-A, assuming that the programmer
does not try to be too smart, never uses LL/SC, and always uses
8.1-style synchronization instructions with Acquire+Release flags set?

IMHO, the only simple thing about sequential consistency is simple
description. Other than that, it simplifies very little. It does not
magically make lockless multithreaded programming bearable to
    non-genius coders.

    Compiler writers have hidden behind the hardware complexity to make
    writing source code that is thread-safe much harder than it should be.

    Blaming the wrong people.

    If you have to support placing hardware barriers, then the languages
    can get away with needing lots of <atomic> qualifiers everywhere, even
    on systems which don't need barriers, making the code more complex. And

    Thread-safe, by definition, is (IS) harder.

    language purists still love to sneer at volatile in C-like languages as "providing no guarantees, and so is essentially useless"--when volatile providing no guarantees is a language and compiler choice, not something written in stone.

The problem with volatile is that all it means is that every time a volatile
variable is touched, the code has to have a corresponding LD or ST. The HW
ends up knowing nothing about the value's volatility and ends up in no
position to help.
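
Concretely (a sketch; the register address is invented), volatile only
obliges the compiler to re-issue the access each time:

    #include <stdint.h>

    #define READY 0x1u

    /* invented memory-mapped status register */
    static volatile uint32_t * const status =
        (volatile uint32_t *)0x4000A000u;

    void wait_ready(void)
    {
        /* each iteration must emit a fresh LD of *status; without
           volatile the compiler could hoist the load and spin forever */
        while ((*status & READY) == 0)
            ;
    }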

    A bunch of useful algorithms could be written with
    merely "volatile" like semantics, but for some reason, people like the line-noise-like junk of C++ atomics, where rather than thinking in terms
    of the algorithm, everyone needs to think in terms of release and acquire. (Which are weakly-ordering concepts).

    As far as ATOMICs go:: until you can code a single ATOMIC event that moves
    an element of a concurrent data structure from one place to another in a
    single event, you are thinking too SMALL (4-pointers in 4 different cache lines).

In addition, the code should NOT have to test for success/failure, but
    be defined in such a way that if you get here success is known and if
    you get there, failure is known.

    Kent

    Mitch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 11:10:22 2025
    From Newsgroup: comp.arch

    On 04/12/2025 19:37, MitchAlsup wrote:

    kegs@provalid.com (Kent Dickey) posted:


    Thread-safe, by definition, is (IS) harder.

    language purists still love to sneer at volatile in C-like languages as
    "providing no guarantees, and so is essentially useless"--when volatile
    providing no guarantees is a language and compiler choice, not something
    written in stone.

The problem with volatile is that all it means is that every time a volatile
variable is touched, the code has to have a corresponding LD or ST. The HW
ends up knowing nothing about the value's volatility and ends up in no
position to help.


    "volatile" /does/ provide guarantees - it just doesn't provide enough guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever. But
    you need volatile semantics for atomics and fences as well - there's no
    point in enforcing an order at the hardware level if the accesses can be re-ordered at the software level!

    "volatile" on its own is therefore not sufficient for atomics on big
    modern processors. But it /is/ sufficient for some uses, such as
    accessing hardware registers, or for small atomic loads and stores on
    single processor systems (which are far and away the biggest market, as embedded microcontrollers).

    As I see it, the biggest problem with "volatile" in C is
    misunderstandings and misuse of all sorts. At least, that's what I see
    in my field of embedded development.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Dec 5 14:37:57 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),
    and I don't think that C with just volatile gives you such guarantees.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 18:29:48 2025
    From Newsgroup: comp.arch

    On 05/12/2025 15:37, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".

    It says a good deal about the ordering at the C level - but nothing
    about it at the memory level.

    I know very little about the MMU setups on "big" systems like the x86-64 world. But in the embedded microcontroller world, it is very common for
    areas of the memory map to have sequential consistency even if other
    areas can be re-ordered, cached, or otherwise jumbled around. Thus for memory-mapped peripheral areas, memory accesses are kept strictly in
    order and "volatile" is all you need.

    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core machines.

    Sure. Of course multi-core systems will not have that hardware
    guarantee, at least not on main memory, for performance reasons. So
    there you need something more than just C "volatile" to force specific orderings. But volatile semantics will still be needed in many cases.
    Thus "volatile" is not sufficient, but it is still necessary. Usually,
    of course, all necessary "volatile" qualifiers are included in OS or
    library macros or functions for anything that needs them for locks or inter-process communication and the like. (In Linux, you have the
    READ_ONCE and WRITE_ONCE macros, which are just wrappers forcing
    volatile accesses.)


    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),
    and I don't think that C with just volatile gives you such guarantees.


    Correct.

    Getting this wrong is one of the problems I have seen with volatile
    usage in embedded systems. I've seen people assuming that declaring "x"
    as "volatile" means that "x++;" is an atomic operation, or that volatile
    alone lets you share 64-bit data between threads on a 32-bit processor.

    Used correctly, it /can/ be enough for shared data between pre-emptive
    threads or a main loop and interrupts on a single core system. But
    sometimes you need to do more (for microcontrollers, that usually means disabling interrupts for a short period).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 5 17:57:48 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that >affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    and I don't think that C with just volatile gives you such guarantees.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 20:10:11 2025
    From Newsgroup: comp.arch

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor. Even with a
    single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.
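
In C11 terms (a sketch of the first two items on that list):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_long counter;

    void hit(void)
    {
        atomic_fetch_add(&counter, 1);  /* atomic increment */
    }

    bool try_claim(atomic_long *slot, long expected, long mine)
    {
        /* compare-and-swap: succeeds only if *slot still holds
           'expected'; on failure 'expected' gets the observed value */
        return atomic_compare_exchange_strong(slot, &expected, mine);
    }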


    and I don't think that C with just volatile gives you such guarantees.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 5 20:54:00 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type_t oldp, type_t oldq,
              type_t *p,   type_t *q,
              type_t newp, type_t newq )
{
    type_t t = esmLOCKload( *p );
    type_t r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.
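
(For callers, either routine is typically wrapped in a retry loop; a
sketch using the DCAS above, assuming type_t is an integer handle:)

    /* atomically replace *p while bumping the ABA counter in *q */
    void set_with_aba(type_t *p, type_t *q, type_t newval)
    {
        type_t oldp, oldq;
        do {
            oldp = *p;
            oldq = *q;
        } while( !DCAS( oldp, oldq, p, q, newval, oldq + 1 ) );
    }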

    Even with a single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 5 14:55:36 2025
    From Newsgroup: comp.arch

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type_t oldp, type_t oldq,
              type_t *p,   type_t *q,
              type_t newp, type_t newq )
{
    type_t t = esmLOCKload( *p );
    type_t r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

    [...]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 5 15:03:53 2025
    From Newsgroup: comp.arch

    On 12/5/2025 11:10 AM, David Brown wrote:
    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

It's strange with double-word compare-and-swap (DWCAS), where the words
are contiguous. I have seen compilers say it's not lock-free even
on x86: for a 32-bit system we have cmpxchg8b, for a 64-bit system
cmpxchg16b. But the compiler reports not lock-free. Strange.

    using cmpxchg instead of xadd:
    https://forum.pellesc.de/index.php?topic=7167.0

    trying to tell me that a DWCAS is not lock free: https://forum.pellesc.de/index.php?topic=7311.msg27764#msg27764

    This should be lock-free on an x86, even x64:

struct ct_proxy_dwcas
{
    struct ct_proxy_node* node;
    intptr_t count;
};

    some of my older code:

AC_SYS_APIEXPORT
int AC_CDECL
np_ac_i686_atomic_dwcas_fence
(   void*,          /* destination (8-byte aligned)  */
    void*,          /* comparand; updated on failure */
    const void* );  /* exchange value                */


np_ac_i686_atomic_dwcas_fence PROC
    push esi
    push ebx
    mov esi, [esp + 16]            ; comparand
    mov eax, [esi]                 ; EDX:EAX = expected value
    mov edx, [esi + 4]
    mov esi, [esp + 20]            ; exchange value
    mov ebx, [esi]                 ; ECX:EBX = new value
    mov ecx, [esi + 4]
    mov esi, [esp + 12]            ; destination
    lock cmpxchg8b qword ptr [esi] ; if [esi] == EDX:EAX, store ECX:EBX
    jne np_ac_i686_atomic_dwcas_fence_fail
    xor eax, eax                   ; success: return 0
    pop ebx
    pop esi
    ret

np_ac_i686_atomic_dwcas_fence_fail:
    mov esi, [esp + 16]
    mov [esi + 0], eax             ; write observed value back to comparand
    mov [esi + 4], edx
    mov eax, 1                     ; failure: return 1
    pop ebx
    pop esi
    ret
np_ac_i686_atomic_dwcas_fence ENDP


    Even with a
    single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees.

    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 00:40:11 2025
    From Newsgroup: comp.arch

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
to go RISC-V style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Dec 6 07:26:24 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical
    register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to
    potential read ports), you may prefer a different representation of 0
    in the uops.
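
A sketch of that renamer variant (mine; the structure is invented):
architectural r0 maps to one physical register that is wired to zero
and never handed out by the free list.

    #define PZERO 0  /* physical register hard-wired to zero */

    struct renamer {
        int map[32];  /* architectural -> physical */
        /* free list etc. omitted */
    };

    static int rename_src(const struct renamer *r, int areg)
    {
        return (areg == 0) ? PZERO : r->map[areg];
    }

    static int rename_dst(struct renamer *r, int areg, int fresh_preg)
    {
        if (areg == 0)
            return PZERO;  /* a write to r0 is discarded; PZERO is
                              never allocated, so it stays zero    */
        r->map[areg] = fresh_preg;
        return fresh_preg;
    }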

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 05:13:01 2025
    From Newsgroup: comp.arch

    On 2025-12-06 2:26 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
    in the uops.

    - anton

    Thanks,

    It should have occurred to me to do this at the decode stage. Constants
    are decoded and passed along for all register fields in decode. There
    are only four decoders fortunately.

    Switching the ISA back to having r0 as zero all the time.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Dec 6 14:42:13 2025
    From Newsgroup: comp.arch

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


The functions below rely on more than that - to make them work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't require additional hardware.


    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.


    Of course. It is interesting to speculate about possible features of an architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a
    single core system you can have pre-emptive multi-threading, or at least
    interrupt routines that may need to cooperate with other tasks on data.


and I don't think that C with just volatile gives you such guarantees.
    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Dec 6 17:16:11 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
required as a hardware atomic operation.

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes. Combined? Too inflexible.

    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.
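
A reference count is the textbook case for the add/sub pair; a minimal
C11 sketch, illustrative only:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Classic refcount: relaxed add on acquiring a reference; acq_rel on
       the final decrement so the object teardown is properly ordered. */
    static atomic_long refs = 1;

    void obj_get(void) { atomic_fetch_add_explicit(&refs, 1, memory_order_relaxed); }

    bool obj_put(void)  /* true when the caller must free the object */
    {
        return atomic_fetch_sub_explicit(&refs, 1, memory_order_acq_rel) == 1;
    }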
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:22:55 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.

    However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

A bit hard to tell, because of 2 things::
a) I carry around the thread priority, and when interference occurs,
   the higher priority thread wins--on ties, the already-accessing thread wins.
b) live-lock is resolved (or not) by the caller of these routines, not
   by these routines themselves.

    [...]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:29:53 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be degenerate.
    That is, R0 is not needed at all.
    ADD   R9,R7,R0        // is a MOV instruction
    AND   R9,R7,R0        // is a CLR instruction

    So, you don't have to treat R0 in bypassing, but as Operand processing.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of places. Otherwise r0 can be used as an ordinary register. Load / store instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    AGEN Rbase ==R0 implies Rbase = IP
    AGEN Rindex==R0 implies Rindex = 0
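
In C terms, that AGEN convention amounts to something like the sketch
below (a hedged illustration; the function and array names are made up):

    /* Sketch of address generation with the conventions above: base
       register 0 substitutes the instruction pointer, index register 0
       substitutes zero. Illustrative only. */
    typedef unsigned long addr_t;

    addr_t agen(addr_t ip, const addr_t reg[32],
                int rbase, int rindex, int scale, long disp)
    {
        addr_t base  = (rbase  == 0) ? ip : reg[rbase];
        addr_t index = (rindex == 0) ? 0  : reg[rindex];
        return base + (index << scale) + disp;
    }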

    I hit this trying to decide where to bypass another register code to represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
So, R0 gets forwarded just as often (or lack thereof) as any joe-random register.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:31:43 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
    in the uops.

    Another way to implement R0 is to have an AND gate after the Operand
flip-flop, and if <whatever> was captured is R0, then AND with 0,
otherwise AND with 1.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:44:30 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage >>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >>>> on hardware with weaker memory ordering than sequential consistency". >>>> If hardware guaranteed sequential consistency, volatile would provide >>>> guarantees that are as good on multi-core machines as on single-core >>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


The functions below rely on more than that - to make them work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot
    bigger than the size of a single register, not that the above instructions
    make writing ATOMIC events easier.

There is no bus!

    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

So, there is no way to write Test-and-Set!! You get Test-and-Test-and-Set
for free.

    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.
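
Put together, a try-acquire written against these primitives might look
roughly like this; a sketch only, reusing the esm intrinsics as they
appear in the examples in this thread (the Lock type is invented here):

    /* The esmLOCKload both begins the ATOMIC event and performs the
       "test"; the esmLOCKstore commits only if no interference was
       detected, otherwise control transfers to the event's control point. */
    typedef struct { long word; } Lock;

    BOOLEAN TryAcquire( Lock *l )
    {
        long v = esmLOCKload( l->word );   /* begin event, monitor the line */
        if( v == 0 )
        {
            esmLOCKstore( l->word, 1 );    /* commit: lock acquired */
            return TRUE;
        }
        return FALSE;                      /* held by someone else; no store */
    }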

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't require additional hardware.

I am using the "Miss Buffer" as the point of monitoring for interference.
a) it already has to monitor "other hits" from outside accesses, to deal
   with the coherence mechanism.
b) the esm additions to the Miss Buffer are on the order of 2%.
c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a
single core system you can have pre-emptive multi-threading, or at least
interrupt routines that may need to cooperate with other tasks on data.


and I don't think that C with just volatile gives you such guarantees.
    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 18:07:50 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes. Combined? Too inflexible.

BOOLEAN InsertElement( Element *el, Element *to )
{
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        el->next = tn;
        el->prev = to;
        to->next = el;
        esmLOCKstore( tn->prev, el );
        return TRUE;
    }
    return FALSE;
}

BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Dec 6 19:04:09 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:


    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.


    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

BOOLEAN InsertElement( Element *el, Element *to )
{
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        el->next = tn;
        el->prev = to;
        to->next = el;
        esmLOCKstore( tn->prev, el );
        return TRUE;
    }
    return FALSE;
}

BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

Most extant SMP processors provide a compare and swap operation, which
is widely supported by the common compilers that support the C and C++
threading functionality.
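
For reference, the portable C11 spelling of that compare-and-swap; a
minimal lock-free stack push, illustrative rather than taken from any
particular codebase:

    #include <stdatomic.h>

    struct node { struct node *next; };

    /* Treiber-stack push: the compiler lowers the CAS loop to CMPXCHG,
       LDXR/STXR, LR/SC, ... whatever the target provides. A failed CAS
       refreshes 'old' with the current head. */
    static void push(_Atomic(struct node *) *head, struct node *n)
    {
        struct node *old = atomic_load_explicit(head, memory_order_relaxed);
        do {
            n->next = old;
        } while (!atomic_compare_exchange_weak_explicit(
                     head, &old, n,
                     memory_order_release, memory_order_relaxed));
    }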
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Dec 6 21:36:27 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

Most extant SMP processors provide a compare and swap operation, which
is widely supported by the common compilers that support the C and C++
threading functionality.

    Interestingly, Linux restartable sequences allow for acquisition of
    a lock with no membarrier or atomic instruction on the fast path,
    at the cost of a syscall on the slow path (no free lunch...)

    But you also need assembler to do it.

An example is at https://gitlab.ethz.ch/extra_projects/cpu-local-lock
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 21:44:17 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:


    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.


    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

BOOLEAN InsertElement( Element *el, Element *to )
{
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        el->next = tn;
        el->prev = to;
        to->next = el;
        esmLOCKstore( tn->prev, el );
        return TRUE;
    }
    return FALSE;
}

BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than what already exists?!?

Most extant SMP processors provide a compare and swap operation, which
is widely supported by the common compilers that support the C and C++
threading functionality.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 18:33:55 2025
    From Newsgroup: comp.arch

    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be degenerate. That is, R0 is not needed at all.
    ADD   R9,R7,R0        // is a MOV instruction
    AND   R9,R7,R0        // is a CLR instruction

We don't want no degenerate instructions.

    So, you don't have to treat R0 in bypassing, but as Operand processing.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store
    instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    AGEN Rbase ==R0 implies Rbase = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
    Rbase = r0 bypasses to 0
    Rindex = r0 bypasses to 0
    Rbase = r31 bypasses to IP
    Bypassing r0 for both base and index allows absolute addressing mode.
    Otherwise r0, r31 are general-purpose regs.

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
So, R0 gets forwarded just as often (or lack thereof) as any joe-random register.

    Qupls has IP offset constant loading.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 18:55:17 2025
    From Newsgroup: comp.arch

    On 2025-12-06 6:33 p.m., Robert Finch wrote:
    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be
    degenerate.
    That is, R0 is not needed at all.
    ADD   R9,R7,R0        // is a MOV instruction
    AND   R9,R7,R0        // is a CLR instruction

We don't want no degenerate instructions.

    So, you don't have to treat R0 in bypassing, but as Operand processing.
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store
    instructions cannot use r0 as a GPR then, but it works for the PowerPC.

AGEN Rbase ==R0 implies Rbase  = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
Rbase  = r0 bypasses to 0
Rindex = r0 bypasses to 0
Rbase  = r31 bypasses to IP
Bypassing r0 for both base and index allows absolute addressing mode.
Otherwise r0, r31 are general-purpose regs.

    I hit this trying to decide where to bypass another register code to
represent the instruction pointer. In that case I think it may be better
to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
So, R0 gets forwarded just as often (or lack thereof) as any joe-random
    register.

    Qupls has IP offset constant loading.



No sooner had I updated the spec than I added two more opcodes to
perform loads and stores using IP relative addressing. That way there is
no need to use r31, leaving 31 registers completely general purpose. I
want to cast some aspects of the ISA in stone, or it will never get
anywhere.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 7 03:29:05 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-12-06 6:33 p.m., Robert Finch wrote:
    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
count. Removing the bypassing of r0 from the register file shaved 1000
LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be
    degenerate.
    That is, R0 is not needed at all.
    ADD   R9,R7,R0        // is a MOV instruction
    AND   R9,R7,R0        // is a CLR instruction

We don't want no degenerate instructions.

So, you don't have to treat R0 in bypassing, but as Operand processing.
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of
places. Otherwise r0 can be used as an ordinary register. Load / store
instructions cannot use r0 as a GPR then, but it works for the PowerPC.
AGEN Rbase ==R0 implies Rbase  = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
Rbase  = r0 bypasses to 0
Rindex = r0 bypasses to 0
Rbase  = r31 bypasses to IP
Bypassing r0 for both base and index allows absolute addressing mode.
Otherwise r0, r31 are general-purpose regs.

    I hit this trying to decide where to bypass another register code to
represent the instruction pointer. In that case I think it may be better
to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
So, R0 gets forwarded just as often (or lack thereof) as any joe-random
register.

    Qupls has IP offset constant loading.



No sooner had I updated the spec than I added two more opcodes to
perform loads and stores using IP relative addressing. That way there is
no need to use r31, leaving 31 registers completely general purpose. I
want to cast some aspects of the ISA in stone, or it will never get
anywhere.

    Cast some elements in plaster--this will hold for a few years until
    you find the bigger mistakes, then demolish the plaster and fix the
    parts that don't work so well.

    After 6 years of essential stability, I did a major update to My 66000
    ISA last month. The new ISA is ASCII compatible with the last, but not
    at the binary level, which solves several problems and saves another
    2%-4% in code footprint.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 7 09:30:50 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

This would likely be hidden in a header, and need only be
written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for
microarchitecture specific features, if the need for performance
gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation, which
is widely supported by the common compilers that support the C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Dec 7 16:05:32 2025
    From Newsgroup: comp.arch

    On Sun, 7 Dec 2025 09:30:50 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers
    will be unlikely to generate them, thus applications that desired
    the generation of such an instruction would need to create a
    compiler extension (like gcc __builtin functions) or inline
    assembler which would then make the program that uses the
    capability both compiler specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for
microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.


    By SAP HANA, I assume.
    Not sure for how long it was true. It sounds very unlikely that it is
    still true.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation,
which is widely supported by the common compilers that support the
    C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.

    TSX is close to dead.

ARM's TME was announced almost 5 years ago. AFAIK, there were no
implementations. Recently ARM said that FEAT_TME is obsoleted. It sounds
like the whole thing is dead, but there is a small chance that I am
misinterpreting.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Dec 7 16:13:06 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:



Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than what already exists?!?

Long experience. Back in the early 80's we had fancy instructions for
searching linked lists (up to 100 digit or byte keys; comparisons for
equal, ne, lt, gt, lte, gte, and any-bit-equal). They took special
language support to use, which meant they weren't usable from COBOL
without extensions. We also had Lock, Unlock and condition variable
instructions (with a small microkernel to handle the contention cases,
trapping on acquisition failure, release [when another thread was
pending], and event signal). Perhaps ahead of its time, as most of the
common languages (COBOL and Fortran) had no syntactical support for
them. We used them in the OS language (SPRITE), but they never got
traction in applications (and then the entire computer line was
discontinued in 1991).

    That's not to suggest that your innovations aren't potentially useful
    or an interesting take on multithreaded instruction primitives;
    just that idealism and the real world are often incompatible :-)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Dec 7 16:28:41 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for
microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    The ARM spec has been published. I'm not aware of any implementations
    of it to date, and the spec had been available to architecture partners
    for several years prior to 2022.

Intel's TSX support seems to be restricted to a subset of Xeon processors,
and it's not clear how well it's supported by non-Intel compilers.

    AMD has never released their Advanced Synchronization Facility in any
    processor to date.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 7 16:55:26 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 7 Dec 2025 09:30:50 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers
    will be unlikely to generate them, thus applications that desired
    the generation of such an instruction would need to create a
    compiler extension (like gcc __builtin functions) or inline
    assembler which would then make the program that uses the
    capability both compiler specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
    in this respecct). And people have been doing this, even for
    microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.


    By SAP HANA, I assume.
    Not sure for how long it was true. It sounds very unlikely that it is
    still true.

    https://www.redhat.com/en/blog/red-hat-enterprise-linux-performance-results-5th-gen-intel-xeon-scalable-processors
    from 2024 has benchmarks with TSX for SAP/HANA, and the processors
    (5th generation Xeon) at least pretend to have TSX.

    https://community.sap.com/t5/technology-blog-posts-by-sap/seamless-scaling-of-sap-hana-on-intel-xeon-processors-from-micro-to-mega/ba-p/13968648
    (almost a year old) writes

    "Intel's Transactional Synchronization Extensions (TSX), also
    implemented into the SAP HANA database, further enhances this
    scalability and offers a significant performance boost for critical
    HANA database operations."

    which does not read "required", but certainly sounds like it is an
    advantage.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation,
which is widely supported by the common compilers that support the
    C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.

    TSX is close to dead.

    For general-purpose computers, it seems the security implications
    killed it. An SAP server is a different matter; if you don't trust
    the software you are running there, you have other issues.


ARM's TME was announced almost 5 years ago. AFAIK, there were no
implementations. Recently ARM said that FEAT_TME is obsoleted. It sounds
like the whole thing is dead, but there is a small chance that I am
misinterpreting.

    Maybe restartable sequences are the way to go for lock-free
    critical sections. Not sure if everybody is aware of these. A good introduction can be found at https://lwn.net/Articles/883104/ .
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Dec 7 12:19:34 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    scott@slp53.sl.home (Scott Lurndal) posted:
    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
required as a hardware atomic operation.
    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.

Atomically moving an object from one doubly linked list to another,
like when a thread wakes up and moves from the waiting list to the
ready list.

One iteration of balancing a binary tree (AVL, red-black).

Plus the data structs above might straddle cache lines, so however many
objects there are, there could be twice that many lines being updated
at once.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Dec 7 17:48:50 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?

More so, where does it simplify over ARMv8.1-A, assuming that the
programmer does not try to be too smart and never uses LL/SC and always
uses 8.1-style synchronization instructions with Acquire+Release flags
set?

IMHO, the only simple thing about sequential consistency is its simple
description. Other than that, it simplifies very little. It does not
magically make lockless multithreaded programming bearable to
non-genius coders.

    Is single-core multi-threaded programming bearable to non-genius
    programmers? I think so. Sequential consistency plus atomic sequences
    (where the single-core program disables interrupts to start an atomic
    sequence and enables them to end an atomic sequence) gives the same
    programming model.
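
Concretely, this is what sequential consistency buys in the classic
store/load (Dekker-style) litmus test; a small C11 illustration, where
these operations default to memory_order_seq_cst:

    #include <stdatomic.h>

    /* Under sequential consistency at least one of the two loads must
       observe 1, so both threads returning 0 is impossible. Under
       relaxed or even acquire/release ordering, both may read 0. */
    _Atomic int x = 0, y = 0;

    int thread1(void) { atomic_store(&x, 1); return atomic_load(&y); }
    int thread2(void) { atomic_store(&y, 1); return atomic_load(&x); }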

    Concerning synchronization instructions and memory barriers of
    architectures with weaker memory models, their main problem is that
    they are implemented slowly, because the idea is to make only the
    weaker memory model go fast, and then suffer what you must if you need
    more guarantees. Already the guarantee makes them slow, not just the
    actual synchronization case. This makes the memory model hard to use,
    because you want to minimize the use of these instructions. And
    that's where the need for genius-level coding comes in.

    As for the size of the description, IMO this reflects on the
simplicity of programming. ARM's memory model was advertised here as:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. If it is
    simple to program, why does it need 32 pages of description?

    Concerning non-genius coders and coders that are not experts in memory
ordering models, the current setup seems to be designed to have a few
    people who program system software that does such things, and
    everybody else should just use this software (whether it's system
    calls or libraries). That's ok if the need to communicate between
    threads is rare, but not so great if it is frequent (especially the
    system-call variant). And if the need to communicate between threads
    is rare, it's also good enough if the hardware features for that need
    are slow. So maybe this whole setup is good enough.

    OTOH, maybe there are applications that could potentially use multiple
    threads that are currently using sequential programs or context
    switching within a hardware thread (green threads and the like)
    because the communication between the threads is too slow and making
    it faster is too hard to program. In that case the underutilization
    of many of the multi-core CPUs that we have may be due to this
    phenomenon. If so, the argument that it's too expensive in hardware
    resources to implement sequential consistency in hardware well does
    not hold: Is it more expensive than implementing an 8-core CPU where 6 or 7 cores are usually not utilized?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 14:51:01 2025
    From Newsgroup: comp.arch

    On 12/5/2025 3:03 PM, Chris M. Thomasson wrote:
    On 12/5/2025 11:10 AM, David Brown wrote:
    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

It's strange with double-word compare and swap (DWCAS), where the words
are contiguous. Well, I have seen compilers say it's not lock-free even
on x86. For a 32 bit system we have cmpxchg8b; for a 64 bit system,
cmpxchg16b. But the compiler reports not lock free. Strange.

    using cmpxchg instead of xadd: https://forum.pellesc.de/index.php?topic=7167.0

    trying to tell me that a DWCAS is not lock free: https://forum.pellesc.de/index.php?topic=7311.msg27764#msg27764

    This should be lock-free on an x86, even x64:

struct ct_proxy_dwcas
{
    struct ct_proxy_node* node;
    intptr_t count;
};

Ideally, struct ct_proxy_dwcas should be aligned on an L2 cache line and
padded up to the size of a cache line.
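
One way (of several) to spell such a DWCAS today is GCC/Clang's __atomic
builtins; a sketch assuming x86-64 built with -mcx16 so the compiler can
emit LOCK CMPXCHG16B (and may still decline to report it as lock-free,
which is exactly the oddity described above). The dwcas wrapper is
invented for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    struct ct_proxy_node;               /* opaque here */

    struct ct_proxy_dwcas {
        struct ct_proxy_node *node;
        intptr_t count;
    } __attribute__((aligned(16)));     /* 16-byte alignment for cmpxchg16b */

    static bool dwcas(struct ct_proxy_dwcas *dst,
                      struct ct_proxy_dwcas *cmp,   /* updated on failure */
                      struct ct_proxy_dwcas xchg)
    {
        return __atomic_compare_exchange(dst, cmp, &xchg,
                                         false,     /* strong variant */
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }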




    some of my older code:

    AC_SYS_APIEXPORT
    int AC_CDECL
    np_ac_i686_atomic_dwcas_fence
    ( void*,
  void*,
  const void* );


np_ac_i686_atomic_dwcas_fence PROC
  push esi
  push ebx
  mov esi, [esp + 16]
  mov eax, [esi]
  mov edx, [esi + 4]
  mov esi, [esp + 20]
  mov ebx, [esi]
  mov ecx, [esi + 4]
  mov esi, [esp + 12]
  lock cmpxchg8b qword ptr [esi]
  jne np_ac_i686_atomic_dwcas_fence_fail
  xor eax, eax
  pop ebx
  pop esi
  ret

np_ac_i686_atomic_dwcas_fence_fail:
  mov esi, [esp + 16]
  mov [esi + 0], eax
  mov [esi + 4], edx
  mov eax, 1
  pop ebx
  pop esi
  ret
np_ac_i686_atomic_dwcas_fence ENDP


Even with a single core system you can have pre-emptive
multi-threading, or at least interrupt routines that may need to cooperate
    with other tasks on data.


and I don't think that C with just volatile gives you such guarantees.
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 15:09:15 2025
    From Newsgroup: comp.arch

    On 12/6/2025 9:22 AM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.

You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.

However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type   oldp, type_t oldq,
              type  *p,    type_t *q,
              type   newp, type   newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

A bit hard to tell, because of 2 things::
a) I carry around the thread priority, and when interference occurs,
   the higher priority thread wins--on ties, the already-accessing thread wins.
b) live-lock is resolved (or not) by the caller of these routines, not
   by these routines themselves.

Hummm... Iirc, I was able to cause damage to a strong CAS. It was around
20 years ago. A thread was running strong CAS in a tight loop. I counted
success vs failure, then allowed some other threads to alter the
target word with random data. The failure rate for the CAS increased.
Actually, I think cmpxchg, cmpxchg8b, cmpxchg16b, and the strange one on
Itanium. Cannot remember it right now. cmp8xchg16? Or some shit.

    Well, they would hit a bus lock if they failed too many times. I think
    Scott knows about it.
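
A minimal C11 reconstruction of that experiment, assuming <threads.h>
and <stdatomic.h> are available (thread and iteration counts are
arbitrary):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <threads.h>

static atomic_ulong target;
static atomic_bool  stop;

/* Other thread: scribble random data on the target word. */
static int scribbler( void *arg )
{
    (void)arg;
    while( !atomic_load( &stop ) )
        atomic_store( &target, (unsigned long)rand() );
    return 0;
}

int main( void )
{
    thrd_t t;
    unsigned long ok = 0, fail = 0;

    thrd_create( &t, scribbler, NULL );
    for( long i = 0; i < 10000000; i++ )
    {
        unsigned long expect = atomic_load( &target );
        /* strong CAS in a tight loop, counting success vs failure */
        if( atomic_compare_exchange_strong( &target, &expect, expect + 1 ) )
            ok++;
        else
            fail++;
    }
    atomic_store( &stop, true );
    thrd_join( t, NULL );
    printf( "success %lu, failure %lu\n", ok, fail );
    return 0;
}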
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 15:17:04 2025
    From Newsgroup: comp.arch

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>> affects the hardware.-a So volatile writes are ordered at the C level, >>>>>> but that says nothing about how they might progress through storage >>>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >>>>> on hardware with weaker memory ordering than sequential consistency". >>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>> guarantees that are as good on multi-core machines as on single-core >>>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.


The functions below rely on more than that - to make that work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?
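
A hypothetical x86-64 sketch of exactly that (GCC/Clang inline asm; the
misaligned cast is formally UB in C, which is part of the point):

#include <stdint.h>

static _Alignas(64) unsigned char buf[128];

/* LOCKed xadd; returns the previous value of *p. */
static inline uint32_t locked_xadd( volatile uint32_t *p, uint32_t v )
{
    __asm__ __volatile__( "lock xaddl %0, %1"
                          : "+r"(v), "+m"(*p)
                          :
                          : "memory" );
    return v;
}

int main( void )
{
    /* bytes 62..65 straddle the line boundary at offset 64, so the
       LOCKed RMW cannot be satisfied by a cache-line lock and the
       core falls back to a bus lock (a "split lock"; recent cores
       can be configured to fault on this instead). */
    volatile uint32_t *split = (volatile uint32_t *)(buf + 62);
    locked_xadd( split, 1 );
    return 0;
}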

    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 16:08:03 2025
    From Newsgroup: comp.arch

    On 12/6/2025 1:36 PM, Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    Interestingly, Linux restartable sequences allow for acquisition of
    a lock with no membarrier or atomic instruction on the fast path,
    at the cost of a syscall on the slow path (no free lunch...)

    But you also need assembler to do it.

    An example is, for example, at https://gitlab.ethz.ch/extra_projects/cpu-local-lock


I need to read more about them, but they kind of remind me of an
asymmetric mutex, or rwmutex: ones that use a remote membar on the slow
path. Iirc, FlushProcessWriteBuffers on Windows, and synchronize_rcu or
membarrier on Linux.
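
A rough Linux sketch of that asymmetric idea, assuming the membarrier()
syscall is available (the helper names and the fast/slow split are
hypothetical):

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier_sys( int cmd, unsigned flags )
{
    return (int)syscall( __NR_membarrier, cmd, flags, 0 );
}

/* Once at startup. */
static void asym_init( void )
{
    membarrier_sys( MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0 );
}

/* Fast side: pays only a compiler barrier per operation; ordering
   against the slow side is established by the slow side's syscall. */
static inline void asym_fence_fast( void )
{
    __asm__ __volatile__( "" ::: "memory" );
}

/* Slow side (rare): force a full memory barrier on every running
   thread of this process -- the "remote membar". */
static void asym_fence_slow( void )
{
    membarrier_sys( MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0 );
}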
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 16:36:59 2025
    From Newsgroup: comp.arch

    On 12/6/2025 10:07 AM, MitchAlsup wrote:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

BOOLEAN DCAS( type oldp, type_t oldq,
              type *p,   type_t *q,
              type newp, type newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

In my 40 years of SMP OS/HV work, I don't recall a
situation where 'MoveElement' would be useful or
required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

BOOLEAN InsertElement( Element *el, Element *to )
{
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        el->next = tn;
        el->prev = to;
        to->next = el;
        esmLOCKstore( tn->prev, el );
        return TRUE;
    }
    return FALSE;
}

BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

Atomic add/sub are useful. The other atomic math operations (min, max,
etc) may be useful in certain cases as well.
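
For instance, insertion at the head of a singly linked list needs only
the one CAS (a C11 sketch, not from the thread):

#include <stdatomic.h>

typedef struct Node { struct Node *next; } Node;

/* Treiber-style push: publish n as the new head; on interference
   the CAS fails, h is reloaded, and we retry. */
void push( _Atomic(Node *) *head, Node *n )
{
    Node *h = atomic_load( head );
    do {
        n->next = h;
    } while( !atomic_compare_exchange_weak( head, &h, n ) );
}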

Have you ever read about KCSS (k-compare single-swap)?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143
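
For readers who haven't: KCSS atomically checks k locations against
expected values and swaps just one of them. A lock-based *specification*
of the semantics (the whole point of the paper is achieving this without
such a lock):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t kcss_spec_lock = PTHREAD_MUTEX_INITIALIZER;

/* Succeeds, and swaps *addr[0] to newv, only if all k locations
   still hold their expected values. */
bool KCSS( size_t k, void **addr[], void *expect[], void *newv )
{
    bool ok = true;
    pthread_mutex_lock( &kcss_spec_lock );
    for( size_t i = 0; i < k; i++ )
        if( *addr[i] != expect[i] )
        {
            ok = false;
            break;
        }
    if( ok )
        *addr[0] = newv;
    pthread_mutex_unlock( &kcss_spec_lock );
    return ok;
}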
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Dec 8 10:07:25 2025
    From Newsgroup: comp.arch

    On 06/12/2025 18:44, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


The functions below rely on more than that - to make that work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.


    That's what I assumed.

    Certainly there are situations where it can be helpful to have longer
    atomic reads and writes. I am not so sure about allowing 8 KB atomic accesses, especially in a system with multiple cores - that sounds like letting user programs DoS everything else on the system.

    These is no bus!

    I think there's a typo or some missing words there?


    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

    So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set for free.

    If I understand you correctly here, you basically have a "load-reserve / store-conditional" sequence as commonly found in RISC architectures, but
    you have the associated loop built into the hardware? I can see that potentially improving efficiency, but I also find it very difficult to
    read or write C code that has hidden loops. And I worry about how it
    would all work if another thread on the same core or a different core
    was running similar code in the middle of these sequences. It also
    reduces the flexibility - in some use-cases, you want to have software
    limits on the number of attempts of a lr/sc loop to detect serious synchronisation problems.
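
The software-limit pattern being described, sketched with an ordinary
CAS loop (C11; MAX_TRIES is an arbitrary hypothetical bound):

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_TRIES 1000

/* Returns false after MAX_TRIES failed attempts so the caller can
   escalate (log, back off, fall back to a blocking lock, ...). */
bool bounded_increment( _Atomic unsigned *ctr )
{
    unsigned v = atomic_load( ctr );
    for( int i = 0; i < MAX_TRIES; i++ )
        if( atomic_compare_exchange_weak( ctr, &v, v + 1 ) )
            return true;
    return false;
}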


    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't
    require additional hardware.

    I am using the "Miss Buffer" as the point of monitoring for interference.
    a) it already has to monitor "other hits" from outside accesses to deal
    with the coherence mechanism.
b) the esm additions to the Miss Buffer are on the order of 2%

    c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

BOOLEAN DCAS( type oldp, type_t oldq,
              type *p,   type_t *q,
              type newp, type newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an
    architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

Even with a single core system you can have pre-emptive multi-threading,
or at least interrupt routines that may need to cooperate with other
tasks on data.

and I don't think that C with just volatile gives you such guarantees.
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Dec 8 10:12:19 2025
    From Newsgroup: comp.arch

    On 08/12/2025 00:17, Chris M. Thomasson wrote:
    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide
    enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware.-a So volatile writes are ordered at the C >>>>>>> level,
    but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.

The functions below rely on more than that - to make that work, as far
as I can see, you need the first "esmLOCKload" to lock the bus and
also lock the core from any kind of interrupt or other pre-emption,
lasting until the esmLOCKstore instruction. Or am I missing something
here?

Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?


    Yes, I meant "lock the bus" - but I might have been overcautious.
    However, it seems there is a hidden hardware loop here - the
    esmLOCKstore instruction can fail and and the processor jumps back to
    the first esmLOCKload instruction. With that, you don't need to block
    other code from running or accessing the bus.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Dec 8 07:25:42 2025
    From Newsgroup: comp.arch

    <snip>
BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

I am assuming the esmLOCKstore() just unlocks what was previously locked
and the stores have already happened by that time.

    It would seem that esmINTERFERENCE() would indicate that everybody with
    access out to the coherence point has agreed to the locked area? Does
    that require that all devices respect the esmINTERFERENCE()?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 8 04:32:39 2025
    From Newsgroup: comp.arch

    On 12/8/2025 1:12 AM, David Brown wrote:
    On 08/12/2025 00:17, Chris M. Thomasson wrote:
    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>> that
    affects the hardware.-a So volatile writes are ordered at the C >>>>>>>> level,
    but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
    You describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would
    provide
    guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.

The functions below rely on more than that - to make that work, as far
as I can see, you need the first "esmLOCKload" to lock the bus and
also lock the core from any kind of interrupt or other pre-emption,
lasting until the esmLOCKstore instruction. Or am I missing
something here?

    Lock the BUS? Only when shit hits the fan. What about locking the
    cache line? Actually, I think we can "force" an x86/x64 to lock the
    bus if we do a LOCK'ed RMW on memory that straddles cache lines?


    Yes, I meant "lock the bus" - but I might have been overcautious.
    However, it seems there is a hidden hardware loop here - the
    esmLOCKstore instruction can fail and and the processor jumps back to
    the first esmLOCKload instruction.-a With that, you don't need to block other code from running or accessing the bus.



Humm.. For some damn reason it reminds me of a multi lock thing I did a
while back. Called it the multex. Consisted of a table of locks. A
thread would take the addresses it wanted to lock, hash them into the
table, remove duplicates, sort them, and take them all without any
fear of deadlock.

    (read all when you get some free time to burn...) https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

    It kind of seems like it might want to work with Mitch's scheme in a
    loose sense?
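
A guess at what that looks like in C with pthreads (table size, hash,
and names are all hypothetical; each tbl[] slot must be
pthread_mutex_init'ed once at startup, omitted here):

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TBL 256
static pthread_mutex_t tbl[TBL];

static int cmp_size_t( const void *a, const void *b )
{
    size_t x = *(const size_t *)a, y = *(const size_t *)b;
    return (x > y) - (x < y);
}

/* Hash n addresses into lock-table slots, sort, and deduplicate;
   returns the number of distinct slots left in slot[]. */
size_t multex_plan( void *addr[], size_t n, size_t slot[] )
{
    size_t m = 0;
    for( size_t i = 0; i < n; i++ )
        slot[i] = ((uintptr_t)addr[i] >> 4) % TBL;
    qsort( slot, n, sizeof slot[0], cmp_size_t );
    for( size_t i = 0; i < n; i++ )
        if( i == 0 || slot[i] != slot[m-1] )
            slot[m++] = slot[i];
    return m;
}

/* Taking the sorted, deduplicated slots in ascending order means two
   threads can never hold locks in opposite orders: no deadlock. */
void multex_lock( const size_t slot[], size_t m )
{
    for( size_t i = 0; i < m; i++ )
        pthread_mutex_lock( &tbl[slot[i]] );
}

void multex_unlock( const size_t slot[], size_t m )
{
    while( m-- )
        pthread_mutex_unlock( &tbl[slot[m]] );
}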
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Dec 8 08:23:59 2025
    From Newsgroup: comp.arch

    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>
BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally
    sufficient.

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.


I am assuming the esmLOCKstore() just unlocks what was previously locked
and the stores have already happened by that time.

    There is no "locking" in the sense of preventing any accesses.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Dec 8 17:14:11 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across
    buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same
    address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.

Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitor covers an
implementation-defined range surrounding the target address, and the
store will fail if any other agent has modified any byte within the
exclusive range.
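
For reference, the AArch64 shape of that pattern, as a hypothetical
GCC/Clang inline-asm sketch (an atomic 64-bit add via LDXR/STXR):

#include <stdint.h>

/* Returns the previous value of *p; the stxr status register is
   nonzero if the exclusive monitor was lost, so we retry. */
static uint64_t atomic_add64( volatile uint64_t *p, uint64_t v )
{
    uint64_t old, tmp;
    uint32_t fail;
    __asm__ __volatile__(
        "1: ldxr  %0, [%3]       \n"   /* load-exclusive           */
        "   add   %1, %0, %4     \n"
        "   stxr  %w2, %1, [%3]  \n"   /* store-exclusive          */
        "   cbnz  %w2, 1b        \n"   /* monitor lost: try again  */
        : "=&r"(old), "=&r"(tmp), "=&r"(fail)
        : "r"(p), "r"(v)
        : "memory" );
    return old;
}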

    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:06:34 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>> affects the hardware.-a So volatile writes are ordered at the C level, >>>>>> but that says nothing about how they might progress through storage >>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>
    You describe in many words and not really to the point what can be >>>>> explained concisely as: "volatile says nothing about memory ordering >>>>> on hardware with weaker memory ordering than sequential consistency". >>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>> guarantees that are as good on multi-core machines as on single-core >>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>> atomic operations beyond load and store (even on single-core systems), >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.

The functions below rely on more than that - to make that work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Dec 8 20:15:13 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >> >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >> >>>>>> affects the hardware.-a So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >> >>>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >> >>>>> on hardware with weaker memory ordering than sequential consistency". >> >>>>> If hardware guaranteed sequential consistency, volatile would provide >> >>>>> guarantees that are as good on multi-core machines as on single-core >> >>>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >> >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.

The functions below rely on more than that - to make that work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    What if two processors have intersecting (but not fully overlapping)
    sets of those 8 cache lines?

    Can you guarantee forward progress?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:20:27 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 06/12/2025 18:44, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


The functions below rely on more than that - to make that work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.


    That's what I assumed.

    Certainly there are situations where it can be helpful to have longer
    atomic reads and writes. I am not so sure about allowing 8 KB atomic accesses, especially in a system with multiple cores - that sounds like letting user programs DoS everything else on the system.

    These is no bus!

    I think there's a typo or some missing words there?

There is a fabric-based interconnect to transport data-transfer requests
around the system, where everyone connected to the transport can send
a new request, receive a response, and receive a SNOOP simultaneously.

There is NO single point on the fabric one can GRAB to prevent other
sections of the fabric from doing their prescribed transport duties.

    There is a memory ordering protocol in L3/DRAM-controller that prevents
    more than one "SNOOP per cache line" from being "in progress" at the
    same time.


    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

    So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set for free.

    If I understand you correctly here, you basically have a "load-reserve / store-conditional" sequence as commonly found in RISC architectures, but
    you have the associated loop built into the hardware?

    In effect, yes. I have a multi-{LoadLocked StoreConditional} scheme
    as found in other RISC architectures with several small/big changes::
    a) you get up to 8 LLs
    b) the last SC causes the rest of the system to see all the memory
    changes at the same time (or nobody sees any changes).
    c) The ATOMIC sequence cannot persist across an exception or interrupt.
    d) only participating memory lines have the ATOMIC property.

    And yes, control transfer is built-into the architecture.

    I can see that potentially improving efficiency, but I also find it very difficult to
    read or write C code that has hidden loops. And I worry about how it
    would all work if another thread on the same core or a different core
    was running similar code in the middle of these sequences. It also
    reduces the flexibility - in some use-cases, you want to have software limits on the number of attempts of a lr/sc loop to detect serious synchronisation problems.

    In this case, said SW would use the Branch-on-interference instruction.


    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't >> require additional hardware.

    I am using the "Miss Buffer" as the point of monitoring for interference. a) it already has to monitor "other hits" from outside accesses to deal
    with the coherence mechanism.
    b) that esm additions to Miss Buffer are on the order of 2%

    c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

BOOLEAN DCAS( type oldp, type_t oldq,
              type *p,   type_t *q,
              type newp, type newq )
{
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}

    Move Element from one place to another:

BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        to->next = fr;
        tn->prev = fr;
        fr->prev = to;
        esmLOCKstore( fr->next, tn );
        return TRUE;
    }
    return FALSE;
}

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an >> architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

Even with a single core system you can have pre-emptive multi-threading,
or at least interrupt routines that may need to cooperate with other
tasks on data.

and I don't think that C with just volatile gives you such guarantees.
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:30:34 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    <snip>
BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler, which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

esmLOCKload sets up monitors (in Miss Buffers) that detect SNOOPs to
the participating cache lines.

    esmINTERFERENCE sets up a block of code that either executes in its
    entirety or fails in its entirety--and transfers control.

    In "certain circumstances" the code inside the esmINTERFERENCE block
    are allowed to NaK SNOOPs to those lines. So, if interference happens
    this late, you can effectively tell requestor "Yes, I have that cache
    line, No you cannot have it right now".

If the requestor gets a NaK and was attempting an ATOMIC event,
the event fails. If the requestor was NOT attempting one, it resubmits
the request. In both cases, the thread causing the interference is the
one delayed, while the one performing the event has a higher probability
of success.

I am assuming the esmLOCKstore() just unlocks what was previously locked
and the stores have already happened by that time.

    Yes, it is the terminal sentinel.

    It would seem that esmINTERFERENCE() would indicate that everybody with access out to the coherence point has agreed to the locked area? Does
    that require that all devices respect the esmINTERFERENCE()?

    I can see you are getting at something subtle, here. I cannot quite grasp
    what it might be.

    Can you ask the above again but use different words ?!?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:35:01 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.

Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitor covers an
implementation-defined range surrounding the target address, and the
store will fail if any other agent has modified any byte within the
exclusive range.

    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.

Over in the Miss Buffer there are (at least) 8 miss buffers. Each miss
buffer has to monitor inbound messages for requests (SNOOPs) to its
entry.

So, each MB entry has a bit to tell if it is participating in an event.
esmINTERFERENCE is a way to sample all participating MB entries
simultaneously; and in addition, esmINTERFERENCE is part of what enables
the NaKing of SNOOP requests.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 21:58:00 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    ERROR "unexpected byte sequence starting at index 736: '\xC2'" while decoding:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware.|e-a So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >> >>>>>> queues, caches, inter-processor communication buses, or whatever. >> >>>>>
    You describe in many words and not really to the point what can be >> >>>>> explained concisely as: "volatile says nothing about memory ordering >> >>>>> on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core >> >>>>> machines.

    However, for concurrent manipulations of data structures, one wants >> >>>>> atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >> >>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32|e-a|e-a DWs|e-a|e-a as a single ATOMIC instruction.
    MM|e-a|e-a|e-a|e-a|e-a can MOV|e-a|e-a up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting >> > until the esmLOCKstore instruction.|e-a Or am I missing something here? >>
    Lock the BUS? Only when shit hits the fan. What about locking the cache >> line? Actually, I think we can "force" an x86/x64 to lock the bus if we >> do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    What if two processors have intersecting (but not fully overlapping)
    sets of those 8 cache lines?

    Can you guarantee forward progress?

    Yes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 8 16:31:08 2025
    From Newsgroup: comp.arch

    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.

Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitor covers an
implementation-defined range surrounding the target address, and the
store will fail if any other agent has modified any byte within the
exclusive range.

Any mutation of the reservation granule?




    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Dec 9 09:13:54 2025
    From Newsgroup: comp.arch

    On 08/12/2025 17:23, Stephen Fuld wrote:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes
    as a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across buffers and bus bridges. It would have to go to the memory
    coherence point. Otherwise, some other device using a bridge could
    update the same address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.


    Yes, that is correct (as far as I understand it now). The critical part
    is the hidden hardware loop that was not mentioned or indicated in the original code.

    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)
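
A minimal C11 sketch of the two, assuming mtx_init() has been called
once at startup: the same increment done by prevention (a lock) and by
detection-plus-retry (a CAS loop):

#include <stdatomic.h>
#include <threads.h>

static mtx_t    lock;                   /* prevention              */
static unsigned counter_locked;

void inc_locked( void )
{
    mtx_lock( &lock );                  /* nothing can interleave  */
    counter_locked++;
    mtx_unlock( &lock );
}

static _Atomic unsigned counter_cas;    /* detection + retry       */

void inc_cas( void )
{
    unsigned v = atomic_load( &counter_cas );
    /* on conflict the CAS fails, v is reloaded, and we retry */
    while( !atomic_compare_exchange_weak( &counter_cas, &v, v + 1 ) )
        ;
}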

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)


    I am assuming the esmLockStore() just unlocks what was previously
    locked and the stores have already happened by that time.

    There is no "locking" in the sense of preventing any accesses.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 19:15:48 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 08/12/2025 17:23, Stephen Fuld wrote:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes
    as a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across buffers and bus bridges. It would have to go to the memory
    coherence point. Otherwise, some other device using a bridge could
    update the same address range in the middle of an update.

    ---------------------------------
I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.


    Yes, that is correct (as far as I understand it now). The critical part
    is the hidden hardware loop that was not mentioned or indicated in the original code.
    ---------------------------------

    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

Consider a server-scale esm implementation. In such an implementation, esm
is enhanced with a system* arbiter.

After any successful ATOMIC event esm reverts to "Optimistic" mode. In
optimistic mode, esm races through the code as fast as possible, expecting
no interference. When interference is detected, the event fails and a HW
counter is incremented. The failure diverts control to the ATOMIC control
point. We still have the property that all participating memory locations
become visible at the same instant.

    At this point the core is in "careful" mode, core becomes sequentially consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails
    and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be
    performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    At this point the core is in "Slow and Methodological" mode. Now, after
    all participating cache lines have been touched, all the physical pointers
    are bundled into a message and sent to the system arbiter. System arbiter examines each cache line address and if no-other-core has a reservation
    on ANY of them, then system arbiter installs said reservations, and
    returns "success". At this point, core is allowed to NaK interfering
    accesses. This event WILL SUCCEED. After the event is complete, the
    termination of the event at the core, takes the same bundle of addresses
    and sends it back to system arbiter; who removes them from reservation.

    Optimistic mode takes no more cycles than if the memory references were
    not ATOMIC.

    I should also note:: none of this state is preserved across interrupts
    or exceptions. So, an interrupt or exception causes the event to fail
    prior to control transfer. Interrupts do not care about this control
    transfer. Exception control transfer in My 66000 packs everything the
    exception handler needs in registers, so having IP point at ATOMIC
    control point with the registers setup for page fault does not cause
    exception handler any issues whatsoever.

    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound   [Address]
    OutBound  [Address]

    operates like::

    try_again:
        InBound   [Address]
        BIN       try_again
        OutBound  [Address]

And why clutter up asm with extraneous labels and require extra instructions?
--- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Dec 9 20:51:26 2025
    From Newsgroup: comp.arch

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the
    situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a
    hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.
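
As a hedged sketch of what such renamed intrinsics might look like at a
call site (the signatures here are invented for illustration; the real
esm intrinsics may differ):

#include <stdint.h>

uint64_t load_and_set_retry_point(volatile uint64_t *p);    /* was esmLOCKload()  */
void     store_or_retry(volatile uint64_t *p, uint64_t v);  /* was esmLOCKstore() */

void atomic_increment(volatile uint64_t *p)
{
    /* the load both reads the value and establishes the HW retry point */
    uint64_t v = load_and_set_retry_point(p);
    /* the store either commits the event, or HW re-runs from the load  */
    store_or_retry(p, v + 1);
}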


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 21:28:47 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)

The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its
    entirety before the next instruction appears to start executing.

1st:: one cannot single step through an ATOMIC event; if you enter an
ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
    has executed.

2nd:: the only way to debug an event is to have a buffer of SW locations
that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially
consistent manner (architecturally)--and can be examined outside the
event; whereas the participating lines are either all written
instantaneously or not modified at all.

So, here we have non-participating STs having been written and older
participating STs have not.
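
A hedged C sketch of that debugging technique; the esm intrinsic
signatures and the trace-buffer layout are assumptions for illustration:

#include <stdint.h>

extern uint64_t esmLOCKload(volatile uint64_t *p);              /* participating load  */
extern void     esmLOCKstore(volatile uint64_t *p, uint64_t v); /* participating store */

uint64_t trace[64];   /* ordinary memory: non-participating */
unsigned ti;

void traced_update(volatile uint64_t *p)
{
    uint64_t v = esmLOCKload(p);   /* participating: all-or-nothing      */
    trace[ti++ & 63] = v;          /* non-participating ST: survives a
                                      failed event, examinable outside   */
    esmLOCKstore(p, v + 1);        /* participating: commits the event   */
}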

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.

4th:: one cannot test esm with a random code generator, since the
probability that the random code generator creates a legal esm event is
exceedingly low.
--- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Dec 9 13:55:12 2025
    From Newsgroup: comp.arch

    On 12/9/2025 11:15 AM, MitchAlsup wrote:

    snip


    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

Consider a server-scale esm implementation. In such an implementation, esm
    is enhanced with a system* arbiter.

    After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting
no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.
    At this point the core is in "careful" mode,

    I am missing some understanding here, about this "counter". This
    paragraph seems to indicate that after one failure, the core goes into "careful" mode, but if that were true, you wouldn't need a "counter",
    just a mode bit. So assuming it is a counter and you need "n" failures
    in a row to go into careful mode, is "n" hardwired or settable by
    software? What are the tradeoffs for smaller or larger values of "n"?

    core becomes sequentially
    consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails
    and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    Again, after a single failure in careful mode or n failures? If n, is
    it the same value of n as for the transition from optimistic to careful
    mode? Same questions as before about who sets the value and is it
    software changeable?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 22:52:31 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/9/2025 11:15 AM, MitchAlsup wrote:

    snip


    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

Consider a server-scale esm implementation. In such an implementation, esm is enhanced with a system* arbiter.

After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.
    At this point the core is in "careful" mode,

    I am missing some understanding here, about this "counter". This
    paragraph seems to indicate that after one failure, the core goes into "careful" mode, but if that were true, you wouldn't need a "counter",
    just a mode bit. So assuming it is a counter and you need "n" failures
    in a row to go into careful mode, is "n" hardwired or settable by
    software? What are the tradeoffs for smaller or larger values of "n"?

2-bits; 3-states--not part of saved thread state.

    core becomes sequentially
    consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    Again, after a single failure in careful mode or n failures? If n, is
    it the same value of n as for the transition from optimistic to careful mode? Same questions as before about who sets the value and is it
    software changeable?

    3-state counter::

    00 -> Optimistic
    01 -> Careful
10 -> Slow and methodical

    success -> counter = 00;
    failure -> counter++;
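
In C terms, a minimal model of that counter (illustrative only; the real
thing is hardware state, not software):

enum esm_mode { OPTIMISTIC, CAREFUL, SLOW_AND_METHODICAL };

static enum esm_mode mode = OPTIMISTIC;   /* counter = 00 */

static void esm_event_done(int success)
{
    if (success)
        mode = OPTIMISTIC;                /* success -> counter = 00 */
    else if (mode != SLOW_AND_METHODICAL)
        mode++;                           /* failure -> counter++    */
}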
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Dec 10 10:07:19 2025
    From Newsgroup: comp.arch

    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)

The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter
    nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
    "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
    have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

    IME, most instructions on most processors are indivisible, but most
    processors have some instructions that are not. For example, processors
    can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

    But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming,
    pipelining, speculative execution, dependency tracking, and all the rest
    of it.

1st:: one cannot single step through an ATOMIC event; if you enter an
ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the
    device.

2nd:: the only way to debug an event is to have a buffer of SW locations
that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially
consistent manner (architecturally)--and can be examined outside the
event; whereas the participating lines are either all written
instantaneously or not modified at all.

    So, here we have non-participating STs having been written and older participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics). My main concern was
    the disconnect between how the code was written and what it actually does.

    4th:: one cannot test esm with a random code generator, since the probability that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is
    always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Dec 10 08:51:16 2025
    From Newsgroup: comp.arch

    On 12/10/2025 1:07 AM, David Brown wrote:
    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)

The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)

How exactly do you inform the programmer that:

        InBound   [Address]
        OutBound  [Address]

operates like::

try_again:
        InBound   [Address]
        BIN       try_again
        OutBound  [Address]

And why clutter up asm with extraneous labels and require extra
instructions.

The most obvious answer is that in any code that uses these features,
good comments are essential so that readers can see what is happening.

Another method would be to use better names for the intrinsics, as seen
at the C (or other HLL) level. (Assembly instruction names don't matter
nearly as much.)

So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
"load_and_set_retry_point()" and "store_or_retry()". Feel free to think
of better names, but that would at least give the reader a clue that
there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its
    entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

IME, most instructions on most processors are indivisible, but most processors have some instructions that are not. For example, processors can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming, pipelining, speculative execution, dependency tracking, and all the rest
    of it.

1st:: one cannot single step through an ATOMIC event; if you enter an
ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the device.

2nd:: the only way to debug an event is to have a buffer of SW locations
that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially
consistent manner (architecturally)--and can be examined outside the
event; whereas the participating lines are either all written
instantaneously or not modified at all.

    So, here we have non-participating STs having been written and older
    participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and
    interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,

Yes, but ISTM there is a hardware limit on the number of retries - it
is two retries, as the third try (second retry) is guaranteed to
succeed, albeit at a higher cost (in time and interference with other
threads/processes) compared to the earlier tries.


    or add SW tracking of retry counts for metrics).

Again, ISTM that you could do some software tracking by using non-
participating stores within the locked area to save information outside
the locked area. I haven't thought through the cost/benefit of this,
how much to save, etc.

    But I am not sure that the "escalation" to a more "intrusive" mechanism
    upon a single failure is optimal. Perhaps it would be better to retry
    once or twice using the current mechanism. I don't have a good feeling
    for what is optimal here, and to what extent the optimal choice would be workload dependent.


    My main concern was
    the disconnect between how the code was written and what it actually does.

4th:: one cannot test esm with a random code generator, since the
probability that the random code generator creates a legal esm event is
exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is always very difficult.

    Yup!


    You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    :-)
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 10 20:10:43 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)

The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)
    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
at the C (or other HLL) level. (Assembly instruction names don't matter
nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
    "load_and_set_retry_point()" and "store_or_retry()". Feel free to think >> of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within esm. Where vonNeumann means: that every instruction is executed in its entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
    have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

    An ATOMIC event is a series of instructions that appear to be performed
    all at once--as if the whole series was "indivisible".

    IME, most instructions on most processors are indivisible, but most processors have some instructions that are not. For example, processors
    can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

    Go in the other direction, where a series of instructions HAS TO APPEAR
    as if executed instantaneously.

    But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming, pipelining, speculative execution, dependency tracking, and all the rest
    of it.

None of those things is ARCHITECTURAL--esm is an architectural window into
how to program ATOMIC events such that no future generation of the ISA has
to continuously add more synchronization instructions. One can program
every known industrial and academic synchronization primitive in esm
without ever adding new synchronization instructions.

1st:: one cannot single step through an ATOMIC event; if you enter an
ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the device.

No, it is the nature of executing a series of instructions as if instantaneously.

2nd:: the only way to debug an event is to have a buffer of SW locations
that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially
consistent manner (architecturally)--and can be examined outside the
event; whereas the participating lines are either all written
instantaneously or not modified at all.

    So, here we have non-participating STs having been written and older participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).

esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute
before/between any real instructions.

    My main concern was
    the disconnect between how the code was written and what it actually does.

    There is a 26 page specification the programmer needs to read and understand. This includes things we have not talked about--such as::
    a) terminating an event without writing anything
    b) proactively minimizing future interference
    c) modifications to cache coherence model
    at the architectural level.

The architectural specification allows for various scales of µArchitecture
to independently choose how to implement esm and provide the architectural
features at the SW level. For example, the kinds of esm activities for a
1-wide In-Order µController are vastly different than those suitable for a
server-scale rack of processor ensembles. What we want is one SW model that covers
    the whole gamut.

    4th:: one cannot test esm with a random code generator, since the probability
    that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Right at Christmas time !! {Ask me how I know}.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 11 20:26:09 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).

esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute
before/between any real instructions.

    My main concern was
    the disconnect between how the code was written and what it actually does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and you don't want optimisers re-arranging things too much.

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    There is a 26 page specification the programmer needs to read and understand.
    This includes things we have not talked about--such as::
    a) terminating an event without writing anything
    b) proactively minimizing future interference
    c) modifications to cache coherence model
    at the architectural level.

    Fair enough. This is not a minor or simple feature!

No, it is a design that allows the ISA to remain static while all sorts of synchronization stuff gets written, tested, and tuned.


The architectural specification allows for various scales of µArchitecture to independently choose how to implement esm and provide the architectural features at the SW level. For example, the kinds of esm activities for a 1-wide In-Order µController are vastly different than those suitable for a server-scale rack of processor ensembles. What we want is one SW model that covers the whole gamut.

    4th:: one cannot test esm with a random code generator, since the probability
    that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is
    always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Right at Christmas time !! {Ask me how I know}.

    We can gather round the fire, and Grampa can settle in his rocking chair
    to tell us war stories from the olden days :-)

    A good story is always nice, so go for it!

Year:: 1997, time:: 7 days before Christmas. Situation:: the customer is
having (and has had) strange bugs that happen about once a week.
The customer is unhappy; we have had a senior engineer on site for
4 months without forward progress. We were told "You don't come home
until the problem is fixed".

System:: 2 (or more) of our cache coherent motherboards, connected
with a proven cache coherent bus.

On the flight from Austin to Manchester, England, I decided that what
we had was a physics experiment. So, when we arrived, I had their SW
guy code up a routine that, as soon as it got a time slice, would
signal that it no longer needed time, while we hooked up the logic
analyzer to our motherboards and to their bus. When the SW was ready
(about 30 minutes) we tried the case--instantly, the time between
showings of the bug went from once a week to milliseconds. We spent the
afternoon taking logic analyzer traces, and went to dinner.

The next day, we went through the traces with a fine-tooth comb and
found a smoking gun--so we ran more experiments, and this same smoking
gun was found in each trace. After a couple of hours, we found that
their proven coherent bus was allowing 1 single cycle where our bus
could be seen in an inconsistent state, and it was only a dozen
cycles downstream that the crash transpired.

It turned out that their bus was only coherent when the attached bus
was slower than 4 cycles in responding to a "random coherent message",
whereas our bus was timed at 2 cycles for this response.

So, we took apart their FPGA which ran the bus, found out how to
delay one signal, and reprogrammed it--ONLY to run into another message
that was off by 1 or 2 cycles. This one took a whole day to find and
program around.

    We both made it home for Christmas, and in some part saved the company...

    (We once had a system where there was a bug that not only triggered only
    at the customer's site, but did so only on the 30th of September. It
    took years before we made the connection to the date and found the bug.)


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Dec 11 20:47:12 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

Any example? This would definitely go against what I would consider
to be reasonable for an assembler. gas certainly does not do so.

    What _would_ be useful on occasion would be an assembler which
    could do register assignment, for example for a small function.
    It would be OK if this were to issue an error if there were too
    many variables for assignment.

    Does anybody know of such a beast?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Dec 11 23:51:26 2025
    From Newsgroup: comp.arch

    On Thu, 11 Dec 2025 20:26:09 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    We both made it home for Christmas, and in some part saved the
    company...


Not for long, though... Wasn't it dead anyway within 6-7 months?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:00:53 2025
    From Newsgroup: comp.arch

    On 12/10/2025 1:07 AM, David Brown wrote:
    [...]
Testing and debugging any kind of locking or atomic access solution is always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

Murphy's Law. Actually, have you ever messed around with Relacy Race
Detector? It's pretty interesting.


    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:03:29 2025
    From Newsgroup: comp.arch

    On 12/11/2025 3:02 PM, Chris M. Thomasson wrote:
    On 12/11/2025 1:05 AM, David Brown wrote:
    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,
or add SW tracking of retry counts for metrics).

esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute
before/between any real instructions.

My main concern was
    the disconnect between how the code was written and what it actually
    does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and
    you don't want optimisers re-arranging things too much.

Right. Way back before C/C++11 I would code all of my sensitive
lock/wait-free code in assembly.
    [...]




    Actually, I would turn off link-time optimization back then.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:02:40 2025
    From Newsgroup: comp.arch

    On 12/11/2025 1:05 AM, David Brown wrote:
    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,
or add SW tracking of retry counts for metrics).

esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute
before/between any real instructions.

My main concern was
    the disconnect between how the code was written and what it actually
    does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and you don't want optimisers re-arranging things too much.

    Right. Way back before C/C++ 11 I would code all of my sensitive lock/wait-free code in assembly.

    [...]



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 12 08:59:12 2025
    From Newsgroup: comp.arch

    On 11/12/2025 22:51, Michael S wrote:
    On Thu, 11 Dec 2025 20:26:09 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    We both made it home for Christmas, and in some part saved the
    company...


    Not for long so... Was not it dead anyway in the 6-7 months?


    This is why stories end with "they all lived happily ever after", and
    why sequel movies are almost always terrible! I liked the first story
    better.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:37:03 2025
    From Newsgroup: comp.arch

    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>>> affects the hardware.-a So volatile writes are ordered at the C level, >>>>>>>> but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
    You describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32-a-a DWs-a-a as a single ATOMIC instruction. >>>> MM-a-a-a-a-a can MOV-a-a up to 8192 bytes as a single ATOMIC instruction. >>>>

The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

Pretty flexible wrt implementing those exotic things back in the day,
experimental algos that need DCAS, KCSS, etc... A heck of a lot of
things can be accomplished with DWCAS, aka cmpxchg8b on a 32-bit system
or cmpxchg16b on a 64-bit system.

People would bend over backwards to get a DCAS, or NCAS. It would be
infested with strange indirection a la "descriptors", and involved a
shitload of atomic RMWs. CAS, DWCAS, XCHG and XADD can get a lot done.
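
A hedged sketch of a 64-bit DWCAS via GCC/Clang builtins; compile with
-mcx16 on x86-64 so cmpxchg16b can be emitted, and note the struct
layout and names are illustrative only:

#include <stdint.h>
#include <string.h>

typedef struct { uint64_t ptr; uint64_t tag; } ref_t;  /* 16 bytes; tag fights ABA */

/* target must be 16-byte aligned; returns nonzero on success */
static int dwcas(volatile unsigned __int128 *target, ref_t expected, ref_t desired)
{
    unsigned __int128 e, d;
    memcpy(&e, &expected, sizeof e);
    memcpy(&d, &desired, sizeof d);
    return __sync_bool_compare_and_swap(target, e, d);  /* lock cmpxchg16b */
}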
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:39:16 2025
    From Newsgroup: comp.arch

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>>> that
    affects the hardware.-a So volatile writes are ordered at the C >>>>>>>>> level,
    but that says nothing about how they might progress through >>>>>>>>> storage
    queues, caches, inter-processor communication buses, or whatever. >>>>>>>>
    You describe in many words and not really to the point what can be >>>>>>>> explained concisely as: "volatile says nothing about memory
    ordering
    on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would >>>>>>>> provide
    guarantees that are as good on multi-core machines as on single- >>>>>>>> core
    machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32-a-a DWs-a-a as a single ATOMIC instruction. >>>>> MM-a-a-a-a-a can MOV-a-a up to 8192 bytes as a single ATOMIC instruction. >>>>>

    The functions below rely on more than that - to make the work, as
    far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting >>>> until the esmLOCKstore instruction.-a Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    Pretty flexible wrt implementing those exotic things back in the day, experimental algos that need DCAS, KCSS, ect... A heck of a lot of
    things can be accomplished with DWCAS, aka cmpxchg8b on a 32 bit system.
    or cmpxchg16b on a 64-bit system.

    People would bend over backwards to get a DCAS, or NCAS. It would be infested with strange indirection ala d"escriptors", and involved a shit load of atomic RMW's. CAS, DWCAS, XCHG and XADD can get a lot done.

    Have you ever read about KCSS?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:47:50 2025
    From Newsgroup: comp.arch

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:
    [...]
People would bend over backwards to get a DCAS, or NCAS. It would be
infested with strange indirection a la "descriptors", and involved a
shitload of atomic RMWs. CAS, DWCAS, XCHG and XADD can get a lot done.

    I am trying to convey that a lot of neat algos do not even need the
    fancy DCAS, NCAS.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 12 23:39:53 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems. >>>>>>>>> Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>>> that
    affects the hardware.-a So volatile writes are ordered at the C >>>>>>>>> level,
    but that says nothing about how they might progress through >>>>>>>>> storage
    queues, caches, inter-processor communication buses, or whatever. >>>>>>>>
    You describe in many words and not really to the point what can be >>>>>>>> explained concisely as: "volatile says nothing about memory >>>>>>>> ordering
    on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would >>>>>>>> provide
    guarantees that are as good on multi-core machines as on single- >>>>>>>> core
    machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>>> atomic operations beyond load and store (even on single-core >>>>>>>> systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32-a-a DWs-a-a as a single ATOMIC instruction. >>>>> MM-a-a-a-a-a can MOV-a-a up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as
    far as
    I can see, you need the first "esmLOCKload" to lock the bus and also >>>> lock the core from any kind of interrupt or other pre-emption, lasting >>>> until the esmLOCKstore instruction.-a Or am I missing something here? >>>
    Lock the BUS? Only when shit hits the fan. What about locking the cache >>> line? Actually, I think we can "force" an x86/x64 to lock the bus if we >>> do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    Pretty flexible wrt implementing those exotic things back in the day, experimental algos that need DCAS, KCSS, ect... A heck of a lot of
    things can be accomplished with DWCAS, aka cmpxchg8b on a 32 bit system. or cmpxchg16b on a 64-bit system.

    People would bend over backwards to get a DCAS, or NCAS. It would be infested with strange indirection ala d"escriptors", and involved a shit load of atomic RMW's. CAS, DWCAS, XCHG and XADD can get a lot done.

    Have you ever read about KCSS?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143

    While I was not directly exposed to KCSS, I was exposed to the underlying
    need for multi-location Compare and Swap requirements, and provided a means
    to implement same in both ASF and ESM. {All of us (synchronization people)
    were so exposed. And a lot of academic ideas came out of those trends, too.}

In my case, I simply wanted a way "out" of inventing a new synchronization
primitive every ISA generation. What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 15:52:45 2025
    From Newsgroup: comp.arch

    On 12/6/2025 11:04 AM, Scott Lurndal wrote:
    [...]
    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++ threading functionality.

    Right. However, a DWCAS is important as well... Well, for me... This
    only works on contiguous words.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 15:56:52 2025
    From Newsgroup: comp.arch

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.

Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined
range surrounding the target address and the store will fail if any other
agent has modified any byte within the exclusive range.

Any mutation to the reservation granule?

    I forgot if a load from the reservation granule would cause a LL/SC to
    fail. I know a store would. False sharing in poorly written programs
    would cause it to occur. LL/SC experiencing live lock. This was back in
    my PPC days.
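
A common C mitigation for that false-sharing problem is to give each
contended variable its own cache line / reservation granule; a minimal
sketch, assuming a 64-byte granule:

#include <stdalign.h>
#include <stdatomic.h>

/* One granule per counter, so stores to a neighbor cannot
   keep killing this core's LL/SC reservation. */
struct padded {
    alignas(64) _Atomic long value;  /* struct is padded out to 64 bytes */
};
struct padded counters[8];           /* each element starts its own line */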

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Dec 13 09:31:05 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
to the cache coherence model (NaK) that indicates "Yes I have the line you referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress? My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.
    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 19:03:07 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.

I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.

Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined
range surrounding the target address and the store will fail if any other
agent has modified any byte within the exclusive range.

Any mutation to the reservation granule?

    I forgot if a load from the reservation granule would cause a LL/SC to
    fail. I know a store would. False sharing in poorly written programs
    would cause it to occur. LL/SC experiencing live lock. This was back in
    my PPC days.

A LD to the granule would cause loss of write permission, causing a long
delay to perform the SC and greatly increasing the probability of interference.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 19:12:28 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
to the cache coherence model (NaK) that indicates "Yes I have the line you referenced, but, no you can't have it right now" in order to strengthen the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress?

    The allowance of a NaK is only available under somewhat special
    circumstances::
    a) in Careful mode:: when core can see that all STs have write permission
    and data is present, NaKs allow the Modification part to run to
    completion.
    b) In Slow and Methodical mode:: core can NaK any access to any of its
    cache lines--preventing interference.

    My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.

    Yes, the "other guy" takes the hit not the guy who has made more forward progress. If B was an innocent accessor of the data, it retires its request--this generally takes 100-odd cycles, allowing A to complete
    the event by the time the innocent request shows up again.

    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.

    Instead of some contrived back-off policy--at the failure point one can
    read the WHY register. 0 indicates success; negative indicates spurious, positive indicates how far down the line of requestors YOU happen to be.
    So, if you are going after a unit of work, you march down the queue WHY
    units and then YOU are guaranteed that YOU are the only one after that
    unit of work.
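
A hedged C sketch of that WHY-register idiom; esmWHY(), the queue-walking
helper, and the claim function are hypothetical names standing in for the
real interface:

struct unit;                        /* opaque unit of work                       */
extern long esmWHY(void);           /* 0 success; <0 spurious; >0 queue position */
extern struct unit *step_down(struct unit *u, long n);
extern int try_claim_atomic(struct unit *u);   /* the ATOMIC event itself */

struct unit *claim_work(struct unit *head)
{
    for (;;) {
        struct unit *u = head;
        (void)try_claim_atomic(u);  /* run the ATOMIC event              */
        long why = esmWHY();        /* read at the control/failure point */
        if (why == 0)
            return u;               /* event succeeded                   */
        if (why > 0)                /* march down the queue WHY units:   */
            head = step_down(head, why);  /* you are then the only one
                                             after that unit of work     */
        /* why < 0: spurious failure -- simply retry */
    }
}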


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Dec 13 11:49:46 2025
    From Newsgroup: comp.arch

    On 12/13/2025 11:03 AM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing
    executes as
    a unit. It would seem to me that the address range(s) needing to be >>>>>> locked would have to be supplied throughout the system, including
    across
    buffers and bus bridges. It would have to go to the memory coherence >>>>>> point. Otherwise, some other device using a bridge could update the >>>>>> same
    address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.-a The >>>>> ESM doesn't *prevent* interference, but it *detect* interference.-a Thus >>>>> nothing is required of other cores, no locks, etc.-a If they write to a >>>>> "protected" location, the write is allowed, but the core in the ESM is >>>>> notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor" >>>> (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm).-a The ARMv8 monitors an implementation defined >>>> range surrounding the target address and the store will fail if any other >>>> agent has modified any byte within the exclusive range.

    Any mutation to the reservation granule?

    I forget whether a load from the reservation granule would cause an
    LL/SC to fail; I know a store would. False sharing in poorly written
    programs would cause it to occur, with the LL/SC experiencing
    livelock. This was back in my PPC days.

    A LD to the granule would cause loss of write permission, causing a
    long delay to perform the SC and greatly increasing the probability
    of interference.

    So, you need to create a rule: if you program for my system, you MUST
    make sure that everything is properly aligned and padded. Been there,
    done that. Now, think of nefarious agents... I was able to damage a
    simple strong-CAS loop with other threads mutating the cache line on
    purpose, as a stress test... CAS would start hitting higher and
    higher failure rates, and finally hit the bus to ensure some sort of
    forward progress.
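
    A portable C++ sketch of that kind of stress test (my reconstruction,
    not the original code): a "nefarious" thread hammers a neighbouring
    field in the same cache line while the main thread runs a CAS loop.
    On LL/SC-based hardware the weak CAS can fail spuriously under that
    traffic; on x86 you will mostly see throughput loss instead:

    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    struct SharedLine {                  // deliberately NOT padded apart
        std::atomic<uint64_t> target{0};
        std::atomic<uint64_t> noise{0};  // likely same cache line as target
    };

    int main() {
        SharedLine line;
        std::atomic<bool> stop{false};
        uint64_t retries = 0;

        std::thread nefarious([&] {      // mutates the line on purpose
            while (!stop.load(std::memory_order_relaxed))
                line.noise.fetch_add(1, std::memory_order_relaxed);
        });

        for (int i = 0; i < 1000000; ++i) {
            uint64_t old = line.target.load(std::memory_order_relaxed);
            // weak CAS may fail spuriously on LL/SC machines when the
            // line is disturbed; count every retry as interference
            while (!line.target.compare_exchange_weak(old, old + 1))
                ++retries;
        }
        stop.store(true, std::memory_order_relaxed);
        nefarious.join();
        std::printf("CAS retries under interference: %llu\n",
                    (unsigned long long)retries);
        return 0;
    }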
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Dec 13 11:46:17 2025
    From Newsgroup: comp.arch

    On 12/13/2025 11:12 AM, MitchAlsup wrote:

    <snip>

    Instead of some contrived back-off policy--at the failure point one
    can read the WHY register: 0 indicates success; negative indicates a
    spurious failure; positive indicates how far down the line of
    requestors YOU happen to be. So, if you are going after a unit of
    work, you march down the queue WHY units, and then YOU are guaranteed
    that YOU are the only one after that unit of work.

    Step one: make sure that a failure means another thread made
    progress. Strong CAS does this. Don't let it spuriously fail where
    nothing makes progress... ;^o

    Oh my, we got a load on the reservation granule: abort all LL/SC in
    progress wrt that granule. Of course this assumes that the user who
    created the program gets things right. For LL/SC on the PPC it
    definitely helps when things are aligned and padded out to a
    reservation granule, not just an L2 cache line. That helps mitigate
    false sharing causing livelock.

    Even with weak CAS, which is akin to LL/SC: how sensitive is that
    reservation granule? Can a simple load cause a failure?
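
    The align-and-pad rule above might look like this in C++; the
    128-byte granule size is an assumption--substitute whatever size
    your target documents:

    #include <atomic>
    #include <cstddef>

    constexpr std::size_t kGranule = 128;  // assumed reservation-granule size

    struct alignas(kGranule) PaddedCounter {
        std::atomic<long> value{0};
        // keep any neighbour out of this granule
        char pad[kGranule - sizeof(std::atomic<long>)];
    };

    static_assert(sizeof(PaddedCounter) == kGranule, "padding is wrong");

    // Each element owns a whole granule, so an LL/SC (or CAS) on one
    // cannot be disturbed by ordinary stores to another.
    PaddedCounter counters[4];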
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 21:58:07 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/13/2025 11:12 AM, MitchAlsup wrote:

    <snip>

    Step one: make sure that a failure means another thread made
    progress. Strong CAS does this. Don't let it spuriously fail where
    nothing makes progress... ;^o

    Absolutely!

    WHY is only valid in "slow and methodical" mode, which has strong
    guarantees of forward progress--at least 1 thread is making forward
    progress in S&M.

    Spurious has to do with things like "system arbiter buffer overflow"
    and is not related to exceptions or interrupts.

    Oh my, we got a load on the reservation granule: abort all LL/SC in
    progress wrt that granule. Of course this assumes that the user who
    created the program gets things right.

    This is why I created NaK in the cache coherence protocol--to strengthen
    the guarantee of forward progress.

    For LL/SC on the PPC it definitely helps when things are aligned and
    padded out to a reservation granule, not just an L2 cache line. That
    helps mitigate false sharing causing livelock.

    Even with weak CAS, which is akin to LL/SC: how sensitive is that
    reservation granule? Can a simple load cause a failure?

    An innocent LD gets NaKed, causing the innocent thread to waste time
    while allowing the ATOMIC event to make forward progress.

    In my case the reservation granule is a cache line {which is the same
    across the memory hierarchy--but still allows for an
    implementation-defined size}.

    For example:: HBM can deliver 1024 bits (soon 2048 bits) in a single
    beat, so, for main_memory == HBM, it makes sense to align the size of
    the LLC line to the width of HBM (1024 bits = 128 bytes). Once in the
    LLC, you can parcel it out any way your system prescribes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 22:03:16 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/13/2025 11:03 AM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing
    executes as
    a unit. It would seem to me that the address range(s) needing to be >>>>>> locked would have to be supplied throughout the system, including >>>>>> across
    buffers and bus bridges. It would have to go to the memory coherence >>>>>> point. Otherwise, some other device using a bridge could update the >>>>>> same
    address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.-a The >>>>> ESM doesn't *prevent* interference, but it *detect* interference.-a Thus
    nothing is required of other cores, no locks, etc.-a If they write to a >>>>> "protected" location, the write is allowed, but the core in the ESM is >>>>> notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor" >>>> (the basis of the Store-Exclusive/Load-Exclusive instructions, which >>>> mirror the LL/SC paradigm).-a The ARMv8 monitors an implementation defined
    range surrounding the target address and the store will fail if any other
    agent has modified any byte within the exclusive range.

    Any mutation the reservation granule?

    I forgot if a load from the reservation granule would cause a LL/SC to
    fail. I know a store would. False sharing in poorly written programs
    would cause it to occur. LL/SC experiencing live lock. This was back in
    my PPC days.

    A LD to the granule would cause loss of write permission, causing a
    long delay to perform the SC and greatly increasing the probability
    of interference.

    So, you need to create a rule: if you program for my system, you MUST
    make sure that everything is properly aligned and padded. Been there,
    done that. Now, think of nefarious agents... I was able to damage a
    simple strong-CAS loop with other threads mutating the cache line on
    purpose, as a stress test... CAS would start hitting higher and
    higher failure rates, and finally hit the bus to ensure some sort of
    forward progress.

    This is why NaKing the interference works better. The interfering
    agent takes the timing hit, while the ATOMIC event has a higher
    probability of success.

    Also note: esm is not subject to the ABA problem at all--because any
    interrupt or exception causes the event to terminate prior to the
    control transfer.

    And this is ALSO why there is no thread state associated with esm--
    excepting the 16-bit WHY value, which is only set if/when there are
    no {E,I} control transfers.
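
    For contrast, here is the classic ABA hazard on a CAS-based
    (Treiber-style) stack, sketched in C++ and intentionally broken:
    between the load and the CAS, another thread can pop A, pop B, and
    push A back, and the CAS still succeeds on the stale snapshot:

    #include <atomic>

    struct Node { Node* next; };

    std::atomic<Node*> top{nullptr};

    Node* buggy_pop() {
        Node* t = top.load();
        while (t != nullptr) {
            Node* next = t->next;  // snapshot can go stale right here
            // Succeeds whenever top still equals t -- even after an
            // A-B-A recycling, when t->next no longer means what we read.
            if (top.compare_exchange_weak(t, next))
                return t;
        }
        return nullptr;
    }

    Under esm (or an LL/SC that kills the event on any write to the
    line), the intervening pops would terminate the event, so the stale
    snapshot can never be committed.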
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Dec 14 05:13:46 2025
    From Newsgroup: comp.arch

    I am just noticing that the actual physical register name is not
    needed until lookup at the reservation stations. In Qupls4 it can be
    a few clock cycles before the register lookup is done. So, an
    incorrect one could be supplied at the rename stage; it only has to
    be good enough to work out dependencies. Would a sequence-number-based
    register name work (rather than reading a FIFO)? Then it is a matter
    of correcting it later.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Dec 15 12:30:09 2025
    From Newsgroup: comp.arch

    Sure. Of course multi-core systems will not have that hardware guarantee,
    at least not on main memory, for performance reasons.

    SGI's big MIPS supercomputers did, tho. So maybe they could again at
    some point in the future.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Dec 16 19:47:56 2025
    From Newsgroup: comp.arch

    On Wed, 12 Nov 2025 11:47:34 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Tue, 11 Nov 2025 21:34:08 -0600
    BGB <cr88192@gmail.com> wrote:


    Going to/from 128-bit integer adds a few "there be dragons here"
    issues regarding performance.


    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And it helps to
    preserve your sanity.
    There is also a psychological factor at play - your users expect
    division and square root to be slower than other primitive FP
    operations, so they are not disappointed. Possibly they are even
    pleasantly surprised when they find out that the difference in
    throughput between division and multiplication is smaller than the
    factor of 20-30 that they were accustomed to for 'double' on their
    20-year-old Intel and AMD CPUs.


    Today I tested the speed of the gcc implementation of Decimal128
    (BID-encoded, of course) on an Intel Core i7-14700.
    Average time in nsec:
    op       Add  Sub  Mul  Div
    P-Core    33   33   86   76
    E-Core    46   48  121  108

    Counter-intuitively, division is faster than multiplication.
    And both appear much slower than necessary.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Dec 16 17:51:28 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    Today I tested the speed of the gcc implementation of Decimal128
    (BID-encoded, of course) on an Intel Core i7-14700.
    Average time in nsec:
    op       Add  Sub  Mul  Div
    P-Core    33   33   86   76
    E-Core    46   48  121  108

    Counter-intuitively, division is faster than multiplication.
    And both appear much slower than necessary.

    Interesting. Could you provide the benchmark used?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 16 20:43:49 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    I am just noticing that the actual physical register name is not
    needed until lookup at the reservation stations. In Qupls4 it can be
    a few clock cycles before the register lookup is done. So, an
    incorrect one could be supplied at the rename stage; it only has to
    be good enough to work out dependencies. Would a sequence-number-based
    register name work (rather than reading a FIFO)? Then it is a matter
    of correcting it later.


    The renamer gives you 2 pieces of information::
    a) where
    b) when
    There are various implementations that define where and when
    differently, but the important thing is that you KNOW that they
    represent where and when.

    Where is the physical register name (location), which can be in the
    register file(s), the data path, the reorder buffer(s), or the
    instruction stations.

    When has but a few states:: states prior to when a dependent
    instruction can be launched and capture this result, a couple of
    states when the instruction can be launched ..., states after the
    result has landed in the RoB, and a state indicating the value is in
    the register file.

    Where and when interact.

    Given where and when one can organize the instruction queueing, data
    path forwarding, and the pipelining of operands and results.
    -------------------------------------------------------
    Me, personally, I like a physical register file
    a) which is logically indexed for reads,
    b) performs renaming and operand access simultaneously,
    c) which is physically indexed for writes,
    d) is "repaired" with a history table of valid bits.
    e) so, mispredict repair has 0 cycles of latency.
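
    A toy rendering of the where/when split in C++ (my naming, not
    My 66000 or Qupls4 state):

    #include <cstdint>

    // 'when' is a small state machine tracking how close the value is
    // to being readable; 'where' names the location that will hold it.
    enum class When : uint8_t {
        NotReady,     // producer not yet far enough along to forward
        Forwardable,  // result is on the data path: capture it now
        InROB,        // result has landed in the reorder buffer
        InRegFile     // value is at rest in the register file
    };

    struct RenameEntry {
        uint16_t where;  // physical register / RoB slot / station index
        When     when;   // tells a consumer whether to read, snoop, or wait
    };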
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Dec 17 12:02:12 2025
    From Newsgroup: comp.arch

    On Tue, 16 Dec 2025 17:51:28 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    Today I tested the speed of the gcc implementation of Decimal128
    (BID-encoded, of course) on an Intel Core i7-14700.
    Average time in nsec:
    op       Add  Sub  Mul  Div
    P-Core    33   33   86   76
    E-Core    46   48  121  108

    Counter-intuitively, division is faster than multiplication.
    And both appear much slower than necessary.

    Interesting. Could you provide the benchmark used?

    // tb.cpp
    #include <cstdio>
    #include <cstdlib>
    #include <cstdint>
    #include <time.h>
    #include <random>
    #include <algorithm>
    #include <vector>       // was missing; needed for std::vector

    extern "C" {
    void uut(void*, const void*, const void*);
    };

    static inline
    uint64_t umulh(uint64_t a, uint64_t b) {
      return uint64_t(((unsigned __int128)a * b) >> 64);
    }

    int main(int, char**)
    {
      const int N_PAIRS = 1000000;
      const int N_ITER  = 17;
      typedef unsigned __int128 u128;
      std::vector<u128> src(N_PAIRS*2);
      std::mt19937_64 prng(1);
      const unsigned EXP_BIAS  = 6143;
      const unsigned EXP_SHIFT = 113;
      for (int i = 0; i < N_PAIRS*2; ++i) {
        // generate pseudo-random number in range [1e33:1e34-1]
        const uint64_t RNG_LO  = (long long)1e17;
        const uint64_t RNG_HI  = (long long)9e16;
        const uint64_t BASE_HI = (long long)1e16;
        uint64_t lo = umulh(prng(), RNG_LO);           // [0:1e17-1]
        uint64_t hi = umulh(prng(), RNG_HI) + BASE_HI; // [1e16:1e17-1]
        u128 val = (u128)hi*RNG_LO + lo;
        unsigned exp = EXP_BIAS + umulh(prng(), 50) - 25;
        const u128 exp_val = (u128)exp << EXP_SHIFT;
        src[i] = val | exp_val;
      }

      std::vector<long long> dt(N_ITER);
      for (int it = 0; it < N_ITER; ++it) {
        struct timespec t0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        u128 dummy1 = 0;
        const u128* pSrc = src.data();
        for (int i = 0; i < N_PAIRS; ++i) {
          u128 rat;
          uut(&rat, &pSrc[i*2+0], &pSrc[i*2+1]);
          dummy1 ^= rat;
        }
        if (dummy1 == 42)
          printf("Blue Moon\n");
        struct timespec t1;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        dt[it] = (t1.tv_sec - t0.tv_sec)*(long long)(1e9)
               + (long long)t1.tv_nsec - (long long)t0.tv_nsec;
      }
      // find median
      std::nth_element(&dt[0], &dt[N_ITER/2], &dt[N_ITER]);
      long long dt_med = dt[N_ITER/2];
      printf("%.1f nsec\n", (double)dt_med / N_PAIRS);

      return 0;
    }
    // end tb.cpp


    // gcc_dec128add.c
    #include <string.h>

    void uut(void* pRes, const void* pA, const void* pB)
    {
      _Decimal128 a, b, res;
      memcpy(&a, pA, sizeof(a));
      memcpy(&b, pB, sizeof(b));
      res = a + b;
      memcpy(pRes, &res, sizeof(res));
    }
    // end gcc_dec128add.c

    // gcc_dec128sub.c
    #include <string.h>

    void uut(void* pRes, const void* pA, const void* pB)
    {
      _Decimal128 a, b, res;
      memcpy(&a, pA, sizeof(a));
      memcpy(&b, pB, sizeof(b));
      res = a - b;
      memcpy(pRes, &res, sizeof(res));
    }
    // end gcc_dec128sub.c

    // gcc_dec128mul.c
    #include <string.h>

    void uut(void* pRes, const void* pA, const void* pB)
    {
      _Decimal128 a, b, res;
      memcpy(&a, pA, sizeof(a));
      memcpy(&b, pB, sizeof(b));
      res = a * b;
      memcpy(pRes, &res, sizeof(res));
    }
    // end gcc_dec128mul.c

    // gcc_dec128div.c
    #include <string.h>

    void uut(void* pRat, const void* pNum, const void* pDen)
    {
      _Decimal128 den, num, rat;
      memcpy(&num, pNum, sizeof(num));
      memcpy(&den, pDen, sizeof(den));
      rat = num / den;
      memcpy(pRat, &rat, sizeof(rat));
    }
    // end gcc_dec128div.c


    Build script
    COPT="-O2 -Wall -march=haswell -mtune=skylake"
    mkdir -p obj
    mkdir -p out
    g++ -c $COPT tb.cpp -o obj/tb.o
    gcc -c $COPT gcc_dec128add.c -o obj/gcc_dec128add.o
    gcc -c $COPT gcc_dec128sub.c -o obj/gcc_dec128sub.o
    gcc -c $COPT gcc_dec128mul.c -o obj/gcc_dec128mul.o
    gcc -c $COPT gcc_dec128div.c -o obj/gcc_dec128div.o
    g++ -s obj/tb.o obj/gcc_dec128add.o -o out/tst_add.exe
    g++ -s obj/tb.o obj/gcc_dec128sub.o -o out/tst_sub.exe
    g++ -s obj/tb.o obj/gcc_dec128mul.o -o out/tst_mul.exe
    g++ -s obj/tb.o obj/gcc_dec128div.o -o out/tst_div.exe





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Dec 17 13:52:10 2025
    From Newsgroup: comp.arch

    On 12/13/2025 2:03 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    [...]

    Take the following algorithm for a semaphore:

    https://www.haiku-os.org/legacy-docs/benewsletter/Issue1-26.html
    (remember that old one?)

    Okay, on x86/x64 a LOCK XADD makes for a loopless impl. If we are on
    another system and that LOCK XADD is some sort of LL/SC loop, well,
    that does damage to my loopless claim... ;^o

    Also, big deal!, NOTHING inside the LL/SC can/should touch the
    reservation granule of the target, wrt LL/SC!!!
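
    A C++20 rendering of the benaphore from that article (my translation,
    not Be's code); on x86 the fetch_add is exactly the loopless
    LOCK XADD being described:

    #include <atomic>
    #include <semaphore>

    class Benaphore {
        std::atomic<int> count{0};
        std::counting_semaphore<> sem{0};  // starts with no permits
    public:
        void lock() {
            // previous count > 0: someone holds (or waits for) the lock
            if (count.fetch_add(1, std::memory_order_acquire) > 0)
                sem.acquire();
        }
        void unlock() {
            // previous count > 1: at least one waiter to wake
            if (count.fetch_sub(1, std::memory_order_release) > 1)
                sem.release();
        }
    };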

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Dec 18 21:33:10 2025
    From Newsgroup: comp.arch

    Tonight's quandary: can a register file write history be used in the
    reservation stations to reduce the number of register read ports
    required on the register file? And would it be smaller and faster
    than adding register file ports?

    There are a couple of cases:
    1) The operands were ready at time of queue.
    2) The operands were not ready at time of queue. That means there must
    be an outstanding operation that at some point will update the register
    file.

    Operands that are valid at time of queue are stored in the re-order
    buffer temporarily until dispatch. So, it should be possible to update
    the operands in the reservation stations either from the instruction
    dispatch, or later when the value is written to the register file.

    An issue arises between queue and dispatch. If a value was updated in
    the register file between queue and dispatch, then it will not be
    latched by the reservation station, because the station does not have
    the instruction yet. If the reservation station already has the
    instruction waiting, then the register file write port can be used.

    Rather than having more register file read ports, I was thinking of
    having the reservation stations track (snoop) the register file write
    history using a tapped shift register. The only time that must be
    accounted for is between queue and dispatch. So, assuming that
    instructions dispatched within a certain time frame, then using write
    history could work.

    I was thinking a 64 deep shift register with several taps could be used
    to supply written values.

    It is going to be at least two clock cycles between queue and
    dispatch until operands/instructions are updated in the reservation
    stations. So, the first tap would be at shift 3. Then maybe taps at
    6, 12, 24, and 64 clocks.

    There is not usually a huge amount of time between queue and
    dispatch, unless the functional unit is busy. I think the longest
    operation is <80 clocks, so in the worst case a 128-deep shift
    register should work.

    LUTs can be turned into shift registers up to 64 bits deep, IIRC.
    With four 64-bit write ports and a 10-bit register tag, about 300
    LUTs would be required for each shift level.

    The history could be shared between several reservation stations.
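
    A rough behavioural model of the tapped history in C++ (a sketch of
    the idea above, not Qupls4 RTL):

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <optional>

    struct WriteRecord {
        uint16_t tag = 0;      // physical register tag (10 bits above)
        uint64_t value = 0;
        bool     valid = false;
    };

    class WriteHistory {
        std::array<WriteRecord, 128> shift{};  // worst case from above
    public:
        // One register-file write shifted in per clock.
        void clock(const WriteRecord& w) {
            for (std::size_t i = shift.size() - 1; i > 0; --i)
                shift[i] = shift[i - 1];
            shift[0] = w;
        }
        // A station searches only the fixed taps, not all 128 entries,
        // so only writes that landed a tap-distance ago are visible.
        std::optional<uint64_t> lookup(uint16_t tag) const {
            for (std::size_t t : {3u, 6u, 12u, 24u, 64u})
                if (shift[t].valid && shift[t].tag == tag)
                    return shift[t].value;
            return std::nullopt;
        }
    };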



    --- Synchronet 3.21a-Linux NewsLink 1.2