On 11/29/2025 6:29 AM, Robert Finch wrote:
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.
I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.
It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a larger down-count starts to impact the IRQ latency.
Tradeoffs...
I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.
Complex...
A simple alternative that I have seen is to have an instruction that
enables interrupts and jumps to somewhere, probably either the
interrupted code or the dispatcher that might do a full context switch.
The ISR would issue this instruction when it has saved everything that
is necessary to handle the interrupt and thus could be interrupted
again. This minimized the time interrupts are locked out without the
need for an arbitrary timer, etc.
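As a rough illustration of the down-count Robert describes, here is a minimal C sketch; the reload value, names, and the exact reload condition are assumptions for illustration, not Qupls4's actual logic:

#include <stdbool.h>
#include <stdint.h>

#define IRQ_DEFER_CYCLES 10          /* assumed reload: ~40 fetched instructions */

struct irq_defer {
    uint32_t count;                  /* cycles left before IRQs may be accepted */
};

/* An in-flight instruction disabled interrupts, so the fetched ISR
   instructions were flushed; hold off further IRQs for a while. */
static void irq_deferred(struct irq_defer *d)
{
    d->count = IRQ_DEFER_CYCLES;
}

/* Called once per clock in which the front end advances. */
static void irq_defer_tick(struct irq_defer *d)
{
    if (d->count > 0)
        d->count--;
}

/* A pending request may be accepted only after the window has expired. */
static bool irq_may_accept(const struct irq_defer *d, bool irq_pending,
                           bool irqs_enabled)
{
    return irq_pending && irqs_enabled && d->count == 0;
}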
Robert Finch <robfi680@gmail.com> posted:
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being deferred
because interrupts got disabled by an instruction in the pipeline. I
guessed 40 instructions would likely be enough for many cases where IRQs
are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not be
committed because the IRQs got disabled in the meantime. If the CPU were
allowed to accept another IRQ right away, it could get stuck in a loop
flushing the pipeline and reloading with the ISR routine code instead of
progressing through the code where IRQs were disabled.
The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instructions in the pipe
are allowed to retire (apace) and new instructions are inserted from
the interrupt service point.
As long as the instructions "IN" the pipe
can deliver their results to their registers, and update µArchitectural state they "own", there is no reason to flush--AND--no corresponding
reason to delay "taking" the interrupt.
At the µArchitectural level, you, the designer, see both the front
and the end of the pipeline, you can change what goes in the front
and allow what was already in the pipe to come out the back. This
requires dragging a small amount of information down the pipe, much
like multi-threaded CPUs.
I could create a control register for this count and allow it to be
programmable. But I think that may not be necessary.
Make the problem "go away". You will be happier in the end.
It is possible that 40 instructions is not enough. In that case the CPU
would advance in 40 instruction burps. Alternating between fetching ISR
instructions and the desired instruction stream. On the other hand, a
larger down-count starts to impact the IRQ latency.
Tradeoffs...
I suppose I could have the CPU increase the down-count if it is looping
around fetching ISR instructions. The down-count would be reset to the
minimum again once an interrupt enable instruction is executed.
Complex...
On 2025-11-29 2:05 p.m., MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being deferred
because interrupts got disabled by an instruction in the pipeline. I
guessed 40 instructions would likely be enough for many cases where IRQs
are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not be
committed because the IRQs got disabled in the meantime. If the CPU were
allowed to accept another IRQ right away, it could get stuck in a loop
flushing the pipeline and reloading with the ISR routine code instead of
progressing through the code where IRQs were disabled.
The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instruction in the pipe
are allowed to retire (apace) and new instructions are inserted from
the interrupt service point.
That is how Qupls is working too. The issue is what happens when the instruction in the pipe before the ISR disables the interrupt. Then the
ISR instructions need to be flushed.
As long as the instructions "IN" the pipe
can deliver their results to their registers, and update µArchitectural state they "own", there is no reason to flush--AND--no corresponding
reason to delay "taking" the interrupt.
That is the usual case for Qupls too when there is an interrupt.
At the µArchitectural level, you, the designer, see both the front
and the end of the pipeline, you can change what goes in the front
and allow what was already in the pipe to come out the back. This
requires dragging a small amount of information down the pipe, much
like multi-threaded CPUs.
Yes, the IRQ info is being dragged down the pipe.
I could create a control register for this count and allow it to be
programmable. But I think that may not be necessary.
Make the problem "go away". You will be happier in the end.
It is possible that 40 instructions is not enough. In that case the CPU
would advance in 40 instruction burps. Alternating between fetching ISR
instructions and the desired instruction stream. On the other hand, a
larger down-count starts to impact the IRQ latency.
Tradeoffs...
I suppose I could have the CPU increase the down-count if it is looping
around fetching ISR instructions. The down-count would be reset to the
minimum again once an interrupt enable instruction is executed.
Complex...
The interrupt mask is set at fetch time to disable lower priority interrupts. I suppose disabling of interrupts by the OS could simply be ignored. The interrupt could only be taken if it is a higher priority
than the current level.
I had thought the OS might have good reason to disable interrupts. But
maybe I am making things too complex.
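A minimal sketch of the priority gate Robert mentions, with assumed names: the request is taken only if it outranks the level captured at fetch, and an OS-level disable is simply not consulted.

#include <stdbool.h>
#include <stdint.h>

struct irq_state {
    uint8_t current_level;       /* priority level of the code now running */
};

static bool take_irq(const struct irq_state *s, uint8_t request_level)
{
    /* Only the priority comparison matters; a DI by the OS is ignored. */
    return request_level > s->current_level;
}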
Robert Finch wrote:
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.
I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.
It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a larger down-count starts to impact the IRQ latency.
Tradeoffs...
I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.
Complex...
You are using this timer to predict the delay for draining the pipeline.
It would only take a read of a slow IO device register to exceed it.
I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are any interrupt
enable or disable instructions, or branch mispredicts, or pending exceptions in-flight they all are allowed to finish and the state to settle down.
Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.
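A toy C model of that drain-then-redirect sequence, with invented names; the real mechanism would live in the Decode and retire hardware, so this is only meant to pin the idea down:

#include <stdbool.h>

enum fe_state { RUNNING, DRAINING };

struct frontend {
    enum fe_state state;
};

/* Interrupt request arrives: emit one marker uOp (tagged like a
   single-step trap) and stall Decode behind it. */
static void on_irq_request(struct frontend *fe)
{
    if (fe->state == RUNNING)
        fe->state = DRAINING;
}

/* Once everything older has retired, the state has settled and Fetch
   can be redirected to the handler. */
static void on_retire(struct frontend *fe, bool pipeline_empty)
{
    if (fe->state == DRAINING && pipeline_empty)
        fe->state = RUNNING;    /* redirect fetch to the handler here */
}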
On 2025-11-29 4:10 p.m., EricP wrote:
Robert Finch wrote:
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
count delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being
deferred because interrupts got disabled by an instruction in the
pipeline. I guessed 40 instructions would likely be enough for many
cases where IRQs are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not
be committed because the IRQs got disabled in the meantime. If the CPU
were allowed to accept another IRQ right away, it could get stuck in a
loop flushing the pipeline and reloading with the ISR routine code
instead of progressing through the code where IRQs were disabled.
I could create a control register for this count and allow it to be
programmable. But I think that may not be necessary.
It is possible that 40 instructions is not enough. In that case the
CPU would advance in 40 instruction burps. Alternating between
fetching ISR instructions and the desired instruction stream. On the
other hand, a larger down-count starts to impact the IRQ latency.
Tradeoffs...
I suppose I could have the CPU increase the down-count if it is
looping around fetching ISR instructions. The down-count would be
reset to the minimum again once an interrupt enable instruction is
executed.
Complex...
You are using this timer to predict the delay for draining the pipeline.
It would only take a read of a slow IO device register to exceed it.
The down count is counting down only when the front-end of the pipeline advances, so instructions are sure to be loaded.
I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are any interrupt enable or disable instructions, or branch mispredicts, or pending exceptions
in-flight they all are allowed to finish and the state to settle down.
Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.
The base down count increases every time the IRQ is found at the commit stage. If the base down count is too large (stuck interrupt) then an exception is processed. For instance if interrupts were disabled for
1000 clocks.
I think the mechanism could work, complicated though.
Treating the DI as an exception, as mentioned in another post, would also work. It is a matter then of flushing the instructions between the DI
and ISR.
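A sketch of that escalation in C, purely to make the rule concrete; the doubling policy, the names, and the 1000-clock limit are placeholders drawn from the discussion, not the actual Qupls4 values:

#include <stdint.h>

#define DOWNCOUNT_MIN    10
#define DOWNCOUNT_LIMIT  1000

struct irq_backoff {
    uint32_t base;                       /* current reload value */
};

/* The deferred IRQ reached commit again without being accepted. */
static int on_irq_seen_at_commit(struct irq_backoff *b)
{
    b->base *= 2;                        /* grow the defer window (policy assumed) */
    if (b->base > DOWNCOUNT_LIMIT)
        return -1;                       /* stuck interrupt: raise an exception */
    return 0;
}

/* An interrupt-enable instruction resets the window to the minimum. */
static void on_interrupt_enable(struct irq_backoff *b)
{
    b->base = DOWNCOUNT_MIN;
}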
Thomas Koenig <tkoenig@netcologne.de> writes:
(Looking at your
code, it also does not seem to be self-sufficient, at least the
numerous SKIP4 statements require something else).
If you want to assemble the resulting .S file, it's assembled once
with
-DSKIP4= -Dgforth_engine2=gforth_engine
and once with
-DSKIP4=".skip 4"
(on Linux-GNU AMD64, the .skip assembler directive is autoconfigured
and may be different on other platforms).
My assumption is that the control flow is confusing gcc.
My guess is the same.
Robert Finch <robfi680@gmail.com> posted:
On 2025-11-29 4:10 p.m., EricP wrote:
Robert Finch wrote:
I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
count delays accepting interrupts for ten clock cycles or about 40
instructions if an interrupt got deferred. The interrupt being
deferred because interrupts got disabled by an instruction in the
pipeline. I guessed 40 instructions would likely be enough for many
cases where IRQs are disabled then enabled again.
The issue is the pipeline is full of ISR instructions that should not
be committed because the IRQs got disabled in the meantime. If the CPU
were allowed to accept another IRQ right away, it could get stuck in a
loop flushing the pipeline and reloading with the ISR routine code
instead of progressing through the code where IRQs were disabled.
I could create a control register for this count and allow it to be
programmable. But I think that may not be necessary.
It is possible that 40 instructions is not enough. In that case the
CPU would advance in 40 instruction burps. Alternating between
fetching ISR instructions and the desired instruction stream. On the
other hand, a larger down-count starts to impact the IRQ latency.
Tradeoffs...
I suppose I could have the CPU increase the down-count if it is
looping around fetching ISR instructions. The down-count would be
reset to the minimum again once an interrupt enable instruction is
executed.
Complex...
You are using this timer to predict the delay for draining the pipeline.
It would only take a read of a slow IO device register to exceed it.
The down count is counting down only when the front-end of the pipeline
advances, so instructions are sure to be loaded.
I was thinking a simple and cheap way would be to use a variation of the
single-step mechanism. An interrupt request would cause Decode to emit a
special uOp with the single-step flag set and then stall, to allow the
pipeline to drain the old stream before accepting the interrupt and
redirecting Fetch to its handler. That way if there are any interrupt
enable or disable instructions, or branch mispredicts, or pending
exceptions in-flight they all are allowed to finish and the state to
settle down.
Pipelining interrupt delivery looks possible but gets complicated and
expensive real quick.
The base down count increases every time the IRQ is found at the commit
stage. If the base down count is too large (stuck interrupt) then an
exception is processed. For instance if interrupts were disabled for
1000 clocks.
I think the mechanism could work, complicated though.
Treating the DI as an exception, as mentioned in another post, would also
work. It is a matter then of flushing the instructions between the DI
and ISR.
Which is no different than flushing instructions after a mispredicted branch.
Got fed up with trying to work out how to get interrupts working. It turns
out to be more challenging than I expected, no matter which way it is
done. So, I decided to just poll for interrupts, getting rid of most of
the IRQ logic. I added a branch-on-interrupt BOI instruction that works almost the same way as every other branch. Then the micro-op translator
has been adapted to insert a polling branch periodically. It looks a lot simpler.
Robert Finch <robfi680@gmail.com> schrieb:
Got fed up with trying to work out how get interrupts working. It turns
out to be more challenging than I expected, no matter which way it is
done. So, I decided to just poll for interrupts, getting rid of most of
the IRQ logic. I added a branch-on-interrupt BOI instruction that works
almost the same way as every other branch. Then the micro-op translator
has been adapted to insert a polling branch periodically. It looks a lot
simpler.
What is the expected delay until an interrupt is delivered?
On 2025-11-30 5:10 a.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Got fed up with trying to work out how get interrupts working. It turns
out to be more challenging than I expected, no matter which way it is
done. So, I decided to just poll for interrupts, getting rid of most of
the IRQ logic. I added a branch-on-interrupt BOI instruction that works
almost the same way as every other branch. Then the micro-op translator
has been adapted to insert a polling branch periodically. It looks a lot
simpler.
What is the expected delay until an interrupt is delivered?
I set the timing to 16 clocks, which is about 64 (or more) instructions.
I did not want to go much over 1% of the number of instructions executed.
Not every instruction inserts a poll, so sometimes a poll is lacking.
IDK how well it will work. Making it an instruction means it might also
be used by software.
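For concreteness, a small sketch of how a translator might decide when to slip in a BOI poll; the 16-clock interval follows the post, while the names and the slot_available condition are assumptions:

#include <stdbool.h>
#include <stdint.h>

#define POLL_INTERVAL 16     /* clocks between inserted polls (~64+ instructions) */

struct translator {
    uint32_t clocks_since_poll;
};

/* Returns true when a BOI micro-op should be emitted ahead of the next
   translated instruction; some opportunities are skipped, so a poll can
   occasionally be late. */
static bool maybe_insert_boi(struct translator *t, uint32_t clocks_elapsed,
                             bool slot_available)
{
    t->clocks_since_poll += clocks_elapsed;
    if (t->clocks_since_poll >= POLL_INTERVAL && slot_available) {
        t->clocks_since_poll = 0;
        return true;
    }
    return false;
}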
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Both our guesses were wrong, and Scott (I think) was on the right
track - this is a signed / unsigned issue. A reduced test case is
void bar(unsigned long, long);
void foo(unsigned long u1)
{
long u3;
u1 = u1 / 10;
u3 = u1 % 10;
bar(u1,u3);
}
This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .
100 labels in symbols. This may appear strange, but gcc generally tends to produce good code in relatively short time for Gforth (while
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Both our guesses were wrong, and Scott (I think) was on the right
track - this is a signed / unsigned issue. A reduced test case is
void bar(unsigned long, long);
void foo(unsigned long u1)
{
long u3;
u1 = u1 / 10;
u3 = u1 % 10;
bar(u1,u3);
}
Assigning to u1 changed the meaning, as Andrew Pinski noted;
so the
jury is still out on what the actual problem is.
This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .
and a revised one at
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122919>
(The announced attachment is not there yet.)
The latter case is interesting, because real_ca and spc became global,
and symbols[] is still local, and no assignment to real_ca happens
inside foo().
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about
pipelines, but thought it was fetch-decode-execute.
Why would a CS graduate need to know about pipelines?
So they can properly simulate a pipelined processor?
Sure, if a CS graduate works in an application area, they need to
learn about that application area, whatever it is.
It's useful for code optimization, as well.
In general,
any programmer should have a solid understanding of the
underlying hardware - generically, and specifically
for the hardware being programmed.
Processor pipelines are not the basics of what a CS graduate is doing.
They are an implementation detail in computer engineering.
Which affect the performance of the software created by the
software engineer (CS graduate).
A few more examples where compilers are not as good as even I expected:
Just today, I compiled
u4 = u1/10;
u3 = u1%10;
(plus some surrounding code) with gcc-14 in three contexts. Here's
the code for two of them (the third one is similar to the second one):
movabs $0xcccccccccccccccd,%rax movabs $0xcccccccccccccccd,%rsi
sub $0x8,%r13 mov %r8,%rax
mul %r8 mov %r8,%rcx
mov %rdx,%rax mul %rsi
shr $0x3,%rax shr $0x3,%rdx
lea (%rax,%rax,4),%rdx lea (%rdx,%rdx,4),%rax
add %rdx,%rdx add %rax,%rax
sub %rdx,%r8 sub %rax,%r8
mov %r8,0x8(%r13) mov %rcx,%rax
mov %rax,%r8 mul %rsi
shr $0x3,%rdx
mov %rdx,%r9
The major difference is that in the left context, u3 is stored into
memory (at 0x8(%r13)), while in the right context, it stays in a
register. In the left context, gcc managed to base its computation of u1%10 on the result of u1/10; in the right context, gcc first computes u1%10 (computing u1/10 as part of that), and then computes u1/10
again.
Sort of emphasizes that programmers need to understand the
underlying hardware.
What were u1, u3 and u4 declared as?
In reducing compiler bugs, automated tools such as delta or
(much better) cvise are essential. Your test case was so
large that cvise failed, so a lot of manual work was required.
Thomas Koenig <tkoenig@netcologne.de> writes:
In reducing compiler bugs, automated tools such as delta or
(much better) cvise are essential. Your test case was so
large that cvise failed, so a lot of manual work was required.
I have now done a manual reduction myself; essentially I left only the
3 variants of the VM instruction that performs 10/, plus all the surroundings, and I added code to ensure that spTOS, spb, and spc are
not dead. You find the result at
http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i
ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about
pipelines, but thought it was fetch-decode-execute.
Why would a CS graduate need to know about pipelines?
So they can properly simulate a pipelined processor?
Sure, if a CS graduate works in an application area, they need to
learn about that application area, whatever it is.
It's useful for code optimization, as well.
In what way?
In general,
any programmer should have a solid understanding of the
underlying hardware - generically, and specifically
for the hardware being programmed.
Certainly. But do they need to know the difference between a Wallace
multiplier and a Dadda multiplier?
If not, what is it about pipelined processors
that would require CS graduates to know about them?
Processor pipelines are not the basics of what a CS graduate is doing.
They are an implementation detail in computer engineering.
Which affect the performance of the software created by the
software engineer (CS graduate).
By a constant factor; and the software creator does not need to know
that the CPU that executes instructions at 2 CPI (486) instead of at
10 CPI (VAX-11/780) is pipelined; and these days both the 486 and the
VAX are irrelevant to software creators.
A few more examples where compilers are not as good as even I expected:
Just today, I compiled
u4 = u1/10;
u3 = u1%10;
(plus some surrounding code) with gcc-14 in three contexts. Here's
the code for two of them (the third one is similar to the second one):
movabs $0xcccccccccccccccd,%rax movabs $0xcccccccccccccccd,%rsi
sub $0x8,%r13 mov %r8,%rax
mul %r8 mov %r8,%rcx
mov %rdx,%rax mul %rsi
shr $0x3,%rax shr $0x3,%rdx
lea (%rax,%rax,4),%rdx lea (%rdx,%rdx,4),%rax
add %rdx,%rdx add %rax,%rax
sub %rdx,%r8 sub %rax,%r8
mov %r8,0x8(%r13) mov %rcx,%rax
mov %rax,%r8 mul %rsi
shr $0x3,%rdx
mov %rdx,%r9
The major difference is that in the left context, u3 is stored into memory (at 0x8(%r13)), while in the right context, it stays in a register. In the left context, gcc managed to base its computation of u1%10 on the result of u1/10; in the right context, gcc first computes u1%10 (computing u1/10 as part of that), and then computes u1/10
again.
Sort of emphasizes that programmers need to understand the
underlying hardware.
I am the programmer of the code shown above. In what way would better knowledge of the hardware have made me aware that gcc would produce
suboptimal code in some cases?
What were u1, u3 and u4 declared as?
unsigned long (on that platform).
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
I recently heard that CS graduates from ETH Zürich had heard about
pipelines, but thought it was fetch-decode-execute.
Why would a CS graduate need to know about pipelines?
So they can properly simulate a pipelined processor?
Sure, if a CS graduate works in an application area, they need to
learn about that application area, whatever it is.
It's useful for code optimization, as well.
In what way?
In general,
any programmer should have a solid understanding of the
underlying hardware - generically, and specifically
for the hardware being programmed.
Certainly. But do they need to know the difference between a Wallace
multiplier and a Dadda multiplier?
You do realize that all Wallace multipliers are Dadda multipliers ??
But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!
If not, what is it about pipelined processors
that would require CS graduates to know about them?
How execution order disturbs things like program order and memory order.
That is how and when they need to insert Fences in their multi-threaded
code.
I am the programmer of the code shown above. In what way would better
knowledge of the hardware made me aware that gcc would produce
suboptimal code in some cases?
Reading and thinking about the asm-code and running the various code sequences enough times that you can measure which is better and which
is worse. That is the engineering part of software Engineering.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
In reducing compiler bugs, automated tools such as delta or
(much better) cvise are essential. Your test case was so
large that cvise failed, so a lot of manual work was required.
I have now done a manual reduction myself; essentially I left only the
3 variants of the VM instruction that performs 10/, plus all the
surroundings, and I added code to ensure that spTOS, spb, and spc are
not dead. You find the result at
http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i
Do you have an example which tests the codepath taken for the
offending piece of code,
so it is possible to further reduce this
case automatically? The example is still quite big (>13000 lines).
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:
scott@slp53.sl.home (Scott Lurndal) writes:
In general,
any programmer should have a solid understanding of the
underlying hardware - generically, and specifically
for the hardware being programmed.
Certainly. But do they need to know the difference between a Wallace
multiplier and a Dadda multiplier?
You do realize that all Wallace multipliers are Dadda multipliers ??
But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!
Good to know, but does not answer the question.
If not, what is it about pipelined processors
that would require CS graduates to know about them?
How execution order disturbs things like program order and memory order.
That is how and when they need to insert Fences in their multi-threaded
code.
And the relevance of pipelined processors for that issue is what?
Memory-ordering shenanigans come from the unholy alliance of
cache-coherent multiprocessing and the supercomputer attitude.
If you implement per-CPU caches and multiple memory controllers as shoddily
as possible while providing features for programs to slow themselves
down heavily in order to get memory-ordering guarantees, then you get
a weak memory model; slightly less shoddy, and you get a "strong" memory model. Processor pipelines have no relevance here.
And, as Niklas Holsti observed, dealing with memory-ordering
shenanigans is something that a few specialists do; no need for others
to know about the memory model, except that common CPUs unfortunately
do not implement sequential consistency.
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Memory-ordering shenanigans come from the unholy alliance of
cache-coherent multiprocessing and the supercomputer attitude.
And without the SuperComputer attitude, you sell 0 parts.
{Remember how we talk about performance all the time here ?}
And only after several languages built their own ATOMIC primitives, so
the programmers could remain ignorant. But this also ties the hands of
the designers in such a way that performance grows ever more slowly
with more threads.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Memory-ordering shenanigans come from the unholy alliance of
cache-coherent multiprocessing and the supercomputer attitude.
And without the SuperComputer attitude, you sell 0 parts.
{Remember how we talk about performance all the time here ?}
Wrong. The supercomputer attitude gave us such wonders as IA-64
(sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
only easier to program, but also faster.
The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
fences on hardware optimized for a weaker memory model. But that's
not the way to implement efficient sequential consistency.
In an alternate reality where AMD64 did not happen and IA-64 won,
people would justify the IA-64 ISA complexity as necessary for
performance, and claim that the IA-32 hardware in the Itanium
demonstrates the performance superiority of the EPIC approach, just
like they currently justify the performance superiority of weak and
"strong" memory models over sequential consistency.
If hardware designers put their mind to it, they could make sequential consistency perform well,
probably better on code that actually
accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
fences and the like in cases where only one thread accesses the data,
and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
reads in temporal proximity. And all the fences etc. that are
inserted just in case would also become fast (noops).
A similar case: Alpha includes a trapb instruction (an exception
fence). Programmers have to insert it after FP instructions to get
precise exceptions. This was justified with performance; i.e., the
theory went: If you compile without trapb, you get performance and
imprecise exceptions, if you compile with trapb, you get slowness and
precise exceptions. I then measured SPEC 95 compiled without and with
trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
there was hardly any difference; I believe that trapb is a noop on the
21264. Here's the SPECfp_base95 numbers:
with without
trapb trapb
9.56 11.6 AlphaPC164LX 600MHz 21164A
19.7  20.0  Compaq XP1000 500MHz 21264
So the machine that needs trapb is much slower even without trapb than
even the with-trapb variant on the machine where trapb is probably a
noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.
And only after several languages built their own ATOMIC primitives, so
the programmers could remain ignorant. But this also ties the hands of
the designers in such a way that performance grows ever more slowly
with more threads.
Maybe they could free their hands by designing for a
sequential-consistency interface, just like designing for a simple sequential-execution model without EPIC features freed their hands to
design microarchitectural features that allowed ordinary code to
utilize wider and wider OoO cores profitably.
- anton
Anton Ertl wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Memory-ordering shenanigans come from the unholy alliance of
cache-coherent multiprocessing and the supercomputer attitude.
And without the SuperComputer attitude, you sell 0 parts.
{Remember how we talk about performance all the time here ?}
Wrong. The supercomputer attitude gave us such wonders as IA-64
(sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
only easier to program, but also faster.
The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
fences on hardware optimized for a weaker memory model. But that's
not the way to implement efficient sequential consistency.
In an alternate reality where AMD64 did not happen and IA-64 won,
people would justify the IA-64 ISA complexity as necessary for
performance, and claim that the IA-32 hardware in the Itanium
demonstrates the performance superiority of the EPIC approach, just
like they currently justify the performance superiority of weak and "strong" memory models over sequential consistency.
If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
fences and the like in cases where only one thread accesses the data,
and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
reads in temporal proximity. And all the fences etc. that are
inserted just in case would also become fast (noops).
A similar case: Alpha includes a trapb instruction (an exception
fence). Programmers have to insert it after FP instructions to get
precise exceptions. This was justified with performance; i.e., the
theory went: If you compile without trapb, you get performance and imprecise exceptions, if you compile with trapb, you get slowness and precise exceptions. I then measured SPEC 95 compiled without and with trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264 there was hardly any difference; I believe that trapb is a noop on the 21264. Here's the SPECfp_base95 numbers:
with without
trapb trapb
9.56 11.6 AlphaPC164LX 600MHz 21164A
19.7 20.0 Compaq XP1000 500MHz 21264
So the machine that needs trapb is much slower even without trapb than
even the with-trapb variant on the machine where trapb is probably a
noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.
The 21264 Hardware Reference Manual says TRAPB (general exception barrier) and EXCB (floating point control register barrier) are both NOP's
internally, are tossed at decode, and don't even take up an
instruction slot.
The purpose of the EXCB is to synchronize pipeline access to the
floating point control and status register with FP operations.
In the worst case this stalls until the pipeline drains.
I wonder how much logic it really saved allowing imprecise exceptions
in the InO 21064 and 21164?
Conversely, how much did it cost to deal
with problems caused by leaving these interlocks off?
The cores have multiple, parallel pipelines for int, lsq, fadd and fmul. Without exception interlocks, each pipeline only obeys the scoreboard
rules for when to writeback its result register: WAW and WAR.
That allows a younger, faster instruction to finish and write its register before an older, slower instruction. If that older instruction then throws
an exception and does not write its register then we can see the out of
order register writes.
For register file writes to be precise in the presence of exceptions
requires each instruction look ahead at the state of all older
instructions *in all pipelines*.
Each uOp can be Unresolved, Resolved_Normal, or Resolved_Exception.
A writeback can occur if there are no WAW or WAR dependencies,
and all older uOps are Resolved_Normal.
Just off the top of my head, in addition to the normal scoreboard,
a FIFO buffer with a priority selector could be used to look ahead
at all older uOps and check their status,
and allow or stall uOp
writebacks and ensure registers always appear precise.
Which really doesn't look that expensive.
Is there something I missed, or would that FIFO suffice?
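A toy C model of that check, just to make the rule explicit; the FIFO is an age-ordered array here and hazard detection is reduced to flags, so this is a sketch of the idea rather than the actual 21x64-era logic:

#include <stdbool.h>

enum uop_status { UNRESOLVED, RESOLVED_NORMAL, RESOLVED_EXCEPTION };

struct uop {
    enum uop_status status;
    bool waw_hazard;        /* WAW against an older writer of the same register */
    bool war_hazard;        /* WAR against an older reader */
};

/* fifo[0..idx-1] are older than fifo[idx]. Writeback is allowed only if
   there is no WAW/WAR hazard and every older uOp has resolved normally,
   so register writes always appear precise. */
static bool may_writeback(const struct uop *fifo, int idx)
{
    if (fifo[idx].waw_hazard || fifo[idx].war_hazard)
        return false;
    for (int i = 0; i < idx; i++)
        if (fifo[i].status != RESOLVED_NORMAL)
            return false;
    return true;
}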
Semi-unaligned memory tradeoff. If unaligned access is required, the
memory logic just increments the physical address by 64 bytes to fetch
the next cache line. The issue with this is it does not go backwards to
get the address fetched again from the TLB. Meaning no check is made for protection or translation of the address.
It would be quite slow to have the instructions reissued and percolate
down the cache access again.
This should only be an issue if an unaligned access crosses a memory
page boundary.
The instruction causes an alignment fault if a page-boundary crossing is detected.
Robert Finch <robfi680@gmail.com> posted:
Semi-unaligned memory tradeoff. If unaligned access is required, the
memory logic just increments the physical address by 64 bytes to fetch
the next cache line. The issue with this is it does not go backwards to
get the address fetched again from the TLB. Meaning no check is made for
protection or translation of the address.
You can determine if an access is misaligned "enough" to warrant two
trips down the pipe.
a) crosses cache width
b) crosses page boundary
Case b ALWAYS needs 2 trips; so the mechanism HAS to be there.
It would be quite slow to have the instructions reissued and percolate
down the cache access again.
An AGEN-like adder has 11 gates of delay; you can determine misaligned
in 4 gates of delay.
This should only be an issue if an unaligned access crosses a memory
page boundary.
Here you need to access the TLB twice.
The instruction causes an alignment fault if a page cross boundary is
detected.
probably not as wise as you think.
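The two tests themselves are cheap; a minimal C version of cases a and b from above, with line and page sizes assumed for the example:

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64          /* cache line width used in this thread */
#define PAGE_BYTES 4096        /* page size assumed for illustration */

/* Case a: the access spills into the next cache line. */
static bool crosses_line(uint64_t vaddr, unsigned size)
{
    return (vaddr % LINE_BYTES) + size > LINE_BYTES;
}

/* Case b: the access spills into the next page, so a second TLB lookup
   (and protection check) is unavoidable. */
static bool crosses_page(uint64_t vaddr, unsigned size)
{
    return (vaddr % PAGE_BYTES) + size > PAGE_BYTES;
}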
On Mon, 01 Dec 2025 07:56:37 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
If hardware designers put their mind to it, they could make sequential
consistency perform well, probably better on code that actually
accesses data shared between different threads than weak and "strong"
ordering, because there is no need to slow down the program with
fences and the like in cases where only one thread accesses the data,
and in cases where the data is read by all threads. You will see the
slowdown only in run-time cases when one thread writes and another
reads in temporal proximity. And all the fences etc. that are
inserted just in case would also become fast (noops).
Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?
More so, where it simplifies over ARMv8.1-A, assuming that programmer
does not try to be too smart and never uses LL/SC and always uses
8.1-style synchronization instructions with Acquire+Release flags set?
IMHO, the only simple thing about sequential consistency is simple
description. Other than that, it simplifies very little. It does not
magically make lockless multithreaded programming bearable to
non-genius coders.
In article <20251201132322.000051a5@yahoo.com>,
Michael S <already5chosen@yahoo.com> wrote:
On Mon, 01 Dec 2025 07:56:37 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
If hardware designers put their mind to it, they could make sequential
consistency perform well, probably better on code that actually
accesses data shared between different threads than weak and "strong"
ordering, because there is no need to slow down the program with
fences and the like in cases where only one thread accesses the data,
and in cases where the data is read by all threads. You will see the
slowdown only in run-time cases when one thread writes and another
reads in temporal proximity. And all the fences etc. that are
inserted just in case would also become fast (noops).
Where does sequential consistency simplify programming over the x86 model
of "TSO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?
More so, where it simplifies over ARMv8.1-A, assuming that programmer
does not try to be too smart and never uses LL/SC and always uses
8.1-style synchronization instructions with Acquire+Release flags set?
IMHO, the only simple thing about sequential consistency is simple
description. Other than that, it simplifies very little. It does not
magically make lockless multithreaded programming bearable to
non-genius coders.
Compiler writers have hidden behind the hardware complexity to make
writing source code that is thread-safe much harder than it should be.
If you have to support placing hardware barriers, then the languages
can get away with needing lots of <atomic> qualifiers everywhere, even
on systems which don't need barriers, making the code more complex. And
language purists still love to sneer at volatile in C-like languages as "providing no guarantees, and so is essentially useless"--when volatile providing no guarantees is a language and compiler choice, not something written in stone.
A bunch of useful algorithms could be written with
merely "volatile" like semantics, but for some reason, people like the line-noise-like junk of C++ atomics, where rather than thinking in terms
of the algorithm, everyone needs to think in terms of release and acquire. (Which are weakly-ordering concepts).
Kent
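For readers following along, here is roughly what the two styles look like in C11 for a one-producer/one-consumer handoff; this is only an illustration of the contrast Kent describes, with made-up names, not an endorsement of either style:

#include <stdatomic.h>
#include <stdbool.h>

static int payload;
static volatile bool ready_v;     /* "volatile flag" style */
static atomic_bool ready_a;       /* C11 atomics style */

void publish_volatile(int value)
{
    payload = value;
    ready_v = true;               /* C gives no cross-thread ordering here */
}

void publish_atomic(int value)
{
    payload = value;
    atomic_store_explicit(&ready_a, true, memory_order_release);
}

bool consume_atomic(int *out)
{
    if (!atomic_load_explicit(&ready_a, memory_order_acquire))
        return false;
    *out = payload;               /* acquire pairs with the release above */
    return true;
}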
kegs@provalid.com (Kent Dickey) posted:
Thread-safe, by definition, is (IS) harder.
language purists still love to sneer at volatile in C-like languages as
"providing no guarantees, and so is essentially useless"--when volatile
providing no guarantees is a language and compiler choice, not something
written in stone.
The problem with volatile is that all it means is that every time a volatile variable is touched, the code has to have a corresponding LD or ST. The HW ends up knowing nothing about the value's volatility and ends up in no position to help.
"volatile" /does/ provide guarantees - it just doesn't provide enough >guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
and I don't think that C with just volatile gives you such guarantees.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
and I don't think that C with just volatile gives you such guarantees.
- anton
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
Even with a single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.
and I don't think that C with just volatile gives you such guarantees.
- anton
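A two-line illustration of the single-core point above: a plain (even volatile) increment is a load-add-store that an interrupt can split, whereas an atomic read-modify-write has no such window. The names are made up:

#include <stdatomic.h>

static volatile int hits_v;       /* incremented by both mainline and handler */
static atomic_int hits_a;

void bump_from_main(void)    { hits_v++; }     /* racy: an update can be lost */
void bump_from_handler(void) { hits_v++; }

void bump_safely(void)       { atomic_fetch_add(&hits_a, 1); }   /* no lost updates */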
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
Even with a
single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.
and I don't think that C with just volatile gives you such guarantees.
- anton
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count.
Robert Finch <robfi680@gmail.com> writes:
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count.
My impression is that modern implementations deal with this kind of
stuff at decoding or in the renamer. That should reduce the number of
places where it is special-cased to one, but it means that the uops
have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
in the uops.
- anton
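A rough C sketch of the renamer approach described above, assuming a dedicated physical register that always holds zero and is never put on the free list (PHYS_ZERO, NO_DEST and the table layout are made up for illustration):

#include <stdint.h>

#define PHYS_ZERO 0      /* physical register hard-wired to 0, never allocated */
#define NO_DEST   0xFF   /* sentinel: this uop's result is simply dropped      */

typedef struct {
    uint8_t map[32];     /* architectural -> physical register mapping */
} rename_table_t;

/* Source operands: architectural r0 always maps to PHYS_ZERO, so no later
   pipeline stage needs an r0 special case. */
static inline uint8_t rename_src(const rename_table_t *rt, uint8_t areg) {
    return (areg == 0) ? PHYS_ZERO : rt->map[areg];
}

/* Destinations: a write to r0 is discarded rather than allocated, so
   PHYS_ZERO itself is never overwritten. */
static inline uint8_t rename_dst(rename_table_t *rt, uint8_t areg,
                                 uint8_t fresh_preg) {
    if (areg == 0)
        return NO_DEST;
    rt->map[areg] = fresh_preg;
    return fresh_preg;
}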
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Even with a
single core system you can have pre-emptive multi-threading, or at least
interrupt routines that may need to cooperate with other tasks on data.
and I don't think that C with just volatile gives you such guarantees. >>>>
- anton
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
On 12/5/2025 12:54 PM, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Any issues with live lock in here?
[...]
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count. Removing the bypassing of r0 from the register file shaved 1000
LUTs off the design. This is no real loss as most instructions can substitute small constants for register values.
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of places. Otherwise r0 can be used as an ordinary register. Load / store instructions cannot use r0 as a GPR then, but it works for the PowerPC.
I hit this trying to decide where to bypass another register code to represent the instruction pointer. In that case I think it may be better
to go RISCV style and just add an instruction to add the IP to a
constant and place it in a register. The alternative might be to
sacrifice a bit of displacement to indicate IP relative addressing.
Anyone got a summary of bypassing r0 in different architectures?
Robert Finch <robfi680@gmail.com> writes:
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count.
My impression is that modern implementations deal with this kind of
stuff at decoding or in the renamer. That should reduce the number of
places where it is special-cased to one, but it means that the uops
have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
in the uops.
- anton
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:Such as ????
"volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage >>>>> queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering >>>> on hardware with weaker memory ordering than sequential consistency". >>>> If hardware guaranteed sequential consistency, volatile would provide >>>> guarantees that are as good on multi-core machines as on single-core >>>> machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems), >>>
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
It is not easy to have atomic or lock mechanisms on multi-core systems
that are convenient to use, efficient even in the worst cases, and don't require additional hardware.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Of course. It is interesting to speculate about possible features of an architecture like yours, but it is not likely to be available to anyone
else in practice (unless perhaps it can be implemented as an extension
for RISC-V).
Even with a
single core system you can have pre-emptive multi-threading, or at least >> interrupt routines that may need to cooperate with other tasks on data.
and I don't think that C with just volatile gives you such guarantees. >>>>
- anton
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:Such as ????
"volatile" /does/ provide guarantees - it just doesn't provide enough >> >>> guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that >> >>> affects the hardware. So volatile writes are ordered at the C level, >> >>> but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency". >> >> If hardware guaranteed sequential consistency, volatile would provide >> >> guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems), >> >
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
In my 40 years of SMP OS/HV work, I don't recall a
situation where 'MoveElement' would be useful or
required as a hardware atomic operation.
Individual atomic "Remove Element" and "Insert/Append Element"[*], yes. Combined? Too inflexible.
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.
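C11 provides fetch_add/fetch_sub directly; atomic min/max are not in the standard, but as a hedged sketch they can be built from a CAS loop:

#include <stdatomic.h>
#include <stdint.h>

/* Atomic max built from compare-and-swap, since C11 has no fetch_max.
   A generic sketch, not tuned for any particular ISA. */
static inline void atomic_fetch_max_u64(_Atomic uint64_t *p, uint64_t v) {
    uint64_t cur = atomic_load_explicit(p, memory_order_relaxed);
    while (cur < v &&
           !atomic_compare_exchange_weak_explicit(
               p, &cur, v, memory_order_relaxed, memory_order_relaxed)) {
        /* cur was reloaded by the failed CAS; retry until *p >= v */
    }
}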
scott@slp53.sl.home (Scott Lurndal) posted:
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
In my 40 years of SMP OS/HV work, I don't recall a
situation where 'MoveElement' would be useful or
required as an hardware atomic operation.
The question is not would "MoveElement" be useful, but
would it be useful to have a single ATOMIC event be
able to manipulate {5,6,7,8} pointers in one event ??
Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
Combined? Too inflexible.
BOOLEAN InsertElement( Element *el, Element *to )
{
tn = esmLOCKload( to->next );
esmLOCKprefetch( el );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
el->next = tn;
el->prev = to;
to->next = el;
esmLOCKstore( tn->prev, el );
return TRUE;
}
return FALSE;
}
BOOLEAN RemoveElement( Element *fr )
{
fn = esmLOCKload( fr->next );
fp = esmLOCKload( fr->prev );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
fr->prev = NULL;
esmLOCKstore( fr->next, NULL );
return TRUE;
}
return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Yes, you can add special instructions. However, the compilers will be
unlikely to generate them, thus applications that desired the generation
of such an instruction would need to create a compiler extension (like
gcc __builtin functions) or inline assembler which would then make the
program that uses the capability both compiler specific _and_ hardware
specific.
Most extant SMP processors provide a compare and swap operation, which
are widely supported by the common compilers that support the C and C++ threading functionality.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
scott@slp53.sl.home (Scott Lurndal) posted:
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
In my 40 years of SMP OS/HV work, I don't recall a
situation where 'MoveElement' would be useful or
required as an hardware atomic operation.
The question is not would "MoveElement" be useful, but
would it be useful to have a single ATOMIC event be
able to manipulate {5,6,7,8} pointers in one event ??
Nothing comes immediately to mind.
Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
Combined? Too inflexible.
BOOLEAN InsertElement( Element *el, Element *to )
{
tn = esmLOCKload( to->next );
esmLOCKprefetch( el );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
el->next = tn;
el->prev = to;
to->next = el;
esmLOCKstore( tn->prev, el );
return TRUE;
}
return FALSE;
}
BOOLEAN RemoveElement( Element *fr )
{
fn = esmLOCKload( fr->next );
fp = esmLOCKload( fr->prev );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
fr->prev = NULL;
esmLOCKstore( fr->next, NULL );
return TRUE;
}
return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Yes, you can add special instructions. However, the compilers will be unlikely
to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
or inline assembler which would then make the program that uses the capability both compiler
specific _and_ hardware specific.
Most extant SMP processors provide a compare and swap operation, which
are widely supported by the common compilers that support the C and C++ threading functionality.
Robert Finch <robfi680@gmail.com> posted:
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count. Removing the bypassing of r0 from the register file shaved 1000
LUTs off the design. This is no real loss as most instructions can
substitute small constants for register values.
Often the use of R0 as an operand causes the calculation to be degenerate. That is, R0 is not needed at all.
ADD R9,R7,R0 // is a MOV instruction
AND R9,R7,R0 // is a CLR instruction
So, you don't have to treat R0 in bypassing, but as Operand processing.
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of
places. Otherwise r0 can be used as an ordinary register. Load / store
instructions cannot use r0 as a GPR then, but it works for the PowerPC.
AGEN Rbase ==R0 implies Rbase = IP
AGEN Rindex==R0 implies Rindex = 0
I hit this trying to decide where to bypass another register code to
represent the instruction pointer. In that case I think it may be better
to go RISCV style and just add an instruction to add the IP to a
constant and place it in a register. The alternative might be to
sacrifice a bit of displacement to indicate IP relative addressing.
Anyone got a summary of bypassing r0 in different architectures?
These are some of the reasons I went with
a) universal constants
b) R0 is just another GPR
So, R0, gets forwarded just as often (or lack thereof) as any joe-random register.
On 2025-12-06 12:29 p.m., MitchAlsup wrote:
We dont want no degenerating instructions.
Robert Finch <robfi680@gmail.com> posted:
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count. Removing the bypassing of r0 from the register file shaved 1000
LUTs off the design. This is no real loss as most instructions can
substitute small constants for register values.
Often the use of R0 as an operand causes the calculation to be
degenerate.
That is, R0 is not needed at all.
      ADD   R9,R7,R0        // is a MOV instruction
      AND   R9,R7,R0        // is a CLR instruction
So, you don't have to treat R0 in bypassing, but as Operand processing.
Qupls now follows a similar paradigm.
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of
places. Otherwise r0 can be used as an ordinary register. Load / store
instructions cannot use r0 as a GPR then, but it works for the PowerPC.
AGEN Rbase ==R0 implies Rbase  = IP
AGEN Rindex==R0 implies Rindex = 0
Rbase  = r0  bypasses to 0
Rindex = r0  bypasses to 0
Rbase  = r31 bypasses to IP
Bypassing r0 for both base and index allows absolute addressing mode. Otherwise r0, r31 are general-purpose regs.
I hit this trying to decide where to bypass another register code to
represent the instruction pointer. In that case I think it may be better >>> to go RISCV style and just add an instruction to add the IP to a
constant and place it in a register. The alternative might be to
sacrifice a bit of displacement to indicate IP relative addressing.
Anyone got a summary of bypassing r0 in different architectures?
These are some of the reasons I went with
a) universal constants
b) R0 is just another GPR
So, R0, gets forwarded just as often (or lack thereof) as any joe-random
register.
Qupls has IP offset constant loading.
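A small C sketch of the agen bypass rules above, under the stated convention that r0 as base or index reads as zero and r31 as base reads as the IP (register count, field widths and the scale term are illustrative assumptions); a zero base plus a displacement then gives absolute addressing:

#include <stdint.h>

static uint64_t agen(const uint64_t gpr[32], uint64_t ip,
                     unsigned rbase, unsigned rindex,
                     unsigned scale, int64_t disp)
{
    uint64_t base  = (rbase == 0)  ? 0  :
                     (rbase == 31) ? ip : gpr[rbase];
    uint64_t index = (rindex == 0) ? 0  : gpr[rindex];
    return base + (index << scale) + (uint64_t)disp;
}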
On 2025-12-06 6:33 p.m., Robert Finch wrote:
On 2025-12-06 12:29 p.m., MitchAlsup wrote:
We dont want no degenerating instructions.
Robert Finch <robfi680@gmail.com> posted:
Tradeoffs bypassing r0 causing more ISA tweaks.
It is expensive to bypass r0. To truly bypass it, it needs to be
bypassed in a couple of dozen places which really drives up the LUT
count. Removing the bypassing of r0 from the register file shaved 1000 >>> LUTs off the design. This is no real loss as most instructions can
substitute small constants for register values.
Often the use of R0 as an operand causes the calculation to be
degenerate.
That is, R0 is not needed at all.
      ADD   R9,R7,R0        // is a MOV instruction
      AND   R9,R7,R0        // is a CLR instruction
So, you don't have to treat R0 in bypassing, but as Operand processing.
Qupls now follows a similar paradigm.
Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
to zero only in the agen units. So, the bypass is only in a couple of
places. Otherwise r0 can be used as an ordinary register. Load / store
instructions cannot use r0 as a GPR then, but it works for the PowerPC.
AGEN Rbase ==R0 implies Rbase  = IP
AGEN Rindex==R0 implies Rindex = 0
Rbase  = r0  bypasses to 0
Rindex = r0  bypasses to 0
Rbase  = r31 bypasses to IP
Bypassing r0 for both base and index allows absolute addressing mode. Otherwise r0, r31 are general-purpose regs.
I hit this trying to decide where to bypass another register code to
represent the instruction pointer. In that case I think it may be better >>> to go RISCV style and just add an instruction to add the IP to a
constant and place it in a register. The alternative might be to
sacrifice a bit of displacement to indicate IP relative addressing.
Anyone got a summary of bypassing r0 in different architectures?
These are some of the reasons I went with
a) universal constants
b) R0 is just another GPR
So, R0, gets forwarded just as often (or lack thereof) as any joe-random >> register.
Qupls has IP offset constant loading.
No sooner than having updated the spec, I added two more opcodes to
perform loads and stores using IP relative addressing. That way, no need
to use r31, leaving 31 registers completely general purpose. I am
wanting to cast some aspects of the ISA in stone, or it will never get anywhere.
Yes, you can add special instructions. However, the compilers will be unlikely
to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
or inline assembler which would then make the program that uses the capability both compiler
specific _and_ hardware specific.
Most extant SMP processors provide a compare and swap operation, which
are widely supported by the common compilers that support the C and C++ threading functionality.
Scott Lurndal <scott@slp53.sl.home> schrieb:
Yes, you can add special instructions. However, the compilers
will be unlikely to generate them, thus applications that desired
the generation of such an instruction would need to create a
compiler extension (like gcc __builtin functions) or inline
assembler which would then make the program that uses the
capability both compiler specific _and_ hardware specific.
This would likely be hidden in a header, and need only be
written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for microarchitecture-specific features, if the need for performance
gain is large enough.
A primary example is Intel TSX, which is (was?) required by SAP.
POWER also had a transactional memory feature, but they messed it
up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
only other architecture certified to run SAP, so it seems they
can do without.
Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
at least enough to write a spec for it.
Most extant SMP processors provide a compare and swap operation,
which are widely supported by the common compilers that support the
C and C++ threading functionality.
It seems there is a market for going beyond compare and swap.
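As a concrete example of the "written once, hidden in a header" point, a hedged sketch of such a wrapper using the gcc/clang __atomic builtins (the wrapper names are made up; the builtins themselves are real):

/* my_atomics.h -- thin portability wrapper, written once and hidden in a
   header. */
#ifndef MY_ATOMICS_H
#define MY_ATOMICS_H

#include <stdbool.h>
#include <stdint.h>

static inline bool my_cas64(uint64_t *p, uint64_t *expected, uint64_t desired)
{
    return __atomic_compare_exchange_n(p, expected, desired,
                                       /*weak=*/false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

static inline uint64_t my_fetch_add64(uint64_t *p, uint64_t v)
{
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}

#endif /* MY_ATOMICS_H */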
scott@slp53.sl.home (Scott Lurndal) posted:
Yes, you can add special instructions. However, the compilers will be unlikely
to generate them, thus applications that desired the generation of such an >> instruction would need to create a compiler extension (like gcc __builtin functions)
or inline assembler which would then make the program that uses the capability both compiler
specific _and_ hardware specific.
So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than existent ?!?
Scott Lurndal <scott@slp53.sl.home> schrieb:
Yes, you can add special instructions. However, the compilers will be unlikely
to generate them, thus applications that desired the generation of such an >> instruction would need to create a compiler extension (like gcc __builtin functions)
or inline assembler which would then make the program that uses the capability both compiler
specific _and_ hardware specific.
This would likely be hidden in a header, and need only be
written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for
microarchitecture-specific features, if the need for performance
gain is large enough.
A primary example is Intel TSX, which is (was?) required by SAP.
POWER also had a transactional memory feature, but they messed it
up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
only other architecture certified to run SAP, so it seems they
can do without.
Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
at least enough to write a spec for it.
On Sun, 7 Dec 2025 09:30:50 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Yes, you can add special instructions. However, the compilers
will be unlikely to generate them, thus applications that desired
the generation of such an instruction would need to create a
compiler extension (like gcc __builtin functions) or inline
assembler which would then make the program that uses the
capability both compiler specific _and_ hardware specific.
This would likely be hidden in a header, and need only be
written once (although gcc and clang, for example, are compatible
in this respect). And people have been doing this, even for
microarchitecture specific features, if the need for performance
gain is large enough.
A primary example is Intel TSX, which is (was?) required by SAP.
By SAP HANA, I assume.
Not sure for how long it was true. It sounds very unlikely that it is
still true.
POWER also had a transactional memory feature, but they messed it
up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
only other architecture certified to run SAP, so it seems they
can do without.
Googling around, I also find the "Transactional Memory Extension"
for ARM dated 2022, so ARM also appears to see some value in that,
at least enough to write a spec for it.
Most extant SMP processors provide a compare and swap operation,
which are widely supported by the common compilers that support the
C and C++ threading functionality.
It seems there is a market for going beyond compare and swap.
TSX is close to dead.
ARM's TME was announced almost 5 years ago. AFAIK, there were no implementations. Recently ARM said that FEAT_TME is obsoleted. It sounds
like the whole thing is dead, but there is a small chance that I am misinterpreting.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
scott@slp53.sl.home (Scott Lurndal) posted:
In my 40 years of SMP OS/HV work, I don't recall a
situation where 'MoveElement' would be useful or
required as a hardware atomic operation.
The question is not would "MoveElement" be useful, but
would it be useful to have a single ATOMIC event be
able to manipulate {5,6,7,8} pointers in one event ??
Nothing comes immediately to mind.
Where does sequential consistency simplify programming over the x86 model
of "TCO + globally ordered synchronization primitives +
every synchronization primitive has implied barriers"?
More so, where does it simplify over ARMv8.1-A, assuming that the programmer
does not try to be too smart and never uses LL/SC and always uses
8.1-style synchronization instructions with Acquire+Release flags set?
IMHO, the only simple thing about sequential consistency is its simple
description. Other than that, it simplifies very little. It does not
magically make lockless multithreaded programming bearable to
non-genius coders.
On 12/5/2025 11:10 AM, David Brown wrote:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems.
Basically,
it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware.-a So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
It's strange with double-word compare and swap (DWCAS), where the words
are contiguous: I have seen compilers say it's not lock-free even
on x86. For a 32-bit system we have cmpxchg8b, for a 64-bit system
cmpxchg16b, yet the compiler reports not lock-free. Strange.
using cmpxchg instead of xadd: https://forum.pellesc.de/index.php?topic=7167.0
trying to tell me that a DWCAS is not lock free: https://forum.pellesc.de/index.php?topic=7311.msg27764#msg27764
This should be lock-free on an x86, even x64:
struct ct_proxy_dwcas
{
    struct ct_proxy_node* node;
    intptr_t count;
};
some of my older code:
AC_SYS_APIEXPORT
int AC_CDECL
np_ac_i686_atomic_dwcas_fence
( void*,
  void*,
  const void* );
np_ac_i686_atomic_dwcas_fence PROC
  push esi
  push ebx
  mov esi, [esp + 16]            ; esi = comparand pointer (2nd arg)
  mov eax, [esi]                 ; edx:eax = expected value
  mov edx, [esi + 4]
  mov esi, [esp + 20]            ; esi = exchange pointer (3rd arg)
  mov ebx, [esi]                 ; ecx:ebx = new value
  mov ecx, [esi + 4]
  mov esi, [esp + 12]            ; esi = destination pointer (1st arg)
  lock cmpxchg8b qword ptr [esi] ; 64-bit DWCAS on the destination
  jne np_ac_i686_atomic_dwcas_fence_fail
  xor eax, eax                   ; success: return 0
  pop ebx
  pop esi
  ret
np_ac_i686_atomic_dwcas_fence_fail:
  mov esi, [esp + 16]
  mov [esi + 0], eax             ; failure: write back the observed value
  mov [esi + 4], edx
  mov eax, 1                     ; return 1
  pop ebx
  pop esi
  ret
np_ac_i686_atomic_dwcas_fence ENDP
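For comparison, a hedged modern-compiler version of the same DWCAS using the gcc/clang __atomic builtins on a 16-byte object; whether this lowers to an inline lock cmpxchg16b or to a libatomic call depends on the compiler version and on -mcx16 (the function name dwcas16 is made up):

#include <stdbool.h>

static bool dwcas16(volatile unsigned __int128 *target,
                    unsigned __int128 *expected,
                    unsigned __int128 desired)
{
    /* 16-byte compare-and-swap; *target must be 16-byte aligned */
    return __atomic_compare_exchange_n(target, expected, desired,
                                       /*weak=*/false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}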
Even with a single core system you can have pre-emptive multi-
threading, or at least interrupt routines that may need to cooperate
with other tasks on data.
and I don't think that C with just volatile gives you such guarantees. >>>>
- anton
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/5/2025 12:54 PM, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:Such as ????
"volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.
However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Any issues with live lock in here?
A bit hard to tell because of 2 things::
a) I carry around the thread priority and when interference occurs,
the higher priority thread wins--on ties, the already-accessing thread wins.
b) live-lock is resolved or not by the caller to these routines, not
these routines themselves.
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
MM      can MOV   up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
Scott Lurndal <scott@slp53.sl.home> schrieb:
Yes, you can add special instructions. However, the compilers will be unlikely
to generate them, thus applications that desired the generation of such an >> instruction would need to create a compiler extension (like gcc __builtin functions)
or inline assembler which would then make the program that uses the capability both compiler
specific _and_ hardware specific.
Most extant SMP processors provide a compare and swap operation, which
are widely supported by the common compilers that support the C and C++
threading functionality.
Interestingly, Linux restartable sequences allow for acquisition of
a lock with no membarrier or atomic instruction on the fast path,
at the cost of a syscall on the slow path (no free lunch...)
But you also need assembler to do it.
An example is, for example, at https://gitlab.ethz.ch/extra_projects/cpu-local-lock
scott@slp53.sl.home (Scott Lurndal) posted:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:Such as ????
"volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.
However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
In my 40 years of SMP OS/HV work, I don't recall a
situation where 'MoveElement' would be useful or
required as an hardware atomic operation.
The question is not would "MoveElement" be useful, but
would it be useful to have a single ATOMIC event be
able to manipulate {5,6,7,8} pointers in one event ??
Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
Combined? Too inflexible.
BOOLEAN InsertElement( Element *el, Element *to )
{
tn = esmLOCKload( to->next );
esmLOCKprefetch( el );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
el->next = tn;
el->prev = to;
to->next = el;
esmLOCKstore( tn->prev, el );
return TRUE;
}
return FALSE;
}
BOOLEAN RemoveElement( Element *fr )
{
fn = esmLOCKload( fr->next );
fp = esmLOCKload( fr->prev );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
fr->prev = NULL;
esmLOCKstore( fr->next, NULL );
return TRUE;
}
return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Atomic add/sub are useful. The other atomic math operations (min, max, etc) >> may be useful in certain cases as well.
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide enough
guarantees for multi-threaded coding on multi-core systems. Basically,
it only works at the C abstract machine level - it does nothing that
affects the hardware. So volatile writes are ordered at the C level,
but that says nothing about how they might progress through storage
queues, caches, inter-processor communication buses, or whatever.
You describe in many words and not really to the point what can be
explained concisely as: "volatile says nothing about memory ordering
on hardware with weaker memory ordering than sequential consistency".
If hardware guaranteed sequential consistency, volatile would provide
guarantees that are as good on multi-core machines as on single-core
machines.
However, for concurrent manipulations of data structures, one wants
atomic operations beyond load and store (even on single-core systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.
There is no bus!
The esmLOCKload causes the <translated> address to be 'monitored'
for interference, and to announce participation in the ATOMIC event.
The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
AND sets up a default control point (This instruction itself) so that
if interference is detected at esmLOCKstore control is transferred to
that control point.
So, there is no way to write Test-and-Set !! You get Test-and-Test-and-Set for free.
There is a branch-on-interference instruction that
a) does what it says,
b) sets up an alternate atomic control point.
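For reference, the conventional test-and-test-and-set idiom that is being obtained "for free" here looks roughly like this in portable C11 (a generic spinlock sketch, nothing to do with the My 66000 encoding):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_taken;

/* Spin on a plain load first; only attempt the expensive atomic RMW
   when the lock looks free. */
static void spin_lock(void)
{
    for (;;) {
        while (atomic_load_explicit(&lock_taken, memory_order_relaxed))
            ;   /* test: read-only spin, no RMW traffic */
        if (!atomic_exchange_explicit(&lock_taken, true,
                                      memory_order_acquire))
            return;   /* test-and-set succeeded */
    }
}

static void spin_unlock(void)
{
    atomic_store_explicit(&lock_taken, false, memory_order_release);
}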
It is not easy to have atomic or lock mechanisms on multi-core systems
that are convenient to use, efficient even in the worst cases, and don't
require additional hardware.
I am using the "Miss Buffer" as the point of monitoring for interference.
a) it already has to monitor "other hits" from outside accesses to deal
with the coherence mechanism.
b) the esm additions to the Miss Buffer are on the order of 2%
c) there are other means to strengthen guarantees of forward progress.
Compare Double, Swap Double::
BOOLEAN DCAS( type oldp, type_t oldq,
type *p, type_t *q,
type newp, type newq )
{
type t = esmLOCKload( *p );
type r = esmLOCKload( *q );
if( t == oldp && r == oldq )
{
*p = newp;
esmLOCKstore( *q, newq );
return TRUE;
}
return FALSE;
}
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Of course. It is interesting to speculate about possible features of an
architecture like yours, but it is not likely to be available to anyone
else in practice (unless perhaps it can be implemented as an extension
for RISC-V).
Even with a
single core system you can have pre-emptive multi-threading, or at least
interrupt routines that may need to cooperate with other tasks on data.
and I don't think that C with just volatile gives you such guarantees. >>>>>>
- anton
On 12/6/2025 5:42 AM, David Brown wrote:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provideYou describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.
enough
guarantees for multi-threaded coding on multi-core systems.
Basically,
it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware.-a So volatile writes are ordered at the C >>>>>>> level,
but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core
systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes
bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
MM      can MOV   up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far
as I can see, you need the first "esmLOCKload" to lock the bus and
also lock the core from any kind of interrupt or other pre-emption,
lasting until the esmLOCKstore instruction. Or am I missing something
here?
Lock the BUS? Only when shit hits the fan. What about locking the cache line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?
BOOLEAN RemoveElement( Element *fr )
{
fn = esmLOCKload( fr->next );
fp = esmLOCKload( fr->prev );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
fr->prev = NULL;
esmLOCKstore( fr->next, NULL );
return TRUE;
}
return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Yes, you can add special instructions. However, the compilers will be unlikely
to generate them, thus applications that desired the generation of such an >> instruction would need to create a compiler extension (like gcc __builtin functions)
or inline assembler which would then make the program that uses the capability both compiler
specific _and_ hardware specific.
So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than existent ?!?
Most extant SMP processors provide a compare and swap operation, which
are widely supported by the common compilers that support the C and C++
threading functionality.
On 08/12/2025 00:17, Chris M. Thomasson wrote:
On 12/6/2025 5:42 AM, David Brown wrote:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:
"volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>> enoughYou describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential
guarantees for multi-threaded coding on multi-core systems.
Basically,
it only works at the C abstract machine level - it does nothing >>>>>>>> that
affects the hardware.-a So volatile writes are ordered at the C >>>>>>>> level,
but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
consistency".
If hardware guaranteed sequential consistency, volatile would
provide
guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.
However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core
systems),
Such as ????
Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>> bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
MM      can MOV   up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far
as I can see, you need the first "esmLOCKload" to lock the bus and
also lock the core from any kind of interrupt or other pre-emption,
lasting until the esmLOCKstore instruction. Or am I missing
something here?
Lock the BUS? Only when shit hits the fan. What about locking the
cache line? Actually, I think we can "force" an x86/x64 to lock the
bus if we do a LOCK'ed RMW on memory that straddles cache lines?
Yes, I meant "lock the bus" - but I might have been overcautious.
However, it seems there is a hidden hardware loop here - the
esmLOCKstore instruction can fail and the processor jumps back to
the first esmLOCKload instruction. With that, you don't need to block
other code from running or accessing the bus.
<snip>
BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Yes, you can add special instructions. However, the compilers will be unlikely
to generate them, thus applications that desired the generation of such an
instruction would need to create a compiler extension (like gcc __builtin
functions) or inline assembler, which would then make the program that uses
the capability both compiler specific _and_ hardware specific.
So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than existent ?!?
Most extant SMP processors provide a compare and swap operation, which
are widely supported by the common compilers that support the C and C++
threading functionality.
I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.
I am assuming the esmLockStore() just unlocks what was previously locked
and the stores have already happened by that time.
On 12/8/2025 4:25 AM, Robert Finch wrote:
<snip>
I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.
I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.
On 12/6/2025 5:42 AM, David Brown wrote:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
<snip>
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
Lock the BUS? Only when shit hits the fan. What about locking the cache line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?
[...]
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
<snip>
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?
In the My 66000 case, Mem References can lock up to 8 cache lines.
On 06/12/2025 18:44, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
<snip>
My 66000 ISA can::
LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
MM can MOV up to 8192 bytes as a single ATOMIC instruction.
The functions below rely on more than that - to make them work, as far as I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.
That's what I assumed.
Certainly there are situations where it can be helpful to have longer
atomic reads and writes. I am not so sure about allowing 8 KB atomic accesses, especially in a system with multiple cores - that sounds like letting user programs DoS everything else on the system.
These is no bus!
I think there's a typo or some missing words there?
The esmLOCKload causes the <translated> address to be 'monitored'
for interference, and to announce participation in the ATOMIC event.
The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
AND sets up a default control point (This instruction itself) so that
if interference is detected at esmLOCKstore control is transferred to
that control point.
So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set for free.
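For anyone who has not met the distinction, a minimal sketch of the two spin-lock styles in portable C11 (the function names are mine, purely illustrative). The point, as I read it, is that esm gives the read-before-write behaviour of the second form without it having to be coded.

#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-Set: every attempt is a write, so waiters keep stealing the
   cache line even while the lock is held.                               */
static void tas_lock(atomic_flag *lk)
{
    while (atomic_flag_test_and_set_explicit(lk, memory_order_acquire))
        ;                                   /* spin, hammering the line */
}

/* Test-and-Test-and-Set: spin on plain loads and only attempt the write
   when the lock looks free, so the line stays shared while waiting.     */
static void ttas_lock(atomic_bool *lk)
{
    for (;;) {
        while (atomic_load_explicit(lk, memory_order_relaxed))
            ;                               /* read-only spin */
        if (!atomic_exchange_explicit(lk, true, memory_order_acquire))
            return;                         /* won the lock */
    }
}

static void ttas_unlock(atomic_bool *lk)
{
    atomic_store_explicit(lk, false, memory_order_release);
}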
If I understand you correctly here, you basically have a "load-reserve / store-conditional" sequence as commonly found in RISC architectures, but
you have the associated loop built into the hardware?
I can see that potentially improving efficiency, but I also find it very difficult to
read or write C code that has hidden loops. And I worry about how it
would all work if another thread on the same core or a different core
was running similar code in the middle of these sequences. It also
reduces the flexibility - in some use-cases, you want to have software limits on the number of attempts of a lr/sc loop to detect serious synchronisation problems.
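For comparison, a hedged sketch of the explicit, software-visible version of such a loop on a conventional CAS or LL/SC machine, written with C11 atomics; the bound and the names are mine, only to illustrate the "software limit on the number of attempts" point.

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_TRIES 1000          /* arbitrary illustrative bound */

/* Atomically add 'delta' to *ctr, giving up after MAX_TRIES conflicts so
   the caller can log the failure or escalate.                            */
static bool bounded_fetch_add(atomic_long *ctr, long delta)
{
    long expected = atomic_load_explicit(ctr, memory_order_relaxed);
    for (int tries = 0; tries < MAX_TRIES; tries++) {
        /* compare_exchange_weak is one LR/SC (or CAS) attempt; on failure
           it refreshes 'expected' with the value currently in memory.     */
        if (atomic_compare_exchange_weak_explicit(ctr, &expected,
                                                  expected + delta,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
            return true;
    }
    return false;               /* persistent interference - let SW decide */
}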
There is a branch-on-interference instruction that
a) does what it says,
b) sets up an alternate atomic control point.
It is not easy to have atomic or lock mechanisms on multi-core systems
that are convenient to use, efficient even in the worst cases, and don't
require additional hardware.
I am using the "Miss Buffer" as the point of monitoring for interference. a) it already has to monitor "other hits" from outside accesses to deal
with the coherence mechanism.
b) that esm additions to Miss Buffer are on the order of 2%
c) there are other means to strengthen guarantees of forward progress.
Compare Double, Swap Double::
BOOLEAN DCAS( type   oldp, type_t  oldq,
              type   *p,   type_t  *q,
              type   newp, type_t  newq )
{
    type   t = esmLOCKload( *p );
    type_t r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCKstore( *q, newq );
        return TRUE;
    }
    return FALSE;
}
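A hedged usage sketch (assuming, purely for illustration, that type and type_t are word-sized integers such as uintptr_t; the node and field names are mine): the classic customer for DCAS is updating a pointer together with a version count in one shot, so that ABA reuse is caught.

#include <stdint.h>
#include <stddef.h>

typedef struct Node Node;
struct Node { Node *next; /* payload ... */ };

typedef uintptr_t type;                 /* illustrative stand-ins */
typedef uintptr_t type_t;

typedef struct {
    type   head;                        /* Node* stored as a word   */
    type_t ver;                         /* bumped on every update   */
} Top;

Node *Pop( Top *top )
{
    for( ;; )
    {
        type   h = top->head;           /* plain reads; DCAS re-checks them */
        type_t v = top->ver;
        if( h == 0 )
            return NULL;
        Node *n = (Node *)h;
        /* both words must still hold what we read for the swap to land */
        if( DCAS( h, v, &top->head, &top->ver, (type)n->next, v + 1 ) )
            return n;
    }
}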
Move Element from one place to another:
BOOLEAN MoveElement( Element *fr, Element *to )
{
Element *fn = esmLOCKload( fr->next );
Element *fp = esmLOCKload( fr->prev );
Element *tn = esmLOCKload( to->next );
esmLOCKprefetch( fn );
esmLOCKprefetch( fp );
esmLOCKprefetch( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCKstore( fr->next, tn );
return TRUE;
}
return FALSE;
}
So, I guess, you are not talking about what My 66000 cannot do, but
only what other ISAs cannot do.
Of course. It is interesting to speculate about possible features of an
architecture like yours, but it is not likely to be available to anyone
else in practice (unless perhaps it can be implemented as an extension
for RISC-V).
Even with a single core system you can have pre-emptive multi-threading,
or at least interrupt routines that may need to cooperate with other
tasks on data.
and I don't think that C with just volatile gives you such guarantees.
- anton
<snip>
BOOLEAN RemoveElement( Element *fr )
{
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        esmLOCKstore( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}
[*] For which atomic compare-and-swap or atomic swap is generally sufficient.
Yes, you can add special instructions. However, the compilers will be unlikely
to generate them, thus applications that desired the generation of such an
instruction would need to create a compiler extension (like gcc __builtin
functions) or inline assembler, which would then make the program that uses
the capability both compiler specific _and_ hardware specific.
So, in other words, if you can't put it in every ISA known to man,
don't bother making something better than existent ?!?
Most extant SMP processors provide a compare and swap operation, which
are widely supported by the common compilers that support the C and C++
threading functionality.
I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.
I am assuming the esmLockStore() just unlocks what was previously locked
and the stores have already happened by that time.
It would seem that esmINTERFERENCE() would indicate that everybody with access out to the coherence point has agreed to the locked area? Does
that require that all devices respect the esmINTERFERENCE()?
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 12/8/2025 4:25 AM, Robert Finch wrote:
<snip>
I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.
I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.
Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined range surrounding the target address and the store will fail if any other agent has modified any byte within the exclusive range.
esmINTERFERENCE seems to require multiple of these exclusive blocks
to cover non-contiguous address ranges, which on first blush leads
me to worry both about deadlock situations and starvation issues.
ERROR "unexpected byte sequence starting at index 736: '\xC2'" while decoding:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/6/2025 5:42 AM, David Brown wrote:
<snip>
The functions below rely on more than that - to make them work, as far as
I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction. Or am I missing something here?
In the My 66000 case, Mem References can lock up to 8 cache lines.
What if two processors have intersecting (but not fully overlapping)
sets of those 8 cache lines?
Can you guarantee forward progress?
On 12/8/2025 4:25 AM, Robert Finch wrote:
<snip>
I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes
as a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including
across buffers and bus bridges. It would have to go to the memory
coherence point. Otherwise, some other device using a bridge could
update the same address range in the middle of an update.
I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.
I am assuming the esmLockStore() just unlocks what was previously
locked and the stores have already happened by that time.
There is no "locking" in the sense of preventing any accesses.
On 08/12/2025 17:23, Stephen Fuld wrote:
On 12/8/2025 4:25 AM, Robert Finch wrote:
<snip>
I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes
as a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including
across buffers and bus bridges. It would have to go to the memory
coherence point. Otherwise, some other device using a bridge could
update the same address range in the middle of an update.
I may be wrong about this, but I think you have a misconception. The
ESM doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM protected code.
Yes, that is correct (as far as I understand it now). The critical part
is the hidden hardware loop that was not mentioned or indicated in the original code.
There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)
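To make the contrast concrete, a small sketch in portable C (the names are mine): the first style excludes everyone else for the duration of the update, the second detects that someone interfered and retries - and that retry loop is the part which, as discussed here, esm keeps in hardware.

#include <stdatomic.h>
#include <pthread.h>

/* Style 1: exclusion - nothing else can interleave with the update. */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long counter_locked;

void add_locked(long d)
{
    pthread_mutex_lock(&m);
    counter_locked += d;
    pthread_mutex_unlock(&m);
}

/* Style 2: optimistic - detect the conflict and try again. */
static atomic_long counter_lockfree;

void add_lockfree(long d)
{
    long old = atomic_load_explicit(&counter_lockfree, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(&counter_lockfree, &old,
                                                  old + d,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
        ;   /* 'old' was refreshed; retry with the new value */
}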
The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in hardware, but there are no benefits in hiding it from the programmer!)
David Brown <david.brown@hesbynett.no> posted:
There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)
The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)
How exactly do you inform the programmer that:
InBound [Address]
OutBound [Address]
operates like::
try_again:
InBound [Address]
BIN try_again
OutBound [Address]
And why clutter up asm with extraneous labels and require extra instructions.
On 09/12/2025 20:15, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
There are basically two ways to handle atomic operations. One way is to >> use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the >> situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)
The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a >> hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)
How exactly do you inform the programmer that:
InBound [Address]
OutBound [Address]
operates like::
try_again:
InBound [Address]
BIN try_again
OutBound [Address]
And why clutter up asm with extraneous labels and require extra instructions.
The most obvious answer is that in any code that uses these features,
good comments are essential so that readers can see what is happening.
Another method would be to use better names for the intrinsics, as seen
at the C (or other HLL) level. (Assembly instruction names don't matter nearly as much.)
So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
of better names, but that would at least give the reader a clue that
there's something odd going on.
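With those suggested names, the earlier RemoveElement might read roughly as below - a sketch only, meant to behave exactly like the original esm sequence, just with intrinsics whose names admit that a retry point exists.

BOOLEAN RemoveElement( Element *fr )
{
    /* each load joins the ATOMIC event; the first one records the retry point */
    Element *fn = load_and_set_retry_point( fr->next );
    Element *fp = load_and_set_retry_point( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;
        fn->prev = fp;
        fr->prev = NULL;
        /* terminal store: either publishes every participating write at once,
           or control goes back to the retry point set by the first load       */
        store_or_retry( fr->next, NULL );
        return TRUE;
    }
    return FALSE;
}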
Mostly esm detects interference but there are times when esm is allowed
to ignore interference.
Consider a server scale esm implementation. In such an implementation, esm
is enhanced with a system* arbiter.
After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting
no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.
At this point the core is in "careful" mode,
core becomes sequentially
consistent, SW chooses to re-run the event. Here, cache misses leave
core in program order,... When interference is detected, the event fails
and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
If core can determine that all writes to participating memory can be performed (at the first participating store) core is allowed to NaK
lower priority interfering accesses.
On 12/9/2025 11:15 AM, MitchAlsup wrote:
snip
Mostly esm detects interference but there are times when esm is allowed
to ignore interference.
Consider a server scale esm implementation. In such an implementation, esm is enhanced with a system* arbiter.
After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.
At this point the core is in "careful" mode,
I am missing some understanding here, about this "counter". This
paragraph seems to indicate that after one failure, the core goes into "careful" mode, but if that were true, you wouldn't need a "counter",
just a mode bit. So assuming it is a counter and you need "n" failures
in a row to go into careful mode, is "n" hardwired or settable by
software? What are the tradeoffs for smaller or larger values of "n"?
core becomes sequentially
consistent, SW chooses to re-run the event. Here, cache misses leave
core in program order,... When interference is detected, the event fails and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
If core can determine that all writes to participating memory can be performed (at the first participating store) core is allowed to NaK
lower priority interfering accesses.
Again, after a single failure in careful mode or n failures? If n, is
it the same value of n as for the transition from optimistic to careful mode? Same questions as before about who sets the value and is it
software changeable?
David Brown <david.brown@hesbynett.no> posted:
On 09/12/2025 20:15, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)
The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)
How exactly do you inform the programmer that:
InBound [Address]
OutBound [Address]
operates like::
try_again:
InBound [Address]
BIN try_again
OutBound [Address]
And why clutter up asm with extraneous labels and require extra instructions.
The most obvious answer is that in any code that uses these features,
good comments are essential so that readers can see what is happening.
Another method would be to use better names for the intrinsics, as seen
at the C (or other HLL) level. (Assembly instruction names don't matter
nearly as much.)
So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
"load_and_set_retry_point()" and "store_or_retry()". Feel free to think
of better names, but that would at least give the reader a clue that
there's something odd going on.
This is a useful suggestion; thanks.
On the other hand, there are some non-vonNeumann actions lurking within
esm. Where vonNeumann means: that every instruction is executed in its entirety before the next instruction appears to start executing.
1st:: one cannot single step through an ATOMIC event, if you enter an
ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
has executed.
2nd::the only way to debug an event is to have a buffer of SW locations
that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially consistent manner (architecturally), and can be examined outside the
event; whereas the participating lines are either all written instantaneously or not modified at all.
So, here we have non-participating STs having been written and older participating STs have not.
3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
is based on the code in the event.
4th:: one cannot test esm with a random code generator, since the probability that the random code generator creates a legal esm event is exceedingly low.
On 09/12/2025 22:28, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 09/12/2025 20:15, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
There are basically two ways to handle atomic operations. One way is to
use locking mechanisms to ensure that nothing (other cores, interrupts
or other pre-emption on the same core) can break up the sequence. The
other way is to have a mechanism to detect conflicts and a failure of
the atomic operation, so that you can try again (or otherwise handle the
situation). (You can, of course, combine these - such as by disabling
local interrupts and detecting conflicts from other cores.)
The code Mitch posted apparently had neither of these mechanisms, hence
my confusion. It turns out that it /does/ have conflict detection and a
hardware retry loop, all hidden from anyone trying to understand the
code. (I can appreciate that there may be benefits in doing this in
hardware, but there are no benefits in hiding it from the programmer!)
How exactly do you inform the programmer that:
InBound [Address]
OutBound [Address]
operates like::
try_again:
InBound [Address]
BIN try_again
OutBound [Address]
And why clutter up asm with extraneous labels and require extra instructions.
The most obvious answer is that in any code that uses these features,
good comments are essential so that readers can see what is happening.
Another method would be to use better names for the intrinsics, as seen
at the C (or other HLL) level. (Assembly instruction names don't matter
nearly as much.)
So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
"load_and_set_retry_point()" and "store_or_retry()". Feel free to think >> of better names, but that would at least give the reader a clue that
there's something odd going on.
This is a useful suggestion; thanks.
I can certainly say they would help /me/ understand the code, so maybe
they would help other people understand it too.
On the other hand, there are some non-vonNeumann actions lurking within esm. Where vonNeumann means: that every instruction is executed in its entirety before the next instruction appears to start executing.
That's a rather different use of the term "vonNeumann" from anything I
have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
And are we thinking about the instructions purely from the viewpoint of
the cpu executing them?
IME, most instructions on most processors are indivisible, but most processors have some instructions that are not. For example, processors
can have load/store multiple instructions that are interruptable - in
some cases, after returning from the interrupt (and any associated
thread context switches) the instructions are restarted, in other cases
they are continued.
But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming, pipelining, speculative execution, dependency tracking, and all the rest
of it.
1st:: one cannot single step through an ATOMIC event, if you enter an ATOMIC event in single-step mode, you will see the 1st instruction in
the event, then you will receive control after the terminal instruction
has executed.
That is presumably a choice you made for the debugging features of the device.
2nd::the only way to debug an event is to have a buffer of SW locations that gets written with non-participating STs. Unlike participating
memory lines, these locations will be written--but not in a sequentially consistent manner (architecturally), and can be examined outside the
event; whereas the participating lines are either all written instantaneously or not modified at all.
So, here we have non-participating STs having been written and older participating STs have not.
3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
is based on the code in the event.
OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,
or add SW tracking of retry counts for metrics).
My main concern was
the disconnect between how the code was written and what it actually does.
4th:: one cannot test esm with a random code generator, since the probability
that the random code generator creates a legal esm event is exceedingly low.
Testing and debugging any kind of locking or atomic access solution is always very difficult. You can rarely try out conflicts or potential
race conditions in the lab - they only ever turn up at customer demos!
On 10/12/2025 21:10, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
OK. I can see the advantages of that - though there are disadvantages
too (such as being unable to control a limit on the number of retries,
or add SW tracking of retry counts for metrics).
esm attempts to allow SW to program with features previously available
only at the µCode level. µCode allows for many µinstructions to execute before/between any real instructions.
My main concern was
the disconnect between how the code was written and what it actually does.
Perhaps it would be better to think of these sequences in assembler
rather than C - you want tighter control than C normally allows, and you don't want optimisers re-arranging things too much.
There is a 26 page specification the programmer needs to read and understand.
This includes things we have not talked about--such as::
a) terminating an event without writing anything
b) proactively minimizing future interference
c) modifications to cache coherence model
at the architectural level.
Fair enough. This is not a minor or simple feature!
The architectural specification allows for various scales of µArchitecture to independently choose how to implement esm and provide the architectural features at SW level. For example the kinds of esm activities for a 1-wide In-Order µController are vastly different from those suitable for a server scale rack of processor ensembles. What we want is one SW model that covers the whole gamut.
4th:: one cannot test esm with a random code generator, since the probability
that the random code generator creates a legal esm event is exceedingly low.
Testing and debugging any kind of locking or atomic access solution is
always very difficult. You can rarely try out conflicts or potential
race conditions in the lab - they only ever turn up at customer demos!
Right at Christmas time !! {Ask me how I know}.
We can gather round the fire, and Grampa can settle in his rocking chair
to tell us war stories from the olden days :-)
A good story is always nice, so go for it!
(We once had a system where there was a bug that not only triggered only
at the customer's site, but did so only on the 30th of September. It
took years before we made the connection to the date and found the bug.)
Heck, there are assemblers that rearrange code like this too much--
until they can be taught not to.
We both made it home for Christmas, and in some part saved the
company...
On 12/11/2025 1:05 AM, David Brown wrote:
<snip>
Perhaps it would be better to think of these sequences in assembler
rather than C - you want tighter control than C normally allows, and
you don't want optimisers re-arranging things too much.
Right. Way back before C/C++ 11 I would code all of my sensitive lock/wait-free code in assembly.
[...]
On Thu, 11 Dec 2025 20:26:09 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
We both made it home for Christmas, and in some part saved the
company...
Not for long, though... Wasn't it dead anyway within 6-7 months?
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/6/2025 5:42 AM, David Brown wrote:
On 05/12/2025 21:54, MitchAlsup wrote:
David Brown <david.brown@hesbynett.no> posted:
On 05/12/2025 18:57, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
David Brown <david.brown@hesbynett.no> writes:Such as ????
"volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>>> guarantees for multi-threaded coding on multi-core systems.You describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.
Basically,
it only works at the C abstract machine level - it does nothing that >>>>>>>> affects the hardware.-a So volatile writes are ordered at the C level, >>>>>>>> but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>>
Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>> bigger than the maximum load/store size of the processor.
My 66000 ISA can::
LDM/STM can LD/ST up to 32-a-a DWs-a-a as a single ATOMIC instruction. >>>> MM-a-a-a-a-a can MOV-a-a up to 8192 bytes as a single ATOMIC instruction. >>>>
The functions below rely on more than that - to make the work, as far as >>> I can see, you need the first "esmLOCKload" to lock the bus and also
lock the core from any kind of interrupt or other pre-emption, lasting
until the esmLOCKstore instruction.-a Or am I missing something here?
Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?
In the My 66000 case, Mem References can lock up to 8 cache lines.
On 12/8/2025 12:06 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
<snip>
Lock the BUS? Only when shit hits the fan. What about locking the cache
line? Actually, I think we can "force" an x86/x64 to lock the bus if we
do a LOCK'ed RMW on memory that straddles cache lines?
In the My 66000 case, Mem References can lock up to 8 cache lines.
Pretty flexible wrt implementing those exotic things back in the day,
experimental algos that need DCAS, KCSS, etc... A heck of a lot of
things can be accomplished with DWCAS, aka cmpxchg8b on a 32-bit system,
or cmpxchg16b on a 64-bit system.
People would bend over backwards to get a DCAS, or NCAS. It would be
infested with strange indirection ala "descriptors", and involved a shit
load of atomic RMW's. CAS, DWCAS, XCHG and XADD can get a lot done.
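As a hedged illustration of that DWCAS style in GCC/Clang-flavoured C (the struct and function names are mine; the 16-byte compare-exchange is only lock-free where the target provides one, e.g. cmpxchg16b on x86-64 built with -mcx16):

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    void     *ptr;      /* e.g. head of a lock-free list      */
    uintptr_t aba;      /* version counter to defeat ABA      */
} PtrVer;               /* 16 bytes on a 64-bit target        */

/* Publish 'newp' only if nobody has touched ptr/aba since we read them. */
static bool dwcas_update(PtrVer *loc, PtrVer expected, void *newp)
{
    PtrVer desired = { newp, expected.aba + 1 };
    /* The generic __atomic_compare_exchange on a 16-byte object maps to
       cmpxchg16b on x86-64 (falling back to libatomic otherwise).        */
    return __atomic_compare_exchange(loc, &expected, &desired,
                                     false,
                                     __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}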
On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
<snip>
Have you ever read about KCSS?
https://groups.google.com/g/comp.arch/c/shshLdF1uqs
https://patents.google.com/patent/US7293143
On 12/8/2025 9:14 AM, Scott Lurndal wrote:
<snip>
Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined
range surrounding the target address and the store will fail if any other
agent has modified any byte within the exclusive range.
Any mutation to the reservation granule?
What my solution entails is a modification
to the cache coherence model (NaK) that indicates "Yes I have the line you referenced, but, no you can't have it right now" in order to strengthen
the guarantees of forward progress.
On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
<snip>
Any mutation to the reservation granule?
I forgot if a load from the reservation granule would cause a LL/SC to
fail. I know a store would. False sharing in poorly written programs
would cause it to occur. LL/SC experiencing live lock. This was back in
my PPC days.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
What my solution entails is a modification to the cache coherence model
(NaK) that indicates "Yes I have the line you referenced, but, no you
can't have it right now" in order to strengthen the guarantees of
forward progress.
How does it strengthen the guarantees of forward progress?
My guess:
If the requester itself is in an atomic sequence B, it will cancel it.
This could help if the atomic sequence A that caused the NaK then
tries to get a cache line that would be kept by B.
There is still a chance of both sequences canceling each other by
sending NaKs at the same time, but it is smaller and with something
like exponential backoff eventual forward progress could be achieved.
- anton
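For what it's worth, a minimal sketch of the kind of exponential backoff
Anton mentions (my own code, not taken from any real implementation);
production versions usually add random jitter so two contending cores do
not stay in lockstep.

/* Minimal exponential-backoff sketch: on each failed attempt, spin a
 * little longer before retrying, up to a cap.                            */
#include <stdatomic.h>

static _Atomic int lockword;            /* 0 = free, 1 = held              */

static void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();             /* PAUSE hint on x86               */
#endif
}

static void lock_with_backoff(void)
{
    unsigned delay = 1;                 /* spin count, doubled on failure  */
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lockword, &expected, 1)) {
        for (unsigned i = 0; i < delay; i++)
            cpu_relax();
        if (delay < (1u << 12))         /* cap the backoff                 */
            delay <<= 1;
        expected = 0;                   /* CAS failure overwrote it        */
    }
}

static void unlock_it(void)
{
    atomic_store(&lockword, 0);
}

int main(void)
{
    lock_with_backoff();
    unlock_it();
    return 0;
}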
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
On 12/8/2025 9:14 AM, Scott Lurndal wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 12/8/2025 4:25 AM, Robert Finch wrote:
<snip>
I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.
I may be wrong about this, but I think you have a misconception. The ESM
doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM-protected code.
Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitors an implementation-defined
range surrounding the target address, and the store will fail if any
other agent has modified any byte within the exclusive range.
Any mutation to the reservation granule?
I forgot if a load from the reservation granule would cause a LL/SC to
fail. I know a store would. False sharing in poorly written programs
would cause it to occur. LL/SC experiencing live lock. This was back in
my PPC days.
A LD to the granule would cause loss of write permission, causing a long
delay to perform SC and greatly increasing the probability of interference.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
What my solution entails is a modification to the cache coherence model
(NaK) that indicates "Yes I have the line you referenced, but, no you
can't have it right now" in order to strengthen the guarantees of
forward progress.
How does it strengthen the guarantees of forward progress?
The allowance of a NaK is only available under somewhat special circumstances::
a) in Careful mode:: when core can see that all STs have write permission
and data is present, NaKs allow the Modification part to run to
completion.
b) In Slow and Methodical mode:: core can NaK any access to any of its
cache lines--preventing interference.
My guess:
If the requester itself is in an atomic sequence B, it will cancel it.
Yes, the "other guy" takes the hit not the guy who has made more forward progress. If B was an innocent accessor of the data, it retires its request--this generally takes 100-odd cycles, allowing A to complete
the event by the time the innocent request shows up again.
This could help if the atomic sequence A that caused the NaK then
tries to get a cache line that would be kept by B.
There is still a chance of both sequences canceling each other by
sending NaKs at the same time, but it is smaller and with something
like exponential backoff eventual forward progress could be achieved.
Instead of some contrived back-off policy--at the failure point one can
read the WHY register. 0 indicates success; negative indicates spurious, positive indicates how far down the line of requestors YOU happen to be.
So, if you are going after a unit of work, you march down the queue WHY
units and then YOU are guaranteed that YOU are the only one after that
unit of work.
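As a purely hypothetical illustration of that WHY idea (the esm_why()
stub and the work-queue layout below are my inventions, not the actual
My 66000 interface): on a failed claim the requester skips ahead by WHY
units of work instead of fighting for the same one.

/* Hypothetical sketch of the WHY idea described above -- NOT the actual
 * My 66000 interface.  esm_why() stands in for reading the WHY register
 * after an attempted atomic claim: 0 = success, negative = spurious,
 * positive N = "N requestors are ahead of you".                          */
#include <stdio.h>

static int esm_why(void)            /* stub: pretend two requestors beat us */
{
    static int calls;
    return (calls++ == 0) ? 2 : 0;
}

static int claim_unit(int start, int nunits)
{
    int idx = start;
    while (idx < nunits) {
        int why = esm_why();        /* result of the attempted atomic claim */
        if (why == 0)
            return idx;             /* we own unit idx                      */
        if (why < 0)
            continue;               /* spurious: retry the same unit        */
        idx += why;                 /* march down the queue WHY units       */
    }
    return -1;                      /* queue exhausted                      */
}

int main(void)
{
    printf("claimed unit %d of 8\n", claim_unit(0, 8));   /* prints 2 */
    return 0;
}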
On 12/13/2025 11:12 AM, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
What my solution entails is a modification to the cache coherence model
(NaK) that indicates "Yes I have the line you referenced, but, no you
can't have it right now" in order to strengthen the guarantees of
forward progress.
How does it strengthen the guarantees of forward progress?
The allowance of a NaK is only available under somewhat special circumstances::
a) in Careful mode:: when core can see that all STs have write permission
and data is present, NaKs allow the Modification part to run to
completion.
b) In Slow and Methodical mode:: core can NaK any access to any of its
cache lines--preventing interference.
My guess:
If the requester itself is in an atomic sequence B, it will cancel it.
Yes, the "other guy" takes the hit not the guy who has made more forward progress. If B was an innocent accessor of the data, it retires its request--this generally takes 100-odd cycles, allowing A to complete
the event by the time the innocent request shows up again.
This could help if the atomic sequence A that caused the NaK then
tries to get a cache line that would be kept by B.
There is still a chance of both sequences canceling each other by
sending NaKs at the same time, but it is smaller and with something
like exponential backoff eventual forward progress could be achieved.
Instead of some contrived back-off policy--at the failure point one can read the WHY register. 0 indicates success; negative indicates spurious, positive indicates how far down the line of requestors YOU happen to be. So, if you are going after a unit of work, you march down the queue WHY units and then YOU are guaranteed that YOU are the only one after that
unit of work.
Step one: make sure that a failure means another thread made progress.
Strong CAS does this. Don't let it spuriously fail where nothing makes
progress... ;^o
Oh my, we got a load on the reservation granule: abort all LL/SC in
progress wrt that granule. Of course this assumes that the user who
created the program for it gets things right.
For LL/SC on the PPC it definitely helps when things are aligned and
padded up to a reservation granule, not just an L2 cache line. That
helps mitigate false sharing causing livelock.
Even with weak CAS, akin to LL/SC: well, how sensitive is that
reservation granule? Can a simple load cause a failure?
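A small layout sketch of the align-and-pad rule being discussed; the
128-byte figure below is an assumption standing in for the reservation
granule, which is implementation defined and should be tuned per target.

/* Layout sketch (my own): keep each hot atomic word alone in its own
 * "granule" so unrelated loads/stores from neighbours cannot disturb an
 * LL/SC reservation or dirty the line under a CAS.                       */
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define GRANULE 128                       /* assumed granule size, bytes   */

struct padded_counter {
    alignas(GRANULE) _Atomic long value;  /* the only hot word in the slot */
    char pad[GRANULE - sizeof(_Atomic long)];
};

static struct padded_counter counters[4]; /* one per worker, no sharing    */

int main(void)
{
    printf("sizeof one slot: %zu (stride keeps slots on separate granules)\n",
           sizeof(struct padded_counter));
    atomic_fetch_add(&counters[0].value, 1);
    return 0;
}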
On 12/13/2025 11:03 AM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
On 12/8/2025 9:14 AM, Scott Lurndal wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 12/8/2025 4:25 AM, Robert Finch wrote:
<snip>
I am having trouble understanding how the block of code in the
esmINTERFERENCE() block is protected so that the whole thing executes as
a unit. It would seem to me that the address range(s) needing to be
locked would have to be supplied throughout the system, including across
buffers and bus bridges. It would have to go to the memory coherence
point. Otherwise, some other device using a bridge could update the same
address range in the middle of an update.
I may be wrong about this, but I think you have a misconception. The ESM
doesn't *prevent* interference, but it *detects* interference. Thus
nothing is required of other cores, no locks, etc. If they write to a
"protected" location, the write is allowed, but the core in the ESM is
notified, so it can redo the ESM-protected code.
Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
(the basis of the Store-Exclusive/Load-Exclusive instructions, which
mirror the LL/SC paradigm). The ARMv8 monitors an implementation-defined
range surrounding the target address, and the store will fail if any
other agent has modified any byte within the exclusive range.
Any mutation to the reservation granule?
I forgot if a load from the reservation granule would cause a LL/SC to
fail. I know a store would. False sharing in poorly written programs
would cause it to occur. LL/SC experiencing live lock. This was back in
my PPC days.
A LD to the granule would cause loss of write permission, causing a long
delay to perform SC and greatly increasing the probability of interference.
So, you need to create a rule. If you program for my system, you MUST
make sure that everything is properly aligned and padded. Been there,
done that.
Now, think of nefarious agents... I was able to cause damage to a simple
strong CAS loop with another thread (or threads) mutating the cache line
on purpose, as a stress test... CAS would start hitting higher and
higher failure rates and, finally, hit the BUS to ensure some sort of
forward progress.
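A sketch in the spirit of that stress test (my own code, not Chris's):
one thread hammers a neighbouring field that deliberately shares the
line with the hot counter, while another runs a weak-CAS increment loop
and counts retries. On LL/SC-based machines the neighbour's stores can
kill the reservation and force retries; on x86 they mostly just slow
the loop down.

/* Build with: cc -O2 -pthread stress.c                                   */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static struct {
    _Atomic long hot;        /* CAS target                                  */
    _Atomic long neighbour;  /* deliberately unpadded: same line/granule    */
} s;

static _Atomic int stop;
static long retries;         /* written by the victim thread only           */

static void *attacker(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&stop, memory_order_relaxed))
        atomic_store_explicit(&s.neighbour, 1, memory_order_relaxed);
    return NULL;
}

static void *victim(void *arg)
{
    (void)arg;
    for (long i = 0; i < 1000000; i++) {
        long old = atomic_load(&s.hot);
        while (!atomic_compare_exchange_weak(&s.hot, &old, old + 1))
            retries++;       /* failed or spurious: go around again         */
    }
    atomic_store(&stop, 1);
    return NULL;
}

int main(void)
{
    pthread_t a, v;
    pthread_create(&a, NULL, attacker, NULL);
    pthread_create(&v, NULL, victim, NULL);
    pthread_join(v, NULL);
    pthread_join(a, NULL);
    printf("CAS retries under interference: %ld\n", retries);
    return 0;
}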
Sure. Of course multi-core systems will not have that hardware guarantee,
at least not on main memory, for performance reasons.
On Tue, 11 Nov 2025 21:34:08 -0600
BGB <cr88192@gmail.com> wrote:
Going to/from 128-bit integer adds a few "there be dragons here"
issues regarding performance.
Not really.
That is, conversions are not blazingly fast, but still much better
than any attempt to divide in any form of decimal. And it helps to
preserve your sanity.
There is also a psychological factor at play - your users expect
division and square root to be slower than other primitive FP
operations, so they are not disappointed. Possibly they are even
pleasantly surprised when they find out that the difference in
throughput between division and multiplication is smaller than the
factor of 20-30 that they were accustomed to for 'double' on their
20-year-old Intel and AMD CPUs.
Today I tested the speed of the gcc implementation of Decimal128
(BID-encoded, of course) on an Intel Core i7-14700.
Average time in nsec:
op        Add   Sub   Mul   Div
P-Core     33    33    86    76
E-Core     46    48   121   108
Counter-intuitively, division is faster than multiplication.
And both appear much slower than necessary.
I am just noticing that the actual physical register name is not needed until lookup at the reservation stations. In Qupls4 it can be a few
clock cycles before the register lookup is done. So, an incorrect one
could be supplied at the rename stage; it only has to be good enough to
work out dependencies. Would a sequence-number-based register name work
(rather than reading a FIFO)? Then it is a matter of correcting it later.
Michael S <already5chosen@yahoo.com> schrieb:
Today I tested the speed of the gcc implementation of Decimal128
(BID-encoded, of course) on an Intel Core i7-14700.
Average time in nsec:
op        Add   Sub   Mul   Div
P-Core     33    33    86    76
E-Core     46    48   121   108
Counter-intuitively, division is faster than multiplication.
And both appear much slower than necessary.
Interesting. Could you provide the benchmark used?
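Not the benchmark Michael used (that is exactly what is being asked
for), but a minimal sketch of how such a measurement could be set up
with GCC's _Decimal128 extension, whose arithmetic goes through the
software BID routines on x86-64.

/* Build with: gcc -O2 dec128.c                                           */
#include <stdio.h>
#include <time.h>

#define N 1000000

static double bench(_Decimal128 (*op)(_Decimal128, _Decimal128))
{
    _Decimal128 acc = 1.000001DL;            /* DL suffix = _Decimal128     */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        acc = op(acc, 1.000000007DL);        /* keep a data dependency      */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("(sink %d) ", acc > 0.0DL);       /* keep the loop alive         */
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / N;  /* ns per operation            */
}

static _Decimal128 dadd(_Decimal128 a, _Decimal128 b) { return a + b; }
static _Decimal128 dmul(_Decimal128 a, _Decimal128 b) { return a * b; }
static _Decimal128 ddiv(_Decimal128 a, _Decimal128 b) { return a / b; }

int main(void)
{
    printf("add: %.1f ns\n", bench(dadd));
    printf("mul: %.1f ns\n", bench(dmul));
    printf("div: %.1f ns\n", bench(ddiv));
    return 0;
}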
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted: