Should an ISA contain an instruction that gives Write-Permission
from the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}
Should an ISA contain an instruction that invalidates (without
writing back) a Data Cache (or L2) line ?? {Discard}
On 2026-May-08 19:34, MitchAlsup wrote:
Should an ISA contain an instruction that gives Write-Permission
from the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}
Do you mean changes a single line from using write-invalidate
protocol to write-update so any remote writes are forwarded
by the home directory to the current line owner?
In effect, blocks line movement but not updates.
Or something else?
Should an ISA contain an instruction that invalidates (without
writing back) a Data Cache (or L2) line ?? {Discard}
Not unprivileged or applications could un-zero fields that had
been intentionally zeroed out but still held in cache.
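The concern can be made concrete with a toy model. The sketch below is not any real ISA or cache; `toy_line`, `toy_discard`, and the other names are hypothetical and exist only for illustration. It models a single write-back line whose backing memory still holds a previous owner's data, zeroes the line in cache only, and then discards it without write-back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

/* Toy model (hypothetical): one line of backing memory plus one
 * write-back cache line that may hold a newer copy of it. */
typedef struct {
    uint8_t memory[LINE_BYTES]; /* what the outer hierarchy holds */
    uint8_t cache[LINE_BYTES];  /* the cached copy */
    bool    valid;
    bool    dirty;
} toy_line;

/* On a miss, fetch the line from backing memory. */
static void toy_fill(toy_line *l) {
    if (!l->valid) {
        memcpy(l->cache, l->memory, LINE_BYTES);
        l->valid = true;
        l->dirty = false;
    }
}

void toy_store(toy_line *l, size_t off, uint8_t v) {
    assert(off < LINE_BYTES);
    toy_fill(l);
    l->cache[off] = v;
    l->dirty = true; /* write-back cache: memory is not updated yet */
}

uint8_t toy_load(toy_line *l, size_t off) {
    assert(off < LINE_BYTES);
    toy_fill(l);
    return l->cache[off];
}

/* Assumed semantics of {Discard}: invalidate, never write back. */
void toy_discard(toy_line *l) {
    l->valid = false;
    l->dirty = false;
}

/* Old data sits in memory, the line is zeroed in cache only, then
 * discarded. Returns what the next load of byte 0 observes. */
uint8_t toy_unzero_demo(void) {
    toy_line l = {0};
    memset(l.memory, 0xAA, LINE_BYTES);  /* previous owner's data */
    for (size_t i = 0; i < LINE_BYTES; i++)
        toy_store(&l, i, 0);             /* zeroing, still cache-only */
    toy_discard(&l);                     /* the zeroes are thrown away */
    return toy_load(&l, 0);              /* refetches the old data */
}
```

In this model `toy_unzero_demo()` returns the old byte rather than zero, which is the "un-zero" effect described above.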
On 2026-05-08 7:34 p.m., MitchAlsup wrote:
Should an ISA contain an instruction that gives Write-Permission
from the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}
Trying to fathom what is going on with this. Is it an issue with keeping
the cache coherent? Sounds like the D$ cache line was write-protected
and now it is to be made writable?
Should an ISA contain an instruction that invalidates (without
writing back) a Data Cache (or L2) line ?? {Discard}
Q+ has several I$ and D$ cache operations wrapped up in a single
instruction called 'CACHE'. I thought it best to put these in one
instruction since they are infrequently used. The instruction has the
same format as a load/store but the source/dest register is replaced by
a command code. It uses the supplied address (if an address is needed).
Turn cache on/off (D$ only)
Invalidate entire cache (I$ or D$ or both)
Invalidate cache line (I$ or D$ or both)
Invalidate TLB
Invalidate TLB entry
Both the I$ and D$ caches can be invalidated with a single instruction.
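A rough way to picture the command list above is as an enumeration plus two small predicates. The names and numbering below are hypothetical and are not Q+'s actual encoding; the sketch only captures which commands take an address and which cache targets each command accepts, as described in the post.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical command codes; the real Q+ values are not given. */
typedef enum {
    CACHE_CMD_DCACHE_ENABLE,
    CACHE_CMD_DCACHE_DISABLE,
    CACHE_CMD_INVALIDATE_ALL,
    CACHE_CMD_INVALIDATE_LINE,
    CACHE_CMD_INVALIDATE_TLB,
    CACHE_CMD_INVALIDATE_TLB_ENTRY,
    CACHE_CMD_COUNT
} cache_cmd;

/* Bit mask of caches a command applies to (hypothetical layout). */
enum { TARGET_ICACHE = 1u, TARGET_DCACHE = 2u };

/* Only the per-line and per-entry commands consume the address. */
bool cache_cmd_needs_address(cache_cmd cmd) {
    assert((int)cmd >= 0 && cmd < CACHE_CMD_COUNT);
    return cmd == CACHE_CMD_INVALIDATE_LINE ||
           cmd == CACHE_CMD_INVALIDATE_TLB_ENTRY;
}

/* On/off is D$ only; invalidates take I$, D$, or both;
 * the TLB commands take no cache target at all. */
bool cache_cmd_accepts_target(cache_cmd cmd, unsigned targets) {
    assert((int)cmd >= 0 && cmd < CACHE_CMD_COUNT);
    switch (cmd) {
    case CACHE_CMD_DCACHE_ENABLE:
    case CACHE_CMD_DCACHE_DISABLE:
        return targets == TARGET_DCACHE;
    case CACHE_CMD_INVALIDATE_ALL:
    case CACHE_CMD_INVALIDATE_LINE:
        return targets != 0 &&
               (targets & ~(TARGET_ICACHE | TARGET_DCACHE)) == 0;
    default:
        return targets == 0;
    }
}
```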
Robert Finch <robfi680@gmail.com> posted:
On 2026-05-08 7:34 p.m., MitchAlsup wrote:
Should an ISA contain an instruction that gives Write-Permission
from the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}
Trying to fathom what is going on with this. Is it an issue with keeping the cache coherent? Sounds like the D$ cache line was write-protected
and now it is to be made writable?
Consider the stack, and after adding a number to SP there are now
a bunch of lines that are neither accessible nor containing a useful
value.
Should an ISA contain an instruction that invalidates (without
writing back) a Data Cache (or L2) line ?? {Discard}
Q+ has several I$ and D$ cache operations wrapped up in a single
instruction called 'CACHE'. I thought it best to put these in one
instruction since they are infrequently used. The instruction has the
same format as a load/store but the source/dest register is replaced by
a command code. It uses the supplied address (if an address is needed).
I have the same format, a memory reference that does not need a DST register specifier, so it becomes the OpCode.
Turn cache on/off (D$ only)
Why would you want the cache turned off??
Invalidate entire cache (I$ or D$ or both)
What if the cache is 1GB in size ??? This could take a long time.
Invalidate cache line (I$ or D$ or both)
Invalidate TLB
With a coherent TLB this is unnecessary.
Invalidate TLB entry
Both the I$ and D$ caches can be invalidated with a single instruction.
That may take a long time !
On 5/8/26 7:34 PM, MitchAlsup wrote:
Should an ISA contain an instruction that gives Write-Permission
from the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}
Intel added Cache Line WriteBack to memory to help with memory
persistence (IIRC), which can be viewed as a reliability
assertion (data will not be lost on power failure).
There could
also be performance reasons for pushing data outward while
retaining it locally in a clean (shared) state; a remote request
for the data might have lower latency (sourcing directly from
L3, e.g., rather than an L3 coherence directory indicating where
the data is and having to request the data from and a state
change for the owner).
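Under this reading of {allow} (push the data outward, keep a clean shared copy locally), the state change can be sketched as below. This is a toy single-line model with MESI-like state names; `wb_line` and `wb_allow` are hypothetical, and the original question may have meant something different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE, LINE_MODIFIED } line_state;

/* Toy model (hypothetical): a local copy and the outer-level copy. */
typedef struct {
    line_state state;
    int        local; /* value held in the local cache */
    int        outer; /* value held in L3 / memory     */
} wb_line;

/* Assumed {allow} semantics: write a dirty line outward, give up
 * write permission, and keep the line resident as Shared.
 * Returns true if a write-back was actually performed. */
bool wb_allow(wb_line *l) {
    assert(l != NULL);
    bool wrote_back = false;
    if (l->state == LINE_MODIFIED) {
        l->outer = l->local; /* outer level now has current data */
        wrote_back = true;
    }
    if (l->state != LINE_INVALID)
        l->state = LINE_SHARED; /* still resident, no longer owner */
    return wrote_back;
}
```

After `wb_allow`, a remote reader in this model could be served from the outer copy without first asking the former owner for the data.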
For L1 to L2, cache line granularity might be too fine for
'checkpointing' data from a merely parity protected L1 to an
ECC-protected L2, though My 66000's VVM (with appropriate
acceleration) might make substantial blocks fast/low overhead.
On the other hand, assigning reliability factor at a page level
might be awkward from PTE bit starvation, granularity
inflexibility, and timing.
Would this also ensure data presence in outer cache/memory on a
clean line? E.g., if applied with an L2 target when L2 is non-
inclusive (but possibly tag inclusive or at least snoop
filtering) and the line is clean, would the line be written back
if not present in L2?
If one had a mode that disallowed escape of dirty lines, this
might be used as a means to commit temporary, local values. This
seems somewhat similar to a transactional memory mechanism,
though transactional memory would typically distinguish old
dirty lines (and perhaps clean ones) allowing them to be written
back on replacement.
I also wonder if this might be used to assist in determining
what cache indexes have been replaced in L2. With lazy writeback
the timing factors may be fuzzed more. My mind does not work
well for this type of problem.
Should an ISA contain an instruction that invalidates (without
writing back) a Data Cache (or L2) line ?? {Discard}
EricP pointed out a possible security issue if OS page zeroing
could be thwarted.
This could be worked around by having such
page (or cache line) zeroing use special cases that act as if
the zeroed memory was written back to memory. Forcing a
distinguishing between explicit zeroing to provide a base value
and zeroing to remove access to old data may facilitate software
bugs when the difference is not recognized/remembered.
This is similar to the problem that data cache block allocate
had where old data (that the current thread was not permitted to
read) of a possibly different address could be read. This was
generally "solved" by defining allocation as either no-op on a
cache hit and cache block zero on a miss.
Since the benefit of
doing nothing on a cache hit may not have been very beneficial
(one might use a bit of cache bandwidth) and zeroing provides
other benefits, block zeroing seems to be preferred (though I
still like allocation).
(This also is reminiscent of the Mill's unbacked memory, which
was memory that reads as zero [providing an implicit data cache
block zero] and has no physical memory address until evicted
from last level cache.
For highly temporary data, the data
would never leave the cache; this could also allow cache as
memory as long as no cache was forced to be written back. I do
not know if unbacked memory allowed an application to release
the memory, which would be like an invalidate without
writeback.)
Optimistic updates sounds similar to transactional memory or
versioned memory.
With respect to Stefan Monnier's seeing this as undefined
behavior, I think this might be presented similarly to memory
ordering with a weaker memory model. I.e., the result of a read
would still return a previously held value, but the "version"
might be unexpected. The result is not "undefined" but timing
dependent.
I suspect one would have to be very careful about defining how
such would interact with ESM (and perhaps other memory
interaction methods).
If the memory so cleared is thread local, I do not _think_
there would be consistency issues. (I think IBM defined "local"
memory transactions which supported speculation but not system
atomicity.) Yet I feel that there might be uses for value
checkpointing (versioning) where the address is shared by
multiple threads.
Obviously, hardware could in some cases interweave versions into
a consistent order, but forcing software to handle the cases
when hardware fails sounds problematic. Explicit checkpoints
like with transactional memory, might be easier for programmers
to use correctly than a fully flexible handling of speculation.
On the other hand, finer-grained control could allow software
to exploit knowledge that is not easily observed by (or
communicated to) hardware.
I think there are opportunities for versioned memory and/or
other timing/speculation manipulation, but I do not have a clue
about what interface should be presented to software. A RISC-
like approach of cache line control instructions could provide
flexibility, but the overhead for idiom recognition should also
be considered.
Modal operation (like transactional memory or ASM) simplifies
some aspects and complicates others.
I tend to favor complexity (flexibility), so my opinion is
dangerous.
Paul Clayton <paaronclayton@gmail.com> posted:
On 5/8/26 7:34 PM, MitchAlsup wrote:
Should an ISA contain an instruction that gives Write-Permission
from the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}
Intel added Cache Line WriteBack to memory to help with memory
persistence (IIRC), which can be viewed as a reliability
assertion (data will not be lost on power failure).
Thank you Paul! An interesting rationale.
There could
also be performance reasons for pushing data outward while
retaining it locally in a clean (shared) state; a remote request
for the data might have lower latency (sourcing directly from
L3, e.g., rather than an L3 coherence directory indicating where
the data is and having to request the data from and a state
change for the owner).
In a directory based caching system the directory is in a position
that ANY shared cache line can be granted into the Exclusive state {minimizing transfer distance}.
For L1 to L2, cache line granularity might be too fine for
'checkpointing' data from a merely parity protected L1 to an
ECC-protected L2, though My 66000's VVM (with appropriate
acceleration) might make substantial blocks fast/low overhead.
If you care about RAS, you cannot have write back L1 caches
with that property.
On the other hand, assigning reliability factor at a page level
might be awkward from PTE bit starvation, granularity
inflexibility, and timing.
Would this also ensure data presence in outer cache/memory on a
clean line? E.g., if applied with an L2 target when L2 is non-
inclusive (but possibly tag inclusive or at least snoop
filtering) and the line is clean, would the line be written back
if not present in L2?
A whole different can of worms.....
If one had a mode that disallowed escape of dirty lines, this
might be used as a means to commit temporary, local values. This
seems somewhat similar to a transactional memory mechanism,
though transactional memory would typically distinguish old
dirty lines (and perhaps clean ones) allowing them to be written
back on replacement.
Luckily, I have a fundamental disagreement on ISA-extensions that
provide SW the illusion that "lots of places" can be in intermediate
states (i.e. TM).
I also wonder if this might be used to assist in determining
what cache indexes have been replaced in L2. With lazy writeback
the timing factors may be fuzzed more. My mind does not work
well for this type of problem.
Should an ISA contain an instruction that invalidates (without
writing back) a Data Cache (or L2) line ?? {Discard}
EricP pointed out a possible security issue if OS page zeroing
could be thwarted.
Why is the OS zeroing a page that has already been mapped into
unprivileged VAS ???
Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
and the interconnect is designed to transport the page zero in one transaction.
This could be worked around by having such
page (or cache line) zeroing use special cases that act as if
the zeroed memory was written back to memory. Forcing a
distinguishing between explicit zeroing to provide a base value
and zeroing to remove access to old data may facilitate software
bugs when the difference is not recognized/remembered.
This is similar to the problem that data cache block allocate
had where old data (that the current thread was not permitted to
read) of a possibly different address could be read. This was
generally "solved" by defining allocation as either no-op on a
cache hit and cache block zero on a miss.
VVM is allowed to 'allocate' cache lines (CI without Read) when
a line boundary is crossed and more than 1 complete line remains
in the loop--saving interconnect BW and coherence messages.
Since the benefit of
doing nothing on a cache hit may not have been very beneficial
(one might use a bit of cache bandwidth) and zeroing provides
other benefits, block zeroing seems to be preferred (though I
still like allocation).
(This also is reminiscent of the Mill's unbacked memory, which
was memory that reads as zero [providing an implicit data cache
block zero] and has no physical memory address until evicted
from last level cache.
I always liked that feature. I could not work it into a more
conventional architecture, except for the 'known' program stack.
For highly temporary data, the data
would never leave the cache; this could also allow cache as
memory as long as no cache was forced to be written back. I do
not know if unbacked memory allowed an application to release
the memory, which would be like an invalidate without
writeback.)
Known stack.
Optimistic updates sounds similar to transactional memory or
versioned memory.
With respect to Stefan Monnier's seeing this as undefined
behavior, I think this might be presented similarly to memory
ordering with a weaker memory model. I.e., the result of a read
would still return a previously held value, but the "version"
might be unexpected. The result is not "undefined" but timing
dependent.
SW would consider this undefined--SW depends (way too much) on
a read returning exactly the last thing read.
I suspect one would have to be very careful about defining how
such would interact with ESM (and perhaps other memory
interaction methods).
Any kind of ATOMIC thing is WAY better to do it correct and
SLOW than to take ANY chance of doing it wrong.
If the memory so cleared is thread local, I do not _think_
there would be consistency issues. (I think IBM defined "local"
memory transactions which supported speculation but not system
atomicity.) Yet I feel that there might be uses for value
checkpointing (versioning) where the address is shared by
multiple threads.
Obviously, hardware could in some cases interweave versions into
a consistent order, but forcing software to handle the cases
when hardware fails sounds problematic. Explicit checkpoints
like with transactional memory, might be easier for programmers
to use correctly than a fully flexible handling of speculation.
On the other hand, finer-grained control could allow software
to exploit knowledge that is not easily observed by (or
communicated to) hardware.
I think there are opportunities for versioned memory and/or
other timing/speculation manipulation, but I do not have a clue
about what interface should be presented to software. A RISC-
like approach of cache line control instructions could provide
flexibility, but the overhead for idiom recognition should also
be considered.
Modal operation (like transactional memory or ASM) simplifies
some aspects and complicates others.
I tend to favor complexity (flexibility), so my opinion is
dangerous.
With respect to Stefan Monnier's seeing this as undefined
behavior, I think this might be presented similarly to memory
ordering with a weaker memory model.
I.e., the result of a read
would still return a previously held value, but the "version"
might be unexpected. The result is not "undefined" but timing
dependent.
On 5/10/26 8:33 PM, MitchAlsup wrote:[...]
The OS zeros the physical page before assigning it to the new
context (or more likely assigns a zero page and does copy on
write, which is just zeroing the page).
Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
and the interconnect is designed to transport the page zero in one
transaction.
This is more flexible than having cache line and page clearing
instructions.
Robert Finch <robfi680@gmail.com> posted:
On 2026-05-08 7:34 p.m., MitchAlsup wrote:
Should an ISA contain an instruction that gives Write-Permission
from the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}
Trying to fathom what is going on with this. Is it an issue with keeping
the cache coherent? Sounds like the D$ cache line was write-protected
and now it is to be made writable?
Consider the stack, and after adding a number to SP there are now
a bunch of lines that are neither accessible nor containing a useful
value.
On 5/10/26 8:33 PM, MitchAlsup wrote:--------------------------
If you care about RAS, you cannot have write back L1 caches
with that property.
Different customers may have different preferences. I seem to
recall that Intel offered the option to replicate parity-
protected L1 data cache to allow recovery (at the cost of half
the capacity).
VVM is allowed to 'allocate' cache lines (CI without Read) when
a line boundary is crossed and more than 1 complete line remains
in the loop--saving interconnect BW and coherence messages.
That may be the most common use for avoiding read-to-own, but it
is not the only use.
Any kind of ATOMIC thing is WAY better to do it correct and
SLOW than to take ANY chance of doing it wrong.
Yes, though slow can also motivate incorrect software. Being
able to clearly communicate the dangers also seems important
(which argues for simplicity/orthogonality).
Paul Clayton <paaronclayton@gmail.com> writes:
On 5/10/26 8:33 PM, MitchAlsup wrote:[...]
The OS zeros the physical page before assigning it to the new
context (or more likely assigns a zero page and does copy on
write, which is just zeroing the page).
Assigning a zero page for reading is a good idea. Copying that page
on writing appears inefficient to me, because it needs to read the
zero page into cache and write it to a newly allocated page.
A better approach is to do just the writes. I think that zeroing the
page on demand is a good approach, because then it is already in the
D-cache,
but AFAIK Linux actually zeros physical pages ahead of time typically on a separate (otherwise idle) core, and just maps one of
those pages to the virtual page that needs to be written to. I wonder
why Linux does that.
Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
and the interconnect is designed to transport the page zero in one
transaction.
This is more flexible than having cache line and page clearing
instructions.
In what way is it more flexible? It is a page-clearing instruction.
- anton
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Robert Finch <robfi680@gmail.com> posted:
On 2026-05-08 7:34 p.m., MitchAlsup wrote:
Should an ISA contain an instruction that gives Write-Permission
from the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}
Trying to fathom what is going on with this. Is it an issue with keeping
the cache coherent? Sounds like the D$ cache line was write-protected
and now it is to be made writable?
Consider the stack, and after adding a number to SP there are now
a bunch of lines that are neither accessible nor containing a useful
value.
Seems to me that the code will certainly call another function
almost immediately that will simply reuse the already
present stack cache line; prematurely invalidating it will
actually slow things down.
I see no benefit in invalidating it pre-emptively.
It would certainly cause problems for code that intentionally
uses the soi-disant "free" stack space in legal but unusual ways.
There are "architectures" like Power where "data memory" and
"instruction memory" are not coherent, even when they are the same
memory.
Upon updating instructions (e.g., from a JIT compiler), they require
that the modifying thread(s) write the lines back from the data
cache to a shared cache or main memory, and that the executing
threads invalidate these cache lines and flush their pipeline. I
think that that's a bad idea, not just because it exposes
microarchitectural concepts like cache and pipeline to the
architecture, and leads to unpredictable results in some usage
scenarios (see my signature), but also because the requirements on
the executing threads are extremely difficult to meet if the
executing threads run independently of the modifying thread(s). Or,
in short, IA-32 and AMD64 did the right architecture for that.
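For reference, portable C code usually hides this requirement behind a compiler builtin. The sketch below uses `__builtin___clear_cache`, which GCC and Clang provide; `publish_code` is a hypothetical helper name, and the buffer is assumed to already be mapped executable by the caller. On targets with coherent instruction fetch the builtin may expand to nothing, while on others it emits the write-back/invalidate sequence for the calling thread. As far as I know it does not by itself handle other threads that are concurrently executing the same code, which is the hard case described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy freshly generated machine code into code_buf, then ask the
 * compiler runtime to make that range visible to instruction fetch
 * on the current thread. code_buf is assumed to be executable
 * memory obtained elsewhere (not shown here). */
void publish_code(unsigned char *code_buf,
                  const unsigned char *generated, size_t len) {
    assert(code_buf != NULL && generated != NULL);
    memcpy(code_buf, generated, len);
    /* GCC/Clang builtin: synchronise I-side and D-side views of
     * [begin, end). May be a no-op where I-fetch is coherent. */
    __builtin___clear_cache((char *)code_buf, (char *)code_buf + len);
}
```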
Paul Clayton <paaronclayton@gmail.com> writes:
With respect to Stefan Monnier's seeing this as undefined
behavior, I think this might be presented similarly to memory
ordering with a weaker memory model.
Is that supposed to be a defense of the discard instruction? It
isn't. Descriptions of weak memory models are full of "undefined
behaviour".
Weak memory models are a bad idea like many supercomputer ideas (e.g., division with wrong results, or imprecise exceptions), but unlike
other bad ideas they have made it almost to general-purpose computing.
And they are exactly bad ideas because:
1) The unpredictable results if their restrictions are not heeded.
2) The difficulty of heeding these restrictions by adding close to the minimum necessary strongifying instructions (e.g., memory barriers and
atomic instructions). In particular, thanks to 1 there is no way to
check the correctness of the placement of these instructions by
testing.
3) The extreme performance cost of the strongifying instructions, so
when you use some simple scheme that guarantees correctness (e.g.,
inserting a write barrier after every store and a read barrier before
every load), the resulting program is extremely slow.
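To illustrate what "close to the minimum necessary" placement looks like in practice, here is a minimal C11 message-passing sketch. The names `producer_publish` and `consumer_try_read` are hypothetical. Only the flag accesses carry ordering (release on the store, acquire on the load); the payload access is an ordinary one. The two functions are meant to run on different threads. Nothing in a test run on a strongly ordered machine would reveal a missing release or acquire, which is the testing problem raised in point 2.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int         payload; /* ordinary, non-atomic data        */
static atomic_bool ready;   /* flag that publishes the payload  */

/* Producer side: write the data, then publish it. The release
 * store orders the payload write before the flag becomes visible. */
void producer_publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: the acquire load orders the flag read before the
 * payload read. Returns false if nothing has been published yet. */
bool consumer_try_read(int *out) {
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

The "simple scheme" from point 3 would correspond to making every access sequentially consistent or surrounding it with full fences, which is easier to reason about but costs more on weakly ordered hardware.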
In the case of weak memory models the hardware designers have the
excuse that they are too lazy to implement a strong memory model
efficiently (although they typically frame it by showing the
inefficiency of some lazy implementation of a strong memory model),
and that not that big parts of the software actually communicate with
other threads.
But I think that the chilling effects of difficulties in inter-thread communication have kept that back. But difficulties already exist
with sequential consistency; transactional memory looked like it might
come to the rescue, but after the hype from about 20 years ago is now
in the valley of disappointment.
I.e., the result of a read
would still return a previously held value, but the "version"
might be unexpected. The result is not "undefined" but timing
dependent.
"Undefined behaviour" typically originally means something where the
people specifying it have a good idea what can happen, but where it is
too complex and has too little benefit to actually specify it. E.g.,
an out-of-bounds access to an object resulted in actually accessing
that memory, but what is there is not specified and is implementation-dependent, and it might be that the address is not
accessible, and results in a trap (and it seems to me that everything
that may result in a trap on some machines has been labeled "undefined behaviour" in C; note the fine differences between
implementation-defined and undefined behaviour for shifts in C). Only
later compiler shenanigans like "optimizing" a loop that performs an out-of-bounds access into an endless loop were introduced and
justified with the specified undefined behaviour.
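The shift case can be shown in a few lines of C. In standard C, shifting by a count greater than or equal to the width of the promoted operand is undefined behaviour, while right-shifting a negative signed value is implementation-defined. The helper below (`checked_shl_uint` is a hypothetical name) simply refuses the undefined case instead of relying on whatever a given machine's shifter does.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Left-shift an unsigned int only when the count is in range.
 * Returns false (and leaves *out untouched) when the shift would
 * be undefined behaviour, i.e. count >= width of unsigned int. */
bool checked_shl_uint(unsigned int x, unsigned int count,
                      unsigned int *out) {
    assert(out != NULL);
    if (count >= sizeof x * CHAR_BIT)
        return false;  /* would be undefined behaviour in C */
    *out = x << count; /* well defined: wraps modulo 2^width */
    return true;
}
```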
If My 66000 ever becomes a popular architecture with many
implementations, and if discard's effect has been specified as making
any read access before a write access to any memory in the cache line "undefined behaviour", we may see implementations that implement
discard with effects that do not reflect your expectations at all.
MitchAlsup <user5857@newsgrouper.org.invalid> posted:[snip cache control instruction inclusion question]
Robert Finch <robfi680@gmail.com> posted:
I broke the instruction into 3 sub-groups::
a) prefetch
b) invalidate
c) post-push
Prefetch brings data closer to the CPU (caches) and provides a specifier
to which cache {I$, D$, L2, L3} and whether one wants write permission
(or not).
Invalidate gets rid of cached data without writing back.
Post-Push pushes modified data farther from the CPU caches.
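Of the three sub-groups, only prefetch has a widely available C-level counterpart that I am sure of: the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)`, where `rw = 1` hints that the line will be written. The sketch below is not My 66000 code; `sum_and_double` and `PREFETCH_AHEAD` are hypothetical, and the builtin is only a hint that the hardware may ignore.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_AHEAD 8 /* hypothetical distance, in elements */

/* Sum the array while doubling each element in place. Because
 * every element is also written, the prefetch asks for the line
 * with write intent (rw = 1) and high temporal locality (3). */
long sum_and_double(long *a, size_t n) {
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 1, 3);
        sum += a[i];
        a[i] *= 2;
    }
    return sum;
}
```

Invalidate and post-push have no equally portable builtin that I know of; they are normally reached through target-specific intrinsics or inline assembly.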
I launched this topic because I can put as many as 32 instructions in this sub-group, and after months of thinking, I only found 19 to put there.
{yes, this violates the notion that the R in RISC should mean reduced}
Turn cache on/off (D$ only)
Why would you want the cache turned off??
Both the I$ and D$ caches can be invalidated with a single instruction.
That may take a long time !
On 5/11/26 2:39 AM, Anton Ertl wrote:
Paul Clayton <paaronclayton@gmail.com> writes:
With respect to Stefan Monnier's seeing this as undefined
behavior, I think this might be presented similarly to memory
ordering with a weaker memory model.
Is that supposed to be a defense of the discard instruction? It
isn't. Descriptions of weak memory models are full of "undefined behaviour".
Yes, or at least a statement that if a weaker memory model is
accepted (which I think Mitch chose for My 66000) then such
undefined behavior could be acceptable.
Weak memory models are a bad idea like many supercomputer ideas (e.g., division with wrong results,
or imprecise exceptions),
but unlike
other bad ideas they have made it almost to general-purpose computing.
And they are exactly bad ideas because:
1) The unpredictable results if their restrictions are not heeded.
Is this not also true for sequential consistency without locks
(i.e., with race conditions). If locks (or often challenging to
design lockfree methods) are acceptable "restrictions", then why
not allow memory barriers.
If locks do not implement barriers themselves in a given
programming language/library (which has performance costs so a
"zero abstraction cost" language like C++ might want to avoid
such), this would presumably add significant correctness
risk/complexity.
2) The difficulty of heeding these restrictions by adding close to the minimum necessary strongifying instructions (e.g., memory barriers and atomic instructions). In particular, thanks to 1 there is no way to
check the correctness of the placement of these instructions by
testing.
I think hardware could provide detection of at least some race
conditions. This could also help with debugging improper lock
use.
Using the minimal memory barrier strengths seems like using
lock-free programming; there might be some cases that are
trivial and safe, but I would tend to push such toward "experts
only" and the performance gains would typically not be that
great if "moderate" effort is made in hardware.
Not being a hardware person, I do not know how expensive (in power-performance-area or design complexity) sequential
consistency is relative to a more relaxed model with significant
optimization of barriers.
I _suspect_ such depends on the
communication between threads (cores/caches), perhaps especially
the scale of the system.
3) The extreme performance cost of the strongifying instructions, so
when you use some simple scheme that guarantees correctness (e.g., inserting a write barrier after every store and a read barrier before
every load), the resulting program is extremely slow.
This assumes that barriers are expensive. This is similar to
assuming that context switches are expensive; it is a historical
fact but not a technical necessity.
*If* there is very little advantage to optimized barrier weaker
consistency, then using a simpler abstraction would probably be
a better choice.
In the case of weak memory models the hardware designers have the
excuse that they are too lazy to implement a strong memory model efficiently (although they typically frame it by showing the
inefficiency of some lazy implementation of a strong memory model),
and that not that big parts of the software actually communicate with
other threads.
While not wanting to do hard and underappreciated work is
understandable and humans will tend to justify their choices to
themselves and others using less than perfect argumentation, I
think the above statement is too hostile to hardware designers.
There may well be a "laziness" in not researching what the costs
are in current software,
how much a weaker memory model
discourages new useful software techniques,
and how complex and
effective hardware mechanisms for providing stronger consistency
models are. But laziness is generally a combination of fear,
fatigue, and lack of motivation; adding complexity increases
schedule risk (fear) and without customers pushing for a feature
(and a belief that the feature can be delivered within time and
financial budgets) motivation will be lower. I would not be
surprised that there is also a fatigue factor.
Calling a person lazy seems generally ineffective in reducing
fear or increasing motivation much less reducing fatigue. I
suspect this strategy is also not very effective at an
organization level.
On a personal level, one can even know that fears are
irrational, that most of the fatigue is psychologically induced,
and that getting something done is usually good yet be unable to
act.
Incremental exposure tends to help with fears; how can hardware
designers incrementally add complexity with respect to memory
model [my guess would be by working on reducing the cost of
barriers in many cases while still requiring them]? Fatigue can
be countered by limiting analysis and receiving fresh
perspectives; maybe at the end of the AI hype cycle there could be
a burst of activity in multithreaded programming and maybe
academia or "for profit" research organizations can ease the
perception of risk that encourages overanalysis (cheap capital
might also help, so again after the AI hype cycle). Research and
development could also address motivation; while it has been
known for decades that the cost of sequential consistency could
be substantially reduced by speculative out-of-order execution,
the tradeoffs change with time (both in hardware complexity and
type of software encouraging hardware development).
But I think that the chilling effects of difficulties in inter-thread communication have kept that back. But difficulties already exist
with sequential consistency; transactional memory looked like it might
come to the rescue, but after the hype from about 20 years ago is now
in the valley of disappointment.
Interestingly (to me at least) the hardware behind transaction
memory can also be used to detect when a thread is accessing
(possibly) lock-protected memory without acquiring a lock.
Even with the end of Dennard scaling and other hardware factors (as
well as theoretical ILP limits) reducing the single-thread
performance improvement, it is disappointing that
multithreaded programs have not become more common (except that
many uses have good enough performance, which is not
disappointing).
I suspect transaction memory was sold too much as a trivial
solution and some issues were not well understood.
I also
suspect that a more gradual adoption with long-range plans for
extension (which plans could be adjusted with gained experience)
might have been more effective.
(The issue I have with limited optimistic concurrency mechanisms
like AMD's Advanced Synchronization Facility and My 66000's
Exotic Synchronization Mechanism is not the initial limits but
that there seems to be little presentation of an interface that
can be extended.
Just as such mechanisms avoid requiring new instructions
for every new simpler atomic operation, a broader interface
conception might allow extension without adding new
instructions.
Of course, just as early broad software
abstractions present the risk of choosing the wrong abstraction
from lack of experience, having too many exceptional cases, and
delaying release, an ISA can be designed with excessive
flexibility that is not exploited much later and has immediate
costs.)
I.e., the result of a read
would still return a previously held value, but the "version"
might be unexpected. The result is not "undefined" but timing
dependent.
"Undefined behaviour" typically originally means something where the
people specifying it have a good idea what can happen, but where it is
too complex and has too little benefit to actually specify it. E.g.,
an out-of-bounds access to an object resulted in actually accessing
that memory, but what is there is not specified and is implementation-dependent, and it might be that the address is not accessible, and results in a trap (and it seems to me that everything
that may result in a trap on some machines has been labeled "undefined behaviour" in C; note the fine differences between
implementation-defined and undefined behaviour for shifts in C). Only later compiler shenanigans like "optimizing" a loop that performs an out-of-bounds access into an endless loop were introduced and
justified with the specified undefined behaviour.
There is also Hyrum's Law, that non-architectural behavior (like
order of hash table elements) which is seemingly consistent
within a set of implementations will be assumed. The non-
architectural behavior can either become architectural (at least
within the context of implementations used by the set of
software depending on that behavior) or software will break
(possibly quietly).
(This also seems related to the Robustness Principle, "be
conservative in what you send, be liberal in what you accept".
While such allows communication to work more often by ignoring 'inconsequential' errors, it also encourages misbehavior by not
giving feedback.)
For C, I got the impression that much of the undefined behavior
was initially meant to allow support for things like hardware
that generated exceptions on signed integer overflow. These
should have been defined as target-dependent behavior. Some
behaviors may need to be platform-dependent or even compiler-
dependent (or flag dependent if the compiler developer wants to
error on not specifying a behavior rather than having a default
behavior that can optionally be overridden). I am not certain
how array overruns could be handled; in some dynamic cases such
would generate a protection exception, in some cases it could
cause arbitrary code execution.
If MY66000 ever becomes a popular architecture with many
implementations, and if discard's effect has been specified as making
any read access before a write access to any memory in the cache line "undefined behaviour", we may see implementations that implement
discard with effects that do not reflect your expectations at all.
I would expect Mitch to define the behavior within certain
bounds, including, e.g., "returns a value previously present and
accessible to the context". If the purpose is to support
something like thread-local transactional memory (where aborted
speculation can recover old values), there would have to be some
constraints for such to be useful (e.g., to prevent a
speculative value from being written back to the level of the
memory hierarchy that is treated as authoritative).
On 5/11/26 3:29 AM, Anton Ertl wrote:
A better approach is to do just the writes. I think that zeroing the
page on demand is a good approach, because then it is already in the
D-cache, but AFAIK Linux actually zeros physical pages ahead of time
typically on a separate (otherwise idle) core, and just maps one of
those pages to the virtual page that needs to be written to. I wonder
why Linux does that.
Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
and the interconnect is designed to transport the page zero in one
transaction.
This is more flexible than having cache line and page clearing
instructions.
In what way is it more flexible? It is a page-clearing instruction.
My 66000's memory set instruction is not limited to a page size
defined when the instruction was generated. IBM's Data Cache
Block Zero instruction had a compatibility problem when software
written for early PowerPC caches was to be run on POWER (G5)
with 128-byte cache blocks.
If one architecturally defines cache block size and page size,
one is stuck working around that if a different size is better.
On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:
There are "architectures" like Power where "data memory" and
"instruction memory" are not coherent, even when they are the same
memory.
Also the Motorola 68040.
Upon updating instructions (e.g., from a JIT compiler), they require
that the modifying thread(s) write the lines back from the data
cache to a shared cache or main memory, and that the executing
threads invalidate these cache lines and flush their pipeline. I
think that that's a bad idea, not just because it exposes
microarchitectural concepts like cache and pipeline to the
architecture, and leads to unpredictable results in some usage
scenarios (see my signature), but also because the requirements on
the executing threads are extremely difficult to meet if the
executing threads run independently of the modifying thread(s). Or,
in short, IA-32 and AMD64 did the right architecture for that.
One technique for implementing lexical binding and functions as
first-class objects involves generating code at run-time. Some people
would immediately gasp and say "self-modifying code" as soon as I
mentioned this, even though the two are quite different things.
I think it's quite desirable that an architecture guarantees that an
(to coin a phrase) "instruction view" versus a "data view" of the same
memory location will never show different values.
On 5/11/26 2:39 AM, Anton Ertl wrote:
Paul Clayton <paaronclayton@gmail.com> writes:
Weak memory models are a bad idea like many supercomputer ideas (e.g.,
division with wrong results, or imprecise exceptions), but unlike
other bad ideas they have made it almost to general-purpose computing.
And they are exactly bad ideas because:
1) The unpredictable results if their restrictions are not heeded.
Is this not also true for sequential consistency without locks
(i.e., with race conditions).
If locks (or often challenging to
design lockfree methods) are acceptable "restrictions", then why
not allow memory barriers.
2) The difficulty of heeding these restrictions by adding close to the
minimum necessary strongifying instructions (e.g., memory barriers and
atomic instructions). In particular, thanks to 1 there is no way to
check the correctness of the placement of these instructions by
testing.
I think hardware could provide detection of at least some race
conditions.
Using the minimal memory barrier strengths seems like using
lock-free programming; there might be some cases that are
trivial and safe, but I would tend to push such toward "experts
only"
and the performance gains would typically not be that
great if "moderate" effort is made in hardware.
Not being a hardware person, I do not know how expensive (in
power-performance-area or design complexity) sequential
consistency is relative to a more relaxed model with significant
optimization of barriers.
3) The extreme performance cost of the strongifying instructions, so
when you use some simple scheme that guarantees correctness (e.g.,
inserting a write barrier after every store and a read barrier before
every load), the resulting program is extremely slow.
This assumes that barriers are expensive.
This is similar to
assuming that context switches are expensive; it is a historical
fact but not a technical necessity.
In the case of weak memory models the hardware designers have the
excuse that they are too lazy to implement a strong memory model
efficiently (although they typically frame it by showing the
inefficiency of some lazy implementation of a strong memory model),
and that only a small part of the software actually communicates with
other threads.
While not wanting to do hard and underappreciated work is
understandable and humans will tend to justify their choices to
themselves and others using less than perfect argumentation, I
think the above statement is too hostile to hardware designers.
There may well be a "laziness"
Calling a person lazy seems generally ineffective
...But I think that the chilling effects of difficulties in inter-thread
communication have kept that back. But difficulties already exist
with sequential consistency; transactional memory looked like it might
come to the rescue, but after the hype from about 20 years ago is now
in the valley of disappointment.
Even with the end of Dennard scaling and other hardware factors (as
well as theoretical ILP limits) reducing the single-thread
performance improvement, it is disappointing that
multithreaded programs have not become more common (except that
many uses have good enough performance, which is not
disappointing).
I suspect transaction memory was sold too much as a trivial
solution and some issues were not well understood.
AMD's Advanced Synchronization Facility
(This also seems related to the Robustness Principle, "be
conservative in what you send, be liberal in what you accept".
While such allows communication to work more often by ignoring
'inconsequential' errors, it also encourages misbehavior by not
giving feedback.)
On 5/13/2026 2:02 AM, Lawrence D'Oliveiro wrote:
On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:
There are "architectures" like Power where "data memory" and
"instruction memory" are not coherent, even when they are the same
memory.
Also the Motorola 68040.
Upon updating instructions (e.g., from a JIT compiler), they require
that the modifying thread(s) write the lines back from the data
cache to a shared cache or main memory, and that the executing
threads invalidate these cache lines and flush their pipeline. I
think that that's a bad idea, not just because it exposes
microarchitectural concepts like cache and pipeline to the
architecture, and leads to unpredictable results in some usage
scenarios (see my signature), but also because the requirements on
the executing threads are extremely difficult to meet if the
executing threads run independently of the modifying thread(s). Or,
in short, IA-32 and AMD64 did the right architecture for that.
One technique for implementing lexical binding and functions as
first-class objects involves generating code at run-time. Some people
would immediately gasp and say "self-modifying code" as soon as I
mentioned this, even though the two are quite different things.
I think it's quite desirable that an architecture guarantees that an
(to coin a phrase) "instruction view" versus a "data view" of the same
memory location will never show different values.
Sometimes, there is a difference between "nice to have", vs "cost effective".
It is nicer, say, to have I$/D$ coherence, and to not require explicit flushing and invalidation. Or, say, to have caches that are implicitly coherent between threads (Core A stores to a location, Core B loads from that location, Core B sees what Core A stored).
The requirements to pull all this off in practice may add significant
costs; and also in ways where the performance cost of the coherence mechanisms tend to scale upwards as core counts increase.
Say, for example, if one has coherent caches, software that depends on
the cache-coherent behavior, and much more than 2 or 4 cores, it is not difficult to imagine scenarios where waiting on cache-coherence
mechanisms becomes a more significant cost than actual memory-transfer bandwidth on the bus.
Say, typical scenario with incoherent caches:
Core A Requests Line (for Write);
Core B Requests Line (also for Write);
L2 Cache sends a copy to A;
L2 Cache sends a copy to B.
A and B now have incoherent copies.
Versus Say:
Core A Requests Line (for Write);
Core B Requests Line (also for Write);
L2 Cache sends a copy to A;
L2 Cache rejects B's Request;
L2 Cache sends a request to A to write line back;
Core A writes line back (flushing it locally);
(Maybe) L2 signals to Core B that the line is now available.
Core B Requests Line again (retry);
L2 Cache sends a copy to B.
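
The two scenarios above can be sketched as a toy model. This is only an illustration under simplifying assumptions (one line, two cores, each core increments the line once); the class and function names are hypothetical and the message counting is a rough stand-in for ring or bus traffic, not a description of any real implementation.

```python
class IncoherentL2:
    """L2 hands a private copy to every requester; nothing tracks owners."""
    def __init__(self, value=0):
        self.value = value

    def request_for_write(self, core):
        return self.value              # each core gets its own copy

    def write_back(self, core, value):
        self.value = value             # last write-back wins


class SingleWriterL2:
    """L2 allows one writable copy; a second request recalls the first."""
    def __init__(self, value=0):
        self.value = value
        self.owner = None              # core holding the writable copy
        self.messages = 0              # rough count of protocol messages

    def request_for_write(self, core, caches):
        self.messages += 1             # the request itself
        if self.owner is not None and self.owner != core:
            self.messages += 3         # reject + recall + owner's write-back
            self.value = caches.pop(self.owner)
            self.messages += 1         # requester retries
        self.owner = core
        self.messages += 1             # line sent to requester
        caches[core] = self.value


def run_incoherent():
    l2 = IncoherentL2()
    a = l2.request_for_write("A") + 1  # A increments its copy
    b = l2.request_for_write("B") + 1  # B increments a stale copy
    l2.write_back("A", a)
    l2.write_back("B", b)
    return l2.value                    # one increment is lost


def run_single_writer():
    l2 = SingleWriterL2()
    caches = {}
    l2.request_for_write("A", caches)
    caches["A"] += 1
    l2.request_for_write("B", caches)  # forces A's write-back first
    caches["B"] += 1
    l2.value = caches.pop("B")         # B's final write-back
    return l2.value, l2.messages
```

In this model the incoherent L2 loses one of the two increments, while the single-writer L2 keeps both at the price of extra messages for the contended request.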
BGB <cr88192@gmail.com> posted:
On 5/13/2026 2:02 AM, Lawrence D'Oliveiro wrote:
On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:
There are "architectures" like Power where "data memory" and
"instruction memory" are not coherent, even when they are the same
memory.
Also the Motorola 68040.
Upon updating instructions (e.g., from a JIT compiler), they require
that the modifying thread(s) write the lines back from the data
cache to a shared cache or main memory, and that the executing
threads invalidate these cache lines and flush their pipeline. I
think that that's a bad idea, not just because it exposes
microarchitectural concepts like cache and pipeline to the
architecture, and leads to unpredictable results in some usage
scenarios (see my signature), but also because the requirements on
the executing threads are extremely difficult to meet if the
executing threads run independently of the modifying thread(s). Or,
in short, IA-32 and AMD64 did the right architecture for that.
One technique for implementing lexical binding and functions as
first-class objects involves generating code at run-time. Some people
would immediately gasp and say "self-modifying code" as soon as I
mentioned this, even though the two are quite different things.
I think it's quite desirable that an architecture guarantees that an
(to coin a phrase) "instruction view" versus a "data view" of the same
memory location will never show different values.
Sometimes, there is a difference between "nice to have", vs "cost
effective".
It is nicer, say, to have I$/D$ coherence, and to not require explicit
flushing and invalidation. Or, say, to have caches that are implicitly
coherent between threads (Core A stores to a location, Core B loads from
that location, Core B sees what Core A stored).
The requirements to pull all this off in practice may add significant
costs; and also in ways where the performance cost of the coherence
mechanisms tend to scale upwards as core counts increase.
Say, for example, if one has coherent caches, software that depends on
the cache-coherent behavior, and much more than 2 or 4 cores, it is not
difficult to imagine scenarios where waiting on cache-coherence
mechanisms becomes a more significant cost than actual memory-transfer
bandwidth on the bus.
Say, typical scenario with incoherent caches:
Core A Requests Line (for Write);
Core B Requests Line (also for Write);
L2 Cache sends a copy to A;
L2 Cache sends a copy to B.
A and B now have incoherent copies.
Try the above using an incoherent TLB model !!!
Oh and BTW, the same arguments for cache coherence also argue for
TLB coherence. It is just easier for everyone.
Versus Say:
Core A Requests Line (for Write);
Core B Requests Line (also for Write);
L2 Cache sends a copy to A;
L2 Cache rejects B's Request;
L2 Cache sends a request to A to write line back;
Core A writes line back (flushing it locally);
(Maybe) L2 signals to Core B that the line is now available.
Core B Requests Line again (retry);
L2 Cache sends a copy to B.
BGB <cr88192@gmail.com> writes:
It is nicer, say, to have I$/D$ coherence, and to not require explicit
flushing and invalidation. Or, say, to have caches that are implicitly
coherent between threads (Core A stores to a location, Core B loads from
that location, Core B sees what Core A stored).
The requirements to pull all this off in practice may add significant
costs; and also in ways where the performance cost of the coherence
mechanisms tend to scale upwards as core counts increase.
OTOH, again and again, I found that all the edge cases of explicit, software-driven coherence required pessimistic assumptions which were
slower than leaning on hardware. Pick the tiny subset of SW behaviors
which you'll support--and include enforcement--or else be prepared for
the steep slope downward.
Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
No AI was used in the composition of this message
Paul Clayton <paaronclayton@gmail.com> posted:
On 5/11/26 2:39 AM, Anton Ertl wrote:
Paul Clayton <paaronclayton@gmail.com> writes:
With respect to Stefan Monnier's seeing this as undefined
behavior, I think this might be presented similarly to memory
ordering with a weaker memory model.
Is that supposed to be a defense of the discard instruction? It
isn't. Descriptions of weak memory models are full of "undefined
behaviour".
Yes, or at least a statement that if a weaker memory model is
accepted (which I think Mitch chose for My 66000) then such
undefined behavior could be acceptable.
My 66000 memory model is rather strong and depends on memory type
and whether an ATOMIC event is in progress.
Configuration space access is strongly ordered
MMI/O access is sequentially consistent
ROM access is completely unordered
Cacheable access with no ATOMIC is causally consistent
Cacheable access with ATOMIC is sequentially consistent
Weak memory models are a bad idea like many supercomputer ideas (e.g.,
division with wrong results,
You seem to be categorizing BGB ISA as a supercomputer.
or imprecise exceptions),
Or none at all {many RISCs}
On 5/13/2026 2:02 AM, Lawrence D'Oliveiro wrote:
On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:
There are "architectures" like Power where "data memory" and
"instruction memory" are not coherent, even when they are the same
memory.
Also the Motorola 68040.
Upon updating instructions (e.g., from a JIT compiler), they require
that the modifying thread(s) write the lines back from the data
cache to a shared cache or main memory, and that the executing
threads invalidate these cache lines and flush their pipeline. I
think that that's a bad idea, not just because it exposes
microarchitectural concepts like cache and pipeline to the
architecture, and leads to unpredictable results in some usage
scenarios (see my signature), but also because the requirements on
the executing threads are extremely difficult to meet if the
executing threads run independently of the modifying thread(s). Or,
in short, IA-32 and AMD64 did the right architecture for that.
One technique for implementing lexical binding and functions as
first-class objects involves generating code at run-time. Some people
would immediately gasp and say "self-modifying code" as soon as I
mentioned this, even though the two are quite different things.
I think it's quite desirable that an architecture guarantees that an
(to coin a phrase) "instruction view" versus a "data view" of the same
memory location will never show different values.
Sometimes, there is a difference between "nice to have", vs "cost effective".
It is nicer, say, to have I$/D$ coherence, and to not require explicit flushing and invalidation. Or, say, to have caches that are implicitly coherent between threads (Core A stores to a location, Core B loads from that location, Core B sees what Core A stored).
The requirements to pull all this off in practice may add significant
costs; and also in ways where the performance cost of the coherence mechanisms tend to scale upwards as core counts increase.
Say, for example, if one has coherent caches, software that depends on
the cache-coherent behavior, and much more than 2 or 4 cores, it is not difficult to imagine scenarios where waiting on cache-coherence
mechanisms becomes a more significant cost than actual memory-transfer bandwidth on the bus.
Say, typical scenario with incoherent caches:
  Core A Requests Line (for Write);
  Core B Requests Line (also for Write);
  L2 Cache sends a copy to A;
  L2 Cache sends a copy to B.
A and B now have incoherent copies.
Versus Say:
  Core A Requests Line (for Write);
  Core B Requests Line (also for Write);
  L2 Cache sends a copy to A;
  L2 Cache rejects B's Request;
  L2 Cache sends a request to A to write line back;
  Core A writes line back (flushing it locally);
  (Maybe) L2 signals to Core B that the line is now available.
  Core B Requests Line again (retry);
  L2 Cache sends a copy to B.
In my approach, I went with incoherent caches, but with a special
Volatile mechanism for some cases, say:
  Core A requests a line for Volatile Write;
  Core B Requests Line (also for Volatile Write);
  L2 Cache sends a copy to A;
  L2 Cache ignores B's Request (it can cycle the ring some more);
    L2 cache can track volatile lines and see that it is in-use.
  Core A writes back line and flushes local copy;
    L2 cache then marks the volatile access as complete.
  L2 Cache sends a copy to B
    Via the original request cycling around and hitting L2 again
  Core B writes back line and flushes local copy;
    L2 cache then marks the volatile access as complete.
Because volatile accesses flush the cached dirty lines immediately, this means that there is a performance penalty, but these accesses can remain coherent (but without the impact of trying to make all memory coherent).
For something like an inter-processor JIT, this would alas still require flushing the L1 caches in a way that is coordinated between threads.
Normally, the mutex mechanism does not include I$ flushes, though one possibility could be to have, say, a separate JIT mutex lock, where if threads (upon trying to lock a mutex) see a JIT Sequence Number that
does not match the expected value for that mutex on that processor core,
it also triggers an I$ flush.
Say:
  JIT Lock:
    Flush Caches;
    Lock Mutex;
    Increment JIT Sequence Number (JSN).
  Do stuff;
  Flush Caches;
  Unlock Mutex;
    Flush Caches;
    Set mutex to unlocked.
  Lock Mutex (Normal):
    Flush Caches;
    Lock Mutex;
    Check JSN against cores' current JSN;
      If mismatch, flush I$ and update core's JSN.
      Likely all via CPUID and a lookup table, not new arch.
  Do Stuff;
  Unlock Mutex:
    ...
  ...
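
A minimal sketch of the JSN idea above, with cache flushes replaced by counters so the logic can be followed. All names here (JitAwareMutex, Core, seen_jsn) are hypothetical, and the per-core table is an assumption about how the "lookup table" might be kept; it is not taken from any actual implementation.

```python
import threading


class JitAwareMutex:
    def __init__(self):
        self._lock = threading.Lock()
        self.jsn = 0                     # JIT Sequence Number for this mutex


class Core:
    def __init__(self, name):
        self.name = name
        self.seen_jsn = {}               # last JSN this core saw, per mutex
        self.dcache_flushes = 0          # counters stand in for real flushes
        self.icache_flushes = 0

    def flush_caches(self):
        self.dcache_flushes += 1

    def flush_icache(self):
        self.icache_flushes += 1

    def jit_lock(self, m):
        """Lock taken by a thread that is about to modify code."""
        self.flush_caches()
        m._lock.acquire()
        m.jsn += 1                       # tell later lockers that code changed

    def lock(self, m):
        """Normal lock: flush I$ only if a JIT update happened since last time."""
        self.flush_caches()
        m._lock.acquire()
        if self.seen_jsn.get(m, 0) != m.jsn:
            self.flush_icache()
            self.seen_jsn[m] = m.jsn

    def unlock(self, m):
        self.flush_caches()
        m._lock.release()


m = JitAwareMutex()
a, b = Core("A"), Core("B")

a.jit_lock(m)        # core A generates code
a.unlock(m)

b.lock(m)            # core B sees a new JSN and flushes its I$
b.unlock(m)
b.lock(m)            # no JIT activity since: no extra I$ flush
b.unlock(m)
```

In this sketch the JIT-ing core does not record the new JSN for itself, so it also flushes its own I$ on its next normal lock; whether that is wanted is a design choice the outline above leaves open.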
Consider a GBOoO machine under sequential consistency, a LD which
can have its address calculated early cannot leave the CPU area
until all older stores currently in flight have left the CPU area.
This would dramatically add to L1 cache miss latency, and would
add moderately to L1 cache hit latency.
Can't the GBOoO send the LD out early/speculatively, and do a kind of
branch-recovery if that memory location is later modified in a way that
changes what the LD should have received?
Of course, that too comes with a cost (that of keeping track of all
those memory accesses that may have to be re-done), but it's not obvious
to me that it would necessarily be impractical.
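
The store-to-load ordering at issue here is the one exercised by the classic store-buffering litmus test. The sketch below is only an illustration: it enumerates every interleaving of two two-instruction threads, first with program order kept (sequential consistency) and then with each thread's load allowed to pass its own older store. The encoding of operations is made up for this example.

```python
# Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
T0 = [("st", "x"), ("ld", "y", "r0")]
T1 = [("st", "y"), ("ld", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps the order within each list."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(ops):
    """Execute one global order against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op in ops:
        if op[0] == "st":
            mem[op[1]] = 1
        else:
            regs[op[2]] = mem[op[1]]
    return regs["r0"], regs["r1"]


# Sequential consistency: each thread's operations appear in program order.
sc_outcomes = {run(s) for s in interleavings(T0, T1)}

# Relaxed store->load order: a load may also be performed before the
# older store of the same thread (modelled by reversing the pair).
relaxed_outcomes = {
    run(s)
    for a in (T0, T0[::-1])
    for b in (T1, T1[::-1])
    for s in interleavings(a, b)
}
```

Under these assumptions the outcome r0 = r1 = 0 never appears in sc_outcomes but does appear in relaxed_outcomes. A speculative implementation may still perform the load early, provided it can detect and undo exactly the executions that would otherwise expose that outcome.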
Stefan Monnier <monnier@iro.umontreal.ca> writes:
Consider a GBOoO machine under sequential consistency, a LD which
can have its address calculated early cannot leave the CPU area
until all older stores currently in flight have left the CPU area.
This would dramatically add to L1 cache miss latency, and would
add moderately to L1 cache hit latency.
Can't the GBOoO send the LD out early/speculatively, and do a kind of
branch-recovery if that memory location is later modified in a way that
changes what the LD should have received?
Of course it can, although I would not call it "branch" recovery.
The person you cited without attribution (to protect the guilty?)
exhibits what I called the laziness of hardware designers: Instead of thinking how to implement sequential consistency efficiently, they
think about rationalizations for not doing so.
Of course, that too comes with a cost (that of keeping track of all
those memory accesses that may have to be re-done), but it's not obvious
to me that it would necessarily be impractical.
Yes, the whole architectural state of the core would have to be reset.
The major challenge for using the classical implementation of
speculative execution (with register renaming, speculative store
buffer, and reorder buffer) is the worst-case latency of inter-core communication. E.g., I see at <https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
(multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
range, and I expect that if an architecture provides sequential
consistency, there are more incentives to bring that latency number
down. OTOH, with multi-socket machines, the latency tends to be
higher. Anyway, let's work with the 90ns number. That's about 500
cycles at the higher Zen5 clock rates, and is 4000 potential
instruction slots; the Zen5 ROB only has 448 entries, so one probably
will not extend the ROB approach to deal with sequential consistency.
A snapshot-and-recovery mechanism might work, based on epochs on the
order of the maximum communication latency.
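
The arithmetic behind the "about 500 cycles" and "4000 potential instruction slots" figures can be reproduced as follows. The clock rate and the number of slots per cycle are not stated in the text; the values below are assumptions chosen to match the rounded numbers given.

```python
latency_ns = 90        # inter-core latency used in the text
clock_ghz = 5.5        # ASSUMPTION: a "higher" Zen 5 clock rate
slots_per_cycle = 8    # ASSUMPTION: instruction slots per cycle
rob_entries = 448      # Zen 5 ROB size, from the text

cycles = latency_ns * clock_ghz        # ns * (cycles per ns)
slots = cycles * slots_per_cycle       # instruction slots within one latency
rob_multiple = slots / rob_entries     # how many ROBs' worth that would be

print(f"{cycles:.0f} cycles, {slots:.0f} slots, {rob_multiple:.1f}x the ROB")
```

With these assumed values the window needed to cover one worst-case communication latency is several times larger than the existing ROB, which is the point being made above.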
Then we have to think about how to prevent (not mitigate) Spectre for
such a mechanism; yes, hardware designers currently don't do anything
about preventing Spectre, and they probably will not do anything if
they ever implement sequential consistency, but I think they should,
and so I also think that one needs a way to implement sequential
consistency efficiently that can be combined with an efficient
prevention of Spectre. Note how speculative side channel attacks were
the final death sentence for TSX.
Concerning performance costs, whenever a conflict is detected, one way
of recovery would be to reset all cores to the architectural state of
the last snapshot before the conflict happened.
One can probably find
less draconian ways to ensure consistency, but I consider them to be
optimizations. One optimization might be to predict the conflict and
hold back the corresponding load such that no conflict happens and no
reset is necessary.
Another might be to find out which cores
communicate, and only reset those that have talked to each other since
the snapshot.
- anton
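
The last optimization mentioned above (only reset the cores that have talked to each other since the snapshot) amounts to a reachability computation. A minimal sketch, assuming communication is recorded as unordered pairs of core numbers and treated as symmetric, which is conservative; the function name and data layout are hypothetical.

```python
def cores_to_reset(conflicting_core, talked_since_snapshot):
    """Return all cores linked to conflicting_core by communication pairs."""
    graph = {}
    for a, b in talked_since_snapshot:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    seen = {conflicting_core}
    stack = [conflicting_core]
    while stack:
        for neighbour in graph.get(stack.pop(), ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


# Cores 0-1-2 communicated with each other; cores 3 and 4 only with each other.
pairs = [(0, 1), (1, 2), (3, 4)]
reset_set = cores_to_reset(0, pairs)
```

Here a conflict on core 0 resets cores 0, 1 and 2, while cores 3 and 4 keep running.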
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
Consider a GBOoO machine under sequential consistency, a LD which
can have its address calculated early cannot leave the CPU area
until all older stores currently in flight have left the CPU area.
This would dramatically add to L1 cache miss latency, and would
add moderately to L1 cache hit latency.
Can't the GBOoO send the LD out early/speculatively, and do a kind of
branch-recovery if that memory location is later modified in a way that
changes what the LD should have received?
Of course it can, although I would not call it "branch" recovery.
The person you cited without attribution (to protect the guilty?)
exhibits what I called the laziness of hardware designers: Instead of
thinking how to implement sequential consistency efficiently, they
think about rationalizations for not doing so.
Of course, that too comes with a cost (that of keeping track of all
those memory accesses that may have to be re-done), but it's not obvious
to me that it would necessarily be impractical.
Consider the case where the speculative LD is interfering with another CPU's
ATOMIC LL/SC sequence, grabbing write permission, and sending the SC off as
a failure. How does one recover that ??
So, yes, you can recover this CPU's state, but no, you cannot precisely
recover the other CPU's state.
Yes, the whole architectural state of the core would have to be reset.
The major challenge for using the classical implementation of
speculative execution (with register renaming, speculative store
buffer, and reorder buffer) is the worst-case latency of inter-core
communication. E.g., I see at
<https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
(multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
range, and I expect that if an architecture provides sequential
consistency, there are more incentives to bring that latency number
down. OTOH, with multi-socket machines, the latency tends to be
higher. Anyway, let's work with the 90ns number. That's about 500
cycles at the higher Zen5 clock rates, and is 4000 potential
instruction slots; the Zen5 ROB only has 448 entries, so one probably
will not extend the ROB approach to deal with sequential consistency.
A snapshot-and-recovery mechanism might work, based on epochs on the
order of the maximum communication latency.
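For anyone who wants to check the arithmetic above, here is a minimal sketch. The 90ns latency and the 448-entry ROB come from the post; the 5.5 GHz clock and the 8 instruction slots per cycle are assumptions, picked only because they reproduce the quoted "about 500 cycles" and "4000 potential instruction slots".

```python
# Back-of-the-envelope check of the latency numbers in the post.
LATENCY_NS = 90        # inter-core latency used in the post
ROB_ENTRIES = 448      # Zen5 ROB size quoted in the post
CLOCK_GHZ = 5.5        # ASSUMPTION: "higher Zen5 clock rates"
SLOTS_PER_CYCLE = 8    # ASSUMPTION: implied by 4000 slots / 500 cycles

cycles = LATENCY_NS * CLOCK_GHZ        # ns * (cycles per ns)
slots = cycles * SLOTS_PER_CYCLE       # instruction slots in flight
rob_coverage = ROB_ENTRIES / slots     # fraction of the window a ROB covers

print(f"{cycles:.0f} cycles, {slots:.0f} slots, ROB covers {rob_coverage:.0%}")
```

Under these assumptions the ROB covers only a bit over a tenth of the window, which is the gap the snapshot-and-recovery idea is meant to bridge.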
And that only recovers the state, not the intent of the state (above).
Then we have to think about how to prevent (not mitigate) Spectre for
such a mechanism; yes, hardware designers currently don't do anything
about preventing Spectre, and they probably will not do anything if
they ever implement sequential consistency, but I think they should,
and so I also think that one needs a way to implement sequential
consistency efficiently that can be combined with an efficient
prevention of Spectre. Note how speculative side channel attacks were
the final death sentence for TSX.
Given that ST to LD ordering is an inherent part of SC, a SC machine
will not be able to use as large an execution window as a Casually
Consistent machine.
MitchAlsup wrote:
Given that ST to LD ordering is an inherent part of SC, a SC machine
will not be able to use as large an execution window as a Casually Consistent machine.
Casually -> Causally ?
Terje
On 5/14/2026 10:22 AM, BGB wrote:
On 5/13/2026 2:02 AM, Lawrence D'Oliveiro wrote:
On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:
There are "architectures" like Power where "data memory" and
"instruction memory" are not coherent, even when they are the same
memory.
Also the Motorola 68040.
Upon updating instructions (e.g., from a JIT compiler), they require
that the modifying thread(s) write the lines back from the data
cache to a shared cache or main memory, and that the executing
threads invalidate these cache lines and flush their pipeline. I
think that that's a bad idea, not just because it exposes
microarchitectural concepts like cache and pipeline to the
architecture, and leads to unpredictable results in some usage
scenarios (see my signature), but also because the requirements on
the executing threads are extremely difficult to meet if the
executing threads run independently of the modifying thread(s). Or,
in short, IA-32 and AMD64 did the right architecture for that.
One technique for implementing lexical binding and functions as
first-class objects involves generating code at run-time. Some people
would immediately gasp and say "self-modifying code" as soon as I
mentioned this, even though the two are quite different things.
I think it's quite desirable that an architecture guarantees that an
(to coin a phrase) "instruction view" versus a "data view" of the same
memory location will never show different values.
Sometimes, there is a difference between "nice to have", vs "cost
effective".
It is nicer, say, to have I$/D$ coherence, and to not require explicit
flushing and invalidation. Or, say, to have caches that are implicitly
coherent between threads (Core A stores to a location, Core B loads
from that location, Core B sees what Core A stored).
The requirements to pull all this off in practice may add significant
costs; and also in ways where the performance cost of the coherence
mechanisms tends to scale upwards as core counts increase.
Say, for example, if one has coherent caches, software that depends on
the cache-coherent behavior, and much more than 2 or 4 cores, it is
not difficult to imagine scenarios where waiting on cache-coherence
mechanisms becomes a more significant cost than actual memory-transfer
bandwidth on the bus.
Say, typical scenario with incoherent caches:
  Core A Requests Line (for Write);
  Core B Requests Line (also for Write);
  L2 Cache sends a copy to A;
  L2 Cache sends a copy to B.
A and B now have incoherent copies.
Versus Say:
  Core A Requests Line (for Write);
  Core B Requests Line (also for Write);
  L2 Cache sends a copy to A;
  L2 Cache rejects B's Request;
  L2 Cache sends a request to A to write line back;
  Core A writes line back (flushing it locally);
  (Maybe) L2 signals to Core B that the line is now available.
  Core B Requests Line again (retry);
  L2 Cache sends a copy to B.
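The two scenarios above can be sketched as a toy model. This is only an illustration, not anyone's actual hardware: the class and function names are made up, a "line" is a single integer, and the reject/write-back/retry steps of the coherent case are folded into one call.

```python
class Core:
    def __init__(self):
        self.line = None              # None means the line is not cached

class IncoherentL2:
    """Hands a writable copy to every requester; nobody is recalled."""
    def __init__(self):
        self.value = 0

    def request_for_write(self, core):
        core.line = self.value

class CoherentL2:
    """Allows one writer at a time; recalls the line from the old owner."""
    def __init__(self):
        self.value = 0
        self.owner = None

    def request_for_write(self, core):
        if self.owner is not None and self.owner is not core:
            self.value = self.owner.line   # owner writes the line back
            self.owner.line = None         # ...and flushes it locally
        self.owner = core
        core.line = self.value

def two_writers(l2):
    """Core A and Core B each fetch the line for write and increment it."""
    a, b = Core(), Core()
    l2.request_for_write(a)
    a.line += 1
    l2.request_for_write(b)
    b.line += 1
    return a.line, b.line

print(two_writers(IncoherentL2()))   # both cores hold diverging copies
print(two_writers(CoherentL2()))     # A was flushed, B sees A's update
```

In the incoherent run both cores end up holding a 1, so one increment is lost when the lines are eventually written back. In the coherent run A no longer holds the line and B holds 2, at the price of the extra recall traffic.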
In my approach, I went with incoherent caches, but with a special
Volatile mechanism for some cases, say:
  Core A requests a line for Volatile Write;
  Core B Requests Line (also for Volatile Write);
  L2 Cache sends a copy to A;
  L2 Cache ignores B's Request (it can cycle the ring some more);
    L2 cache can track volatile lines and see that it is in-use.
  Core A writes back line and flushes local copy;
    L2 cache then marks the volatile access as complete.
  L2 Cache sends a copy to B
    Via the original request cycling around and hitting L2 again
  Core B writes back line and flushes local copy;
    L2 cache then marks the volatile access as complete.
Because volatile accesses flush the cached dirty lines immediately,
this means that there is a performance penalty, but these accesses can
remain coherent (but without the impact of trying to make all memory
coherent).
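A rough sketch of the Volatile sequence, under heavy simplification: the ring is a queue of pending requests, each loop iteration is one ring hop, and a core holds the line for a fixed number of hops before writing back. All names and the `hold_hops` parameter are hypothetical.

```python
from collections import deque

class VolatileL2:
    def __init__(self):
        self.value = 0
        self.in_use_by = None        # core currently holding the volatile line

    def try_grant(self, core):
        if self.in_use_by is not None:
            return False             # ignore the request; it cycles the ring
        self.in_use_by = core
        return True

    def write_back(self, core, new_value):
        assert self.in_use_by == core
        self.value = new_value
        self.in_use_by = None        # volatile access is now complete

def simulate(cores, hold_hops=2):
    """Each core does one volatile increment; returns final value and laps."""
    l2 = VolatileL2()
    ring = deque(cores)              # pending requests circulating the ring
    laps = {c: 0 for c in cores}     # how often a request was ignored
    holder, copy, remaining = None, None, 0
    while ring or holder is not None:
        if holder is not None:
            remaining -= 1
            if remaining == 0:       # write back and flush the local copy
                l2.write_back(holder, copy + 1)
                holder = None
        if ring:
            core = ring.popleft()
            if l2.try_grant(core):
                holder, copy, remaining = core, l2.value, hold_hops
            else:
                laps[core] += 1      # goes around the ring once more
                ring.append(core)
    return l2.value, laps

final_value, laps = simulate(["A", "B"])
print(final_value, laps)
```

With two cores both increments survive, and B's request goes around once while A holds the line. The cost shows up as the waiting laps and the forced write-back on every access.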
For something like an inter-processor JIT, this would alas still
require flushing the L1 caches in a way that is coordinated between
threads.
Normally, the mutex mechanism does not include I$ flushes, though one
possibility could be to have, say, a separate JIT mutex lock, where if
threads (upon trying to lock a mutex) see a JIT Sequence Number that
does not match the expected value for that mutex on that processor
core, it also triggers an I$ flush.
Say:
  JIT Lock:
    Flush Caches;
    Lock Mutex;
    Increment JIT Sequence Number (JSN).
  Do stuff;
  Flush Caches;
  Unlock Mutex;
    Flush Caches;
    Set mutex to unlocked.
  Lock Mutex (Normal):
    Flush Caches;
Huh? Mutex lock/unlock only need #LoadStore | #LoadLoad for acquire, and #LoadStore | #StoreStore for release. No #StoreLoad ordering.
    Lock Mutex;
    Check JSN against cores' current JSN;
      If mismatch, flush I$ and update core's JSN.
      Likely all via CPUID and a lookup table, not new arch.
  Do Stuff;
  Unlock Mutex:
    ...
  ...
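The JSN idea above, as a single-threaded toy model. Everything here is an assumption made for illustration: one mutex, a per-core "last seen JSN" field standing in for the lookup table, and counters in place of real cache flushes. It shows only the bookkeeping, not actual locking or memory ordering.

```python
class Core:
    def __init__(self):
        self.seen_jsn = 0            # last JSN this core observed for the mutex
        self.icache_flushes = 0      # counts simulated I$ flushes

class JitMutex:
    def __init__(self):
        self.locked = False
        self.jsn = 0                 # JIT Sequence Number

def jit_lock(mutex, core):
    # (a real implementation would flush caches here)
    assert not mutex.locked
    mutex.locked = True
    mutex.jsn += 1                   # announce that code may have changed

def normal_lock(mutex, core):
    # (a real implementation would flush caches here)
    assert not mutex.locked
    mutex.locked = True
    if core.seen_jsn != mutex.jsn:   # code was generated since the last lock
        core.icache_flushes += 1     # flush I$ ...
        core.seen_jsn = mutex.jsn    # ... and update the core's JSN

def unlock(mutex):
    # (a real implementation would flush caches here)
    mutex.locked = False

m, writer, runner = JitMutex(), Core(), Core()
normal_lock(m, runner); unlock(m)    # nothing generated yet: no I$ flush
jit_lock(m, writer); unlock(m)       # JIT emits code, bumps the JSN
normal_lock(m, runner); unlock(m)    # mismatch: one I$ flush
normal_lock(m, runner); unlock(m)    # JSN unchanged: no further flush
print(runner.icache_flushes)
```

The point of the sequence number is that ordinary mutex traffic pays for an I$ flush only after a JIT lock has actually happened.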
On 5/15/2026 4:13 PM, Chris M. Thomasson wrote:
-------------------
Why can't BGB clip unnecessary lines in the thread ???
On 5/14/2026 10:22 AM, BGB wrote:
On 5/13/2026 2:02 AM, Lawrence D'Oliveiro wrote:
On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:
Cache Flushing on Mutex Lock:
Anything that was in-memory is now written back;
Cache is ready to accept new (non-stale) data.
Cache Flush on Mutex Unlock:
Anything dirty in cache during time mutex was held is now written back;
...
This causes mutex lock/unlock to become a sort of memory ordering event.
It is sort of needed for a weak model to work for multi-core
multi-threading and not just end up exploding (and some practices will
still not work as they would on a core with stronger memory ordering and cache coherence).