• Should an ISA contain

    From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 8 23:34:21 2026
    From Newsgroup: comp.arch


    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri May 8 19:52:53 2026
    From Newsgroup: comp.arch

    On 2026-May-08 19:34, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Do you mean changes a single line from using write-invalidate
    protocol to write-update so any remote writes are forwarded
    by the home directory to the current line owner?
    In effect, blocks line movement but not updates.

    Or something else?

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    Not unprivileged, or applications could un-zero fields that had
    been intentionally zeroed out but are still held in cache.



  • From BGB@cr88192@gmail.com to comp.arch on Fri May 8 21:36:56 2026
    From Newsgroup: comp.arch

    On 5/8/2026 6:34 PM, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}


    At present, this sounds pretty close to my default/weak memory semantics.


    I was left to consider adding something for temporary line locking for
    volatile write operations for stronger ordering constraints, but this is
    not currently the default (so it is more just "hope that access timing
    works in one's favor" regarding volatile).

    In this case, when doing a volatile operation, it would first load the
    line with a locking flag, and then when the operation completes it
    either writes back or sends a message to release the line. When a line
    is locked, the L2 cache or similar will not give it over to another core
    that tries to request the same line for volatile access (but may still
    allow it for non-volatile access).


    Though, if multiple cores were to write to the same area of memory
    without explicit synchronization/flushing, then memory coherence issues
    could result. Current rule is mostly "don't do this".

    But, arguably, this is one of the faster/cheaper ways to do memory, even
    if the one most likely to result in unintended memory coherence issues.



    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    I wouldn't expect so. If it existed it would likely be very niche.




    Otherwise, most of my most recent efforts have been going more into
    working on my documentation, and finding/fixing some bugs in BGBCC.

    For a long time, there were demo desync issues in ROTT, for which I
    eventually found the cause:
    If a local array was initialized, it would only initialize the items
    from the initializer list and leave the rest of the array uninitialized.


    Say:
    int arr[10]={1,2,3,4,5};
    Would initialize items 0..4 but leave 5..9 holding garbage.

    I went and fixed this, and now ROTT seems to behave more consistently.
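For reference, ISO C requires that array elements not named in an initializer list be initialized as if they had static storage duration, which means zero for `int`. The following is a minimal sketch of a check for that behavior; the function name `tail_is_zeroed` is made up for illustration and is not part of BGBCC or ROTT.

```c
#include <stddef.h>

/* Returns 1 if the elements not listed in the initializer are zero,
   as ISO C requires; returns 0 if any of them holds garbage. */
int tail_is_zeroed(void)
{
    int arr[10] = {1, 2, 3, 4, 5};   /* only items 0..4 are listed */

    for (size_t i = 5; i < 10; i++) {
        if (arr[i] != 0)
            return 0;                /* items 5..9 must read as zero */
    }
    return 1;
}
```

A compiler with the bug described above could return 0 here, depending on what happened to be on the stack.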


    Ironically, this was just after working some on the memset builtin
    mechanism to be more effective (adding a memset slide similar to the
    existing memcpy slide).

    At present:
    1..95 bytes: Handled inline (reduced from 128);
    96..512 bytes: Uses a newly added memset slide;
    513+: call the generic memset function.

    Memset of 1..512 will use the slide if the value is non-zero.

    The memset slide is analogous to the memcpy slide, where it can encode a
    branch into somewhere in the slide (for coarse memset), and the location
    in the slide controls how much memory gets zeroed. Then, there are finer
    entry points, which generally fill in the final fractional bytes before
    branching into the main slide (for any bulk zero'ing).
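One possible reading of the size thresholds above, written as a selection function. This is only a sketch: the enum and function names are hypothetical, the real decision is made inside the compiler, and the treatment of non-zero fill values follows the single sentence in the post that mentions it.

```c
#include <stddef.h>

/* Hypothetical names for the three strategies listed in the post. */
enum memset_path { MEMSET_INLINE, MEMSET_SLIDE, MEMSET_CALL };

/* Pick a strategy for a memset of `size` bytes (assumed >= 1) with
   fill byte `value`, using the thresholds quoted in the post. */
enum memset_path choose_memset_path(size_t size, int value)
{
    if (size > 512)
        return MEMSET_CALL;      /* 513+: call the generic memset */
    if (value != 0)
        return MEMSET_SLIDE;     /* 1..512 with a non-zero value: slide */
    if (size >= 96)
        return MEMSET_SLIDE;     /* 96..512: slide */
    return MEMSET_INLINE;        /* 1..95: handled inline */
}
```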

    In this case, the built-in memset mechanism is being used to
    zero-initialize local arrays before setting up the other members.

    Compiler isn't currently smart enough to do bulk initialization though.
    char arr[16]="SOMESTRING";
    Will currently initialize the array using a series of byte stores (it
    sets each array member individually).
    Might be faster, arguably, to transform this case into the equivalent of
    doing an inline memcpy from the string literal, but this particular
    situation is infrequent.
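The transformation suggested above can be checked at the source level: initializing a `char` array from a shorter string literal is equivalent to zero-filling the array and then copying the literal, terminating NUL included. A small sketch follows; the function name is illustrative only.

```c
#include <string.h>

/* Returns 1 if string-literal initialization produces the same bytes as
   "zero fill, then memcpy the literal". */
int string_init_matches_memcpy(void)
{
    char a[16] = "SOMESTRING";       /* compiler-generated initialization */

    char b[16];
    memset(b, 0, sizeof b);          /* zero the whole array first */
    memcpy(b, "SOMESTRING", sizeof "SOMESTRING");  /* 11 bytes incl. NUL */

    return memcmp(a, b, sizeof a) == 0;
}
```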

    ...


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 9 18:44:55 2026
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    On 2026-May-08 19:34, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Do you mean changes a single line from using write-invalidate
    protocol to write-update so any remote writes are forwarded
    by the home directory to the current line owner?

    The line in an Exclusive or Modified state downgrades to a line
    in the Shared state {while the line remains resident}. If the
    line is no longer present, the instruction does nothing.

    In effect, blocks line movement but not updates.

    In a directory system, the directory knows that the line
    is shared in every cache it is present in.

    Or something else?

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    Not unprivileged or applications could un-zero fields that had
    been intentionally zeroed out but still held in cache.

    Allowing optimistic SW updates that can be reverted.
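The two proposed operations can be modeled on a single toy cache line with MESI states. This is a sketch under stated assumptions, not a description of any real implementation: the names `cache_allow` and `cache_discard` are invented here, and the model assumes that downgrading a Modified line writes the dirty data back to the next level, which the post does not spell out.

```c
#include <string.h>

enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

#define LINE_BYTES 64

struct line {
    enum mesi state;
    unsigned char data[LINE_BYTES];
};

/* {allow}: give up write permission but keep the line resident.
   Assumption: a Modified line is written back during the downgrade. */
void cache_allow(struct line *l, unsigned char *backing)
{
    if (l->state == MODIFIED)
        memcpy(backing, l->data, LINE_BYTES);
    if (l->state == MODIFIED || l->state == EXCLUSIVE)
        l->state = SHARED;
    /* INVALID (line not present) or SHARED: the instruction does nothing */
}

/* {Discard}: drop the line without writing it back, so the backing
   store keeps whatever value it held before the local updates. */
void cache_discard(struct line *l)
{
    l->state = INVALID;
}
```

In this model, a discard after a run of local stores reverts to the last value that reached the backing store, which is the "optimistic update" use mentioned above.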



  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sat May 9 23:27:56 2026
    From Newsgroup: comp.arch

    MitchAlsup [2026-05-08 23:34:21] wrote:
    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    I can't see how you can add that without introducing undefined
    behavior into your ISA, so that's a clear no for me.


    === Stefan
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun May 10 02:04:51 2026
    From Newsgroup: comp.arch

    On 2026-05-08 7:34 p.m., MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Trying to fathom what is going on with this. Is it an issue with keeping
    the cache coherent? Sounds like the D$ cache line was write-protected
    and now it is to be made writable?


    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    Q+ has several I$ and D$ cache operations wrapped up in a single
    instruction called 'CACHE'. I thought it best to put these in one
    instruction since they are infrequently used. The instruction has the
    same format as a load/store but the source/dest register is replaced by
    a command code. It uses the supplied address (if an address is needed).

    Turn cache on/off (D$ only)
    Invalidate entire cache (I$ or D$ or both)
    Invalidate cache line (I$ or D$ or both)
    Invalidate TLB
    Invalidate TLB entry


    Both the I$ and D$ caches can be invalidated with a single instruction.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 10 17:26:11 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-05-08 7:34 p.m., MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Trying to fathom what is going on with this. Is it an issue with keeping
    the cache coherent? Sounds like the D$ cache line was write-protected
    and now it is to be made writable?

    Consider the stack, and after adding a number to SP there are now
    a bunch of lines that are neither accessible nor containing a useful
    value.


    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    Q+ has several I$ and D$ cache operations wrapped up in a single
    instruction called 'CACHE'. I thought it best to put these in one
    instruction since they are infrequently used. The instruction has the
    same format as a load/store but the source/dest register is replaced by
    a command code. It uses the supplied address (if an address is needed).

    I have the same format, a memory reference that does not need a DST
    register specifier, so it becomes the OpCode.

    Turn cache on/off (D$ only)

    Why would you want the cache turned off??

    Invalidate entire cache (I$ or D$ or both)

    What if the cache is 1GB in size ??? This could take a long time.

    Invalidate cache line (I$ or D$ or both)
    Invalidate TLB

    With a coherent TLB this is unnecessary.

    Invalidate TLB entry


    Both the I$ and D$ caches can be invalidated with a single instruction.

    That may take a long time !
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 10 21:03:48 2026
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-05-08 7:34 p.m., MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Trying to fathom what is going on with this. Is it an issue with keeping
    the cache coherent? Sounds like the D$ cache line was write-protected
    and now it is to be made writable?

    Consider the stack, and after adding a number to SP there are now
    a bunch of lines that are neither accessible nor containing a useful
    value.


    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    Q+ has several I$ and D$ cache operations wrapped up in a single
    instruction called 'CACHE'. I thought it best to put these in one
    instruction since they are infrequently used. The instruction has the
    same format as a load/store but the source/dest register is replaced by
    a command code. It uses the supplied address (if an address is needed).

    I have the same format, a memory reference that does not need a DST
    register specifier, so it becomes the OpCode.

    I broke the instruction into 3 sub-groups::
    a) prefetch
    b) invalidate
    c) post-push

    Prefetch brings data closer to the CPU (caches) and provides a specifier
    to which cache {I$, D$, L2, L3} and whether one wants write permission
    (or not).

    Invalidate gets rid of cached data without writing back.

    Post-Push pushes modified data farther from the CPU caches.

    I launched this topic because I can put as many as 32 instructions in
    this sub-group, and after months of thinking, I only found 19 to put there.
    {yes, this violates the idea that the R in RISC should mean Reduced}

    Turn cache on/off (D$ only)

    Why would you want the cache turned off??

    Invalidate entire cache (I$ or D$ or both)

    What if the cache is 1GB in size ??? This could take a long time.

    Invalidate cache line (I$ or D$ or both)
    Invalidate TLB

    With a coherent TLB this is unnecessary.

    Invalidate TLB entry


    Both the I$ and D$ caches can be invalidated with a single instruction.

    That may take a long time !
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun May 10 18:45:34 2026
    From Newsgroup: comp.arch

    On 5/8/26 7:34 PM, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Intel added Cache Line WriteBack to memory to help with memory
    persistence (IIRC), which can be viewed as a reliability
    assertion (data will not be lost on power failure). There could
    also be performance reasons for pushing data outward while
    retaining it locally in a clean (shared) state; a remote request
    for the data might have lower latency (sourcing directly from
    L3, e.g., rather than an L3 coherence directory indicating where
    the data is and having to request the data from and a state
    change for the owner).

    For L1 to L2, cache line granularity might be too fine for
    'checkpointing' data from a merely parity protected L1 to an
    ECC-protected L2, though My 66000's VVM (with appropriate
    acceleration) might make substantial blocks fast/low overhead.
    On the other hand, assigning reliability factor at a page level
    might be awkward from PTE bit starvation, granularity
    inflexibility, and timing.

    Would this also ensure data presence in outer cache/memory on a
    clean line? E.g., if applied with an L2 target when L2 is non-
    inclusive (but possibly tag inclusive or at least snoop
    filtering) and the line is clean, would the line be written back
    if not present in L2?

    If one had a mode that disallowed escape of dirty lines, this
    might be used as a means to commit temporary, local values. This
    seems somewhat similar to a transactional memory mechanism,
    though transactional memory would typically distinguish old
    dirty lines (and perhaps clean ones) allowing them to be written
    back on replacement.

    I also wonder if this might be used to assist in determining
    what cache indexes have been replaced in L2. With lazy writeback
    the timing factors may be fuzzed more. My mind does not work
    well for this type of problem.

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    EricP pointed out a possible security issue if OS page zeroing
    could be thwarted. This could be worked around by having such
    page (or cache line) zeroing use special cases that act as if
    the zeroed memory was written back to memory. Forcing a
    distinction between explicit zeroing to provide a base value
    and zeroing to remove access to old data may facilitate software
    bugs when the difference is not recognized/remembered.

    This is similar to the problem that data cache block allocate
    had where old data (that the current thread was not permitted to
    read) of a possibly different address could be read. This was
    generally "solved" by defining allocation as a no-op on a
    cache hit and cache block zero on a miss. Since the benefit of
    doing nothing on a cache hit may not have been very beneficial
    (one might use a bit of cache bandwidth) and zeroing provides
    other benefits, block zeroing seems to be preferred (though I
    still like allocation).

    (This also is reminiscent of the Mill's unbacked memory, which
    was memory that reads as zero [providing an implicit data cache
    block zero] and has no physical memory address until evicted
    from last level cache. For highly temporary data, the data
    would never leave the cache; this could also allow cache as
    memory as long as no cache was forced to be written back. I do
    not know if unbacked memory allowed an application to release
    the memory, which would be like an invalidate without
    writeback.)

    Optimistic updates sound similar to transactional memory or
    versioned memory.

    With respect to Stefan Monnier's seeing this as undefined
    behavior, I think this might be presented similarly to memory
    ordering with a weaker memory model. I.e., the result of a read
    would still return a previously held value, but the "version"
    might be unexpected. The result is not "undefined" but timing
    dependent.

    I suspect one would have to be very careful about defining how
    such would interact with ESM (and perhaps other memory
    interaction methods).

    If the memory so cleared is thread local, I do not _think_
    there would be consistency issues. (I think IBM defined "local"
    memory transactions which supported speculation but not system
    atomicity.) Yet I feel that there might be uses for value
    checkpointing (versioning) where the address is shared by
    multiple threads.

    Obviously, hardware could in some cases interweave versions into
    a consistent order, but forcing software to handle the cases
    when hardware fails sounds problematic. Explicit checkpoints
    like with transactional memory, might be easier for programmers
    to use correctly than a fully flexible handling of speculation.
    On the other hand, finer-grained control could allow software
    to exploit knowledge that is not easily observed by (or
    communicated to) hardware.

    I think there are opportunities for versioned memory and/or
    other timing/speculation manipulation, but I do not have a clue
    about what interface should be presented to software. A RISC-
    like approach of cache line control instructions could provide
    flexibility, but the overhead for idiom recognition should also
    be considered.

    Modal operation (like transactional memory or ASM) simplifies
    some aspects and complicates others.

    I tend to favor complexity (flexibility), so my opinion is
    dangerous.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 11 00:33:55 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 5/8/26 7:34 PM, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Intel added Cache Line WriteBack to memory to help with memory
    persistence (IIRC), which can be viewed as a reliability
    assertion (data will not be lost on power failure).

    Thank you Paul ! An interesting rationale

    There could
    also be performance reasons for pushing data outward while
    retaining it locally in a clean (shared) state; a remote request
    for the data might have lower latency (sourcing directly from
    L3, e.g., rather than an L3 coherence directory indicating where
    the data is and having to request the data from and a state
    change for the owner).

    In a directory based caching system the directory is in a position
    that ANY shared cache line can be granted into the Exclusive state
    {minimizing transfer distance}.

    For L1 to L2, cache line granularity might be too fine for
    'checkpointing' data from a merely parity protected L1 to an
    ECC-protected L2, though My 66000's VVM (with appropriate
    acceleration) might make substantial blocks fast/low overhead.

    If you care about RAS, you cannot have write back L1 caches
    with that property.

    On the other hand, assigning reliability factor at a page level
    might be awkward from PTE bit starvation, granularity
    inflexibility, and timing.

    Would this also ensure data presence in outer cache/memory on a
    clean line? E.g., if applied with an L2 target when L2 is non-
    inclusive (but possibly tag inclusive or at least snoop
    filtering) and the line is clean, would the line be written back
    if not present in L2?

    A whole different can of worms.....

    If one had a mode that disallowed escape of dirty lines, this
    might be used as a means to commit temporary, local values. This
    seems somewhat similar to a transactional memory mechanism,
    though transactional memory would typically distinguish old
    dirty lines (and perhaps clean ones) allowing them to be written
    back on replacement.

    Luckily, I have a fundamental disagreement on ISA-extensions that
    provide SW the illusion that "lots of places" can be in intermediate
    states (i.e. TM).

    I also wonder if this might be used to assist in determining
    what cache indexes have been replaced in L2. With lazy writeback
    the timing factors may be fuzzed more. My mind does not work
    well for this type of problem.

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    EricP pointed out a possible security issue if OS page zeroing
    could be thwarted.

    Why is the OS zeroing a page that has already been mapped into
    unprivileged VAS ???

    Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
    and the interconnect is designed to transport the page zero in one
    transaction.

    This could be worked around by having such
    page (or cache line) zeroing use special cases that act as if
    the zeroed memory was written back to memory. Forcing a
    distinction between explicit zeroing to provide a base value
    and zeroing to remove access to old data may facilitate software
    bugs when the difference is not recognized/remembered.

    This is similar to the problem that data cache block allocate
    had where old data (that the current thread was not permitted to
    read) of a possibly different address could be read. This was
    generally "solved" by defining allocation as a no-op on a
    cache hit and cache block zero on a miss.

    VVM is allowed to 'allocate' cache lines (CI without Read) when
    a line boundary is crossed and more than 1 complete line remains
    in the loop--saving interconnect BW and coherence messages.

    Since the benefit of
    doing nothing on a cache hit may not have been very beneficial
    (one might use a bit of cache bandwidth) and zeroing provides
    other benefits, block zeroing seems to be preferred (though I
    still like allocation).

    (This also is reminiscent of the Mill's unbacked memory, which
    was memory that reads as zero [providing an implicit data cache
    block zero] and has no physical memory address until evicted
    from last level cache.

    I always liked that feature. I could not work it into a more
    conventional architecture, except for the 'known' program stack.

    For highly temporary data, the data
    would never leave the cache; this could also allow cache as
    memory as long as no cache was forced to be written back. I do
    not know if unbacked memory allowed an application to release
    the memory, which would be like an invalidate without
    writeback.)

    Known stack.

    Optimistic updates sound similar to transactional memory or
    versioned memory.

    With respect to Stefan Monnier's seeing this as undefined
    behavior, I think this might be presented similarly to memory
    ordering with a weaker memory model. I.e., the result of a read
    would still return a previously held value, but the "version"
    might be unexpected. The result is not "undefined" but timing
    dependent.

    SW would consider this undefined--SW depends (way too much) on
    a read returning exactly the last thing read.

    I suspect one would have to be very careful about defining how
    such would interact with ESM (and perhaps other memory
    interaction methods).

    Any kind of ATOMIC thing is WAY better to do it correct and
    SLOW than to take ANY chance of doing it wrong.

    If the memory so cleared is thread local, I do not _think_
    there would be consistency issues. (I think IBM defined "local"
    memory transactions which supported speculation but not system
    atomicity.) Yet I feel that there might be uses for value
    checkpointing (versioning) where the address is shared by
    multiple threads.

    Obviously, hardware could in some cases interweave versions into
    a consistent order, but forcing software to handle the cases
    when hardware fails sounds problematic. Explicit checkpoints
    like with transactional memory, might be easier for programmers
    to use correctly than a fully flexible handling of speculation.
    On the other hand, finer-grained control could allow software
    to exploit knowledge that is not easily observed by (or
    communicated to) hardware.

    I think there are opportunities for versioned memory and/or
    other timing/speculation manipulation, but I do not have a clue
    about what interface should be presented to software. A RISC-
    like approach of cache line control instructions could provide
    flexibility, but the overhead for idiom recognition should also
    be considered.

    Modal operation (like transactional memory or ASM) simplifies
    some aspects and complicates others.

    I tend to favor complexity (flexibility), so my opinion is
    dangerous.
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun May 10 21:43:42 2026
    From Newsgroup: comp.arch

    On 5/10/26 8:33 PM, MitchAlsup wrote:

    Paul Clayton <paaronclayton@gmail.com> posted:

    On 5/8/26 7:34 PM, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Intel added Cache Line WriteBack to memory to help with memory
    persistence (IIRC), which can be viewed as a reliability
    assertion (data will not be lost on power failure).

    Thank you Paul ! An interesting rationale

    Oops. I think the Optane-inspired instruction may have been
    CLFLUSHOPT, which may have been intended to allow faster
    controlled power down. (Even that guess may be wrong.)

    There could
    also be performance reasons for pushing data outward while
    retaining it locally in a clean (shared) state; a remote request
    for the data might have lower latency (sourcing directly from
    L3, e.g., rather than an L3 coherence directory indicating where
    the data is and having to request the data from and a state
    change for the owner).

    In a directory based caching system the directory is in a position
    that ANY shared cache line can be granted into the Exclusive state
    {minimizing transfer distance}.

    For L1 to L2, cache line granularity might be too fine for
    'checkpointing' data from a merely parity protected L1 to an
    ECC-protected L2, though My 66000's VVM (with appropriate
    acceleration) might make substantial blocks fast/low overhead.

    If you care about RAS, you cannot have write back L1 caches
    with that property.

    Different customers may have different preferences. I seem to
    recall that Intel offered the option to replicate parity-
    protected L1 data cache to allow recovery (at the cost of half
    the capacity).

    It might be possible to pay the area penalty for ECC but have
    modal configuration of whether ECC or parity is used. If the
    SRAM cells were made less reliable to improve density (I think I
    read that Intel used more reliable L1 cells to mitigate the
    parity-only effect), then turning off ECC would result in more
    frequent flakiness but one could avoid read-modify-write on sub-
    word accesses. Design choices that improve reliability will
    tend to hurt performance; acceptable reliability can be fairly
    low when software (and no-ECC DRAM) will provide more failures.
    (With newer DRAM standards seeming likely to add ECC, software
    and CPU memory system reliability may become more important.)

    On the other hand, assigning reliability factor at a page level
    might be awkward from PTE bit starvation, granularity
    inflexibility, and timing.

    Would this also ensure data presence in outer cache/memory on a
    clean line? E.g., if applied with an L2 target when L2 is non-
    inclusive (but possibly tag inclusive or at least snoop
    filtering) and the line is clean, would the line be written back
    if not present in L2?

    A whole different can of worms.....

    If one had a mode that disallowed escape of dirty lines, this
    might be used as a means to commit temporary, local values. This
    seems somewhat similar to a transactional memory mechanism,
    though transactional memory would typically distinguish old
    dirty lines (and perhaps clean ones) allowing them to be written
    back on replacement.

    Luckily, I have a fundamental disagreement on ISA-extensions that
    provide SW the illusion that "lots of places" can be in intermediate
    states (i.e. TM).

    I also wonder if this might be used to assist in determining
    what cache indexes have been replaced in L2. With lazy writeback
    the timing factors may be fuzzed more. My mind does not work
    well for this type of problem.

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    EricP pointed out a possible security issue if OS page zeroing
    could be thwarted.

    Why is the OS zeroing a page that has already been mapped into
    unprivileged VAS ???

    The OS zeros the physical page before assigning it to the new
    context (or more likely assigns a zero page and does copy on
    write, which is just zeroing the page). If the zeroed data is
    still present only in cache, an invalidate without writeback would
    leave the old data in main memory, accessible by the new context.
    Forcing the OS to writeback (or writethrough at zeroing) such
    zeroing to memory would be bad for performance (especially with
    copy-on-write when the data will be dirtied again).
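The leak scenario can be shown with a toy model of one page that has a cached copy and a backing copy. All names here (`toy_system`, `os_zero_page`, `discard_page`, `load_byte`) are invented for illustration, and the model assumes a write-back cache in which the OS zeroing lands in the cache only.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 64   /* toy "page" the size of one cache line */

struct toy_system {
    unsigned char memory[PAGE_BYTES];  /* backing DRAM contents */
    unsigned char cache[PAGE_BYTES];   /* cached copy of the page */
    int cached;                        /* 1 if the cache holds the page */
    int dirty;                         /* 1 if the cached copy is newer */
};

/* OS zeroes the page; with a write-back cache the zeros stay in cache. */
void os_zero_page(struct toy_system *s)
{
    memset(s->cache, 0, PAGE_BYTES);
    s->cached = 1;
    s->dirty = 1;
}

/* Unprivileged invalidate-without-writeback: the dirty zeros are lost. */
void discard_page(struct toy_system *s)
{
    s->cached = 0;
    s->dirty = 0;
}

/* A load hits the cache if the page is present, else reads memory. */
unsigned char load_byte(const struct toy_system *s, size_t i)
{
    return s->cached ? s->cache[i] : s->memory[i];
}
```

If `memory` still holds the previous owner's bytes, a load after `os_zero_page` reads zero, but a load after `discard_page` reads the old bytes again.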

    Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
    and the interconnect is designed to transport the page zero in one
    transaction.

    This is more flexible than having cache line and page clearing
    instructions.

    This could be worked around by having such
    page (or cache line) zeroing use special cases that act as if
    the zeroed memory was written back to memory. Forcing a
    distinction between explicit zeroing to provide a base value
    and zeroing to remove access to old data may facilitate software
    bugs when the difference is not recognized/remembered.

    This is similar to the problem that data cache block allocate
    had where old data (that the current thread was not permitted to
    read) of a possibly different address could be read. This was
    generally "solved" by defining allocation as a no-op on a
    cache hit and cache block zero on a miss.

    VVM is allowed to 'allocate' cache lines (CI without Read) when
    a line boundary is crossed and more than 1 complete line remains
    in the loop--saving interconnect BW and coherence messages.

    That may be the most common use for avoiding read-to-own, but it
    is not the only use.

    Since the benefit of
    doing nothing on a cache hit may not have been very beneficial
    (one might use a bit of cache bandwidth) and zeroing provides
    other benefits, block zeroing seems to be preferred (though I
    still like allocation).

    (This also is reminiscent of the Mill's unbacked memory, which
    was memory that reads as zero [providing an implicit data cache
    block zero] and has no physical memory address until evicted
    from last level cache.

    I always liked that feature. I could not work it into a more
    conventional architecture, except for the 'known' program stack.

    I think one could by defining a read-only physical page as read-
    as-zero and "allocate" on write. One would have to have a
    background process to maintain a free list (perhaps similar to a
    hypervisor with extremely limited functionality) and an
    interface for that free list manager to request new pages (so
    conventional OSes would need more porting effort).

    One could also use placeholder physical page addresses (a.k.a.,
    shadow memory; "Increasing TLB Reach Using Superpages Backed by
    Shadow Memory", Mark Swanson, Leigh Stoller, and John Carter,
    1998; the paper uses a large virtual physical address page,
    whose address is treated as physical by caches and TLBs and is
    translated before cache eviction into multiple smaller physical
    pages), effectively introducing another layer of address
    virtualization for a modest subset of the physical address
    space. This would allow "physical addresses" in the caches with
    a TLB for the relatively few unbacked pages to allow them to be
    in cache without having a memory address allocated.

    (Having this additional TLB layer near the memory controller or
    shared cache slice would (I think) either require replication or
    page-size constraints on address distribution.)

    Avoiding software (OS) copy-on-write of zero pages might not be
    a significant benefit given My 66000's relatively fast context
    switches, but other ISAs might benefit just from low cost
    secure page zeroing copy-on-write.

    (Large pages might be a problem. The Mill had the advantage of
    supporting a larger variety of page sizes than My 66000. Having
    hardware change address translations to merge pages would also
    violate traditional OS assumptions.)

    There might be tension in deciding when a page zeroing
    instruction should write to the cache or write to the TLB. (The
    Mill's use of virtually addressed caches and delayed TLB helped,
    but I do not think that was essential.)

    Maybe I am missing something or maybe the issues I mentioned are
    larger than I suspected.

    For highly temporary data, the data
    would never leave the cache; this could also allow cache as
    memory as long as no cache was forced to be written back. I do
    not know if unbacked memory allowed an application to release
    the memory, which would be like an invalidate without
    writeback.)

    Known stack.

    For an activation record stack, this is somewhat straightforward
    (and such also presents security-improving opportunities, which
    you mentioned My 66000 also added).

    Optimistic updates sound similar to transactional memory or
    versioned memory.

    With respect to Stefan Monnier's seeing this as undefined
    behavior, I think this might be presented similarly to memory
    ordering with a weaker memory model. I.e., the result of a read
    would still return a previously held value, but the "version"
    might be unexpected. The result is not "undefined" but timing
    dependent.

    SW would consider this undefined--SW depends (way too much) on
    a read returning exactly the last thing read.

    I suspect one would have to be very careful about defining how
    such would interact with ESM (and perhaps other memory
    interaction methods).

    Any kind of ATOMIC thing is WAY better to do it correct and
    SLOW than to take ANY chance of doing it wrong.

    Yes, though slow can also motivate incorrect software. Being
    able to clearly communicate the dangers also seems important
    (which argues for simplicity/orthogonality).

    If the memory so cleared is thread local, I do not _think_
    there would be consistency issues. (I think IBM defined "local"
    memory transactions which supported speculation but not system
    atomicity.) Yet I feel that there might be uses for value
    checkpointing (versioning) where the address is shared by
    multiple threads.

    Obviously, hardware could in some cases interweave versions into
    a consistent order, but forcing software to handle the cases
    when hardware fails sounds problematic. Explicit checkpoints,
    like with transactional memory, might be easier for programmers
    to use correctly than a fully flexible handling of speculation.
    On the other hand, finer-grained control could allow software
    to exploit knowledge that is not easily observed by (or
    communicated to) hardware.

    I think there are opportunities for versioned memory and/or
    other timing/speculation manipulation, but I do not have a clue
    about what interface should be presented to software. A RISC-
    like approach of cache line control instructions could provide
    flexibility, but the overhead for idiom recognition should also
    be considered.

    Modal operation (like transactional memory or ASM) simplifies
    some aspects and complicates others.

    I tend to favor complexity (flexibility), so my opinion is
    dangerous.

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 11 06:07:42 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Write permission is usually done at page granularity.

    But maybe your question is: Should there be an instruction that is a
    hint that a 64B block is not going to be written to in the foreseeable
    future (so a microarchitecture with 64B cache lines would write that
    line back to main memory now instead of later)?

    The answer to that question is: maybe. The case for such an
    instruction seems weaker than for prefetch instructions, and the
    results from using prefetch instructions have been disappointing.

    There are "architectures" like Power where "data memory" and
    "instruction memory" are not coherent, even when they are the same
    memory. Upon updating instructions (e.g., from a JIT compiler), they
    require that the modifying thread(s) write the lines back from the
    data cache to a shared cache or main memory, and that the executing
    threads invalidate these cache lines and flush their pipeline. I
    think that that's a bad idea, not just because it exposes
    microarchitectural concepts like cache and pipeline to the
    architecture, and leads to unpredictable results in some usage
    scenarios (see my signature), but also because the requirements on
    the executing threads are extremely difficult to meet if the
    executing threads run independently of the modifying thread(s). Or,
    in short, IA-32 and AMD64 did the right architecture for that.

    In any case, "architectures" with the deficiency described above
    necessarily have instructions that write data cache lines back to
    shared storage. In the case of Power this instruction is dcbst.
    While this instruction is documented, it refers to non-architected and
    implementation-dependent concepts like "cache line", i.e., it is not a
    properly architected instruction, and the cache synchronization code
    on Power is implementation-dependent.

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    No! What would the architectural meaning of such an instruction be?
    "Maybe restore some previous contents of this memory"? Does not sound
    useful at all. Not everything that can be done should be done.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 11 06:39:42 2026
    From Newsgroup: comp.arch

    Paul Clayton <paaronclayton@gmail.com> writes:
    With respect to Stefan Monnier's seeing this as undefined
    behavior, I think this might be presented similarly to memory
    ordering with a weaker memory model.

    Is that supposed to be a defense of the discard instruction? It
    isn't. Descriptions of weak memory models are full of "undefined
    behaviour".

    Weak memory models are a bad idea like many supercomputer ideas (e.g.,
    division with wrong results, or imprecise exceptions), but unlike
    other bad ideas they have made it almost to general-purpose computing.
    And they are exactly bad ideas because:

    1) The unpredictable results if their restrictions are not heeded.

    2) The difficulty of heeding these restrictions by adding close to the
    minimum necessary strongifying instructions (e.g., memory barriers and
    atomic instructions). In particular, thanks to 1 there is no way to
    check the correctness of the placement of these instructions by
    testing.

    3) The extreme performance cost of the strongifying instructions, so
    when you use some simple scheme that guarantees correctness (e.g.,
    inserting a write barrier after every store and a read barrier before
    every load), the resulting program is extremely slow.
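
    The unpredictability in point 1 can be illustrated with the classic
    store-buffering litmus test. The sketch below is a toy enumerator,
    not a model of any particular architecture: the function outcomes()
    and its "drain" event (the moment a buffered store becomes globally
    visible) are assumptions made for illustration. It lists every
    result of T0: x = 1; r0 = y and T1: y = 1; r1 = x, first under
    sequential consistency and then with per-thread store buffers.

```python
from itertools import permutations


def outcomes(store_buffers: bool):
    """Enumerate (r0, r1) for  T0: x = 1; r0 = y  and  T1: y = 1; r1 = x."""
    ops = ("store", "load", "drain") if store_buffers else ("store", "load")
    events = [(t, op) for t in (0, 1) for op in ops]
    results = set()
    for order in permutations(events):
        pos = {event: i for i, event in enumerate(order)}
        # Program order: each thread's store precedes its load, and a
        # buffered store cannot drain before it has been issued.
        if not all(pos[(t, "store")] < pos[(t, op)]
                   for t in (0, 1) for op in ops[1:]):
            continue
        mem = {"x": 0, "y": 0}
        regs = {}
        for t, op in order:
            mine, other = ("x", "y") if t == 0 else ("y", "x")
            if op == "load":
                # Each thread loads the *other* variable, so no forwarding
                # from its own store buffer is needed in this test.
                regs[t] = mem[other]
            elif op == "drain" or not store_buffers:
                mem[mine] = 1  # the store becomes globally visible here
        results.add((regs[0], regs[1]))
    return results
```

    Without store buffers the result r0 == r1 == 0 never appears; with
    them it does, and only on some interleavings, which is why testing
    cannot establish that the barriers are placed correctly.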

    In the case of weak memory models the hardware designers have the
    excuse that they are too lazy to implement a strong memory model
    efficiently (although they typically frame it by showing the
    inefficiency of some lazy implementation of a strong memory model),
    and that not that big parts of the software actually communicate with
    other threads.

    But I think that the chilling effects of difficulties in inter-thread
    communication have kept that back. But difficulties already exist
    with sequential consistency; transactional memory looked like it might
    come to the rescue, but after the hype from about 20 years ago is now
    in the valley of disappointment.

    I.e., the result of a read
    would still return a previously held value, but the "version"
    might be unexpected. The result is not "undefined" but timing
    dependent.

    "Undefined behaviour" typically originally means something where the
    people specifying it have a good idea what can happen, but where it is
    too complex and has too little benefit to actually specify it. E.g.,
    an out-of-bounds access to an object resulted in actually accessing
    that memory, but what is there is not specified and is
    implementation-dependent, and it might be that the address is not
    accessible, and results in a trap (and it seems to me that everything
    that may result in a trap on some machines has been labeled
    "undefined behaviour" in C; note the fine differences between
    implementation-defined and undefined behaviour for shifts in C). Only
    later compiler shenanigans like "optimizing" a loop that performs an
    out-of-bounds access into an endless loop were introduced and
    justified with the specified undefined behaviour.

    If MY66000 ever becomes a popular architecture with many
    implementations, and if discard's effect has been specified as making
    any read access before a write access to any memory in the cache line
    "undefined behaviour", we may see implementations that implement
    discard with effects that do not reflect your expectations at all.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 11 07:29:32 2026
    From Newsgroup: comp.arch

    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/10/26 8:33 PM, MitchAlsup wrote:
    [...]
    The OS zeros the physical page before assigning it to the new
    context (or more likely assigns a zero page and does copy on
    write, which is just zeroing the page).

    Assigning a zero page for reading is a good idea. Copying that page
    on writing appears inefficient to me, because it needs to read the
    zero page into cache and write it to a newly allocated page.

    A better approach is to do just the writes. I think that zeroing the
    page on demand is a good approach, because then it is already in the
    D-cache, but AFAIK Linux actually zeros physical pages ahead of time
    typically on a separate (otherwise idle) core, and just maps one of
    those pages to the virtual page that needs to be written to. I wonder
    why Linux does that.

    Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
    and the interconnect is designed to transport the page zero in one
    transaction.

    This is more flexible than having cache line and page clearing
    instructions.

    In what way is it more flexible? It is a page-clearing instruction.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon May 11 14:27:53 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-05-08 7:34 p.m., MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Trying to fathom what is going on with this. Is it an issue with keeping
    the cache coherent? Sounds like the D$ cache line was write-protected
    and now it is to be made writable?

    Consider the stack, and after adding a number to SP there are now
    a bunch of lines that are neither accessible nor containing a useful
    value.

    Seems to me that the code will certainly call another function
    almost immediately that will simply reuse the already
    present stack cache line; prematurely invalidating it will
    actually slow things down.

    I see no benefit in invalidating it pre-emptively.

    It would certainly cause problems for code that intentionally
    uses the soi-disant "free" stack space in legal but unusual ways.


  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon May 11 14:32:17 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/10/26 8:33 PM, MitchAlsup wrote:
    [...]
    The OS zeros the physical page before assigning it to the new
    context (or more likely assigns a zero page and does copy on
    write, which is just zeroing the page).

    Assigning a zero page for reading is a good idea. Copying that page
    on writing appears inefficient to me, because it needs to read the
    zero page into cache and write it to a newly allocated page.

    Armv8 has the DC ZVA instruction to zero blocks of cache,
    specifically for this purpose.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 11 18:00:29 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 5/10/26 8:33 PM, MitchAlsup wrote:
    --------------------------
    If you care about RAS, you cannot have write back L1 caches
    with that property.

    Different customers may have different preferences. I seem to
    recall that Intel offered the option to replicate parity-
    protected L1 data cache to allow recovery (at the cost of half
    the capacity).

    Yet, the design team can afford 1 design and the FAB can afford
    1 mask set--regardless of the number of customers.


    ---------------
    vVM is allowed to 'allocate' cache lines (CI without Read) when
    a line boundary is crossed and more than 1 complete line remains
    in the loop--saving interconnect BW and coherence messages.

    That may be the most common use for avoiding read-to-own, but it
    is not the only use.

    It is the only one easy to recognize.
    ------------
    Any kind of ATOMIC thing is WAY better to do it correct and
    SLOW than to take ANY chance of doing it wrong.

    Yes, though slow can also motivate incorrect software. Being
    able to clearly communicate the dangers also seems important
    (which argues for simplicity/orthogonality).

    As do most CPU things.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 11 18:09:59 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/10/26 8:33 PM, MitchAlsup wrote:
    [...]
    The OS zeros the physical page before assigning it to the new
    context (or more likely assigns a zero page and does copy on
    write, which is just zeroing the page).

    Assigning a zero page for reading is a good idea. Copying that page
    on writing appears inefficient to me, because it needs to read the
    zero page into cache and write it to a newly allocated page.

    A better approach is to do just the writes. I think that zeroing the
    page on demand is a good approach, because then it is already in the
    D-cache,

    Because there is a pool of already zeroed pages (for COW) it may be in
    some other CPU's cache.

    but AFAIK Linux actually zeros physical pages ahead of time
    typically on a separate (otherwise idle) core, and just maps one of
    those pages to the virtual page that needs to be written to. I wonder
    why Linux does that.

    Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
    and the interconnect is designed to transport the page zero in one
    transaction.

    This is more flexible than having cache line and page clearing
    instructions.

    In what way is it more flexible? It is a page-clearing instruction.

    It is a memset() function as an instruction. Any size is acceptable.
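
    How such an arbitrary-size fill might break down into cache lines
    can be sketched with plain address arithmetic. Nothing here is taken
    from the My 66000 definition: the helper split_fill and the 64-byte
    line size are assumptions for illustration. The point is only that
    lines fully covered by the fill need no read of their old contents,
    while a partial head or tail must be merged with existing data.

```python
def split_fill(addr, size, line=64):
    """Split [addr, addr + size) into partial byte ranges and full lines.

    Returns (partial, full): `partial` is a list of (start, length) ranges
    that cover only part of a line; `full` is a list of base addresses of
    lines that the fill overwrites completely.
    """
    end = addr + size
    first_full = -(-addr // line) * line  # addr rounded up to a line boundary
    last_full = end // line * line        # end rounded down to a line boundary
    if first_full >= last_full:
        # No line is covered completely.
        partial = [(addr, size)] if size else []
        return partial, []
    partial = []
    if addr < first_full:
        partial.append((addr, first_full - addr))  # head fragment
    if last_full < end:
        partial.append((last_full, end - last_full))  # tail fragment
    full = list(range(first_full, last_full, line))
    return partial, full
```

    For a 4096-byte aligned page, split_fill(page, 4096) yields no
    partial ranges and 64 full lines; an unaligned 200-byte fill yields
    a head fragment, a tail fragment, and the full lines in between.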

    - anton
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 11 18:18:27 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-05-08 7:34 p.m., MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Trying to fathom what is going on with this. Is it an issue with keeping
    the cache coherent? Sounds like the D$ cache line was write-protected
    and now it is to be made writable?

    Consider the stack, and after adding a number to SP there are now
    a bunch of lines that are neither accessible nor containing a useful
    value.

    Seems to me that the code will certainly call another function
    almost immediately that will simply reuse the already
    present stack cache line; prematurely invalidating it will
    actually slow things down.

    I did not invalidate those lines, I just marked them that if they are
    replaced before becoming "in stack" again they can be dropped without
    being pushed farther out the memory hierarchy.
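
    A minimal sketch of that policy, under stated assumptions: the class
    StackAwareCache, the downward-growing stack, the stack_limit bound,
    and the LRU replacement are all hypothetical, and the sketch decides
    "not in stack" by comparing against SP at eviction time, whereas the
    thread does not say whether hardware keeps a per-line mark or does
    such a comparison. Lines stay resident after the return; they are
    merely dropped instead of written back if they get evicted while
    below SP.

```python
LINE = 64  # assumed cache line size in bytes


class StackAwareCache:
    def __init__(self, capacity, sp, stack_limit):
        self.capacity = capacity
        self.sp = sp                    # current stack pointer
        self.stack_limit = stack_limit  # lowest address of the stack region
        self.lines = {}                 # base -> dirty flag, oldest first
        self.writebacks = []            # lines pushed out the hierarchy

    def _is_dead(self, base):
        # A stack line wholly below SP holds no live data.
        return self.stack_limit <= base and base + LINE <= self.sp

    def store(self, base):
        self.lines.pop(base, None)      # re-touch: move to newest position
        if len(self.lines) >= self.capacity:
            victim = next(iter(self.lines))  # oldest line is evicted
            dirty = self.lines.pop(victim)
            if dirty and not self._is_dead(victim):
                self.writebacks.append(victim)
        self.lines[base] = True         # stored lines are dirty


def run(returned):
    cache = StackAwareCache(capacity=2, sp=0x1000, stack_limit=0x0800)
    cache.sp = 0xF80                    # call: allocate a 128-byte frame
    cache.store(0xF80)
    cache.store(0xFC0)
    if returned:
        cache.sp = 0x1000               # return: the frame is no longer in stack
    cache.store(0x8000)                 # unrelated traffic evicts both lines
    cache.store(0x8040)
    return cache.writebacks
```

    With returned=True no writebacks occur; with returned=False both
    frame lines are written back. A new call that lowers SP again before
    eviction simply reuses the resident lines.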

    I see no benefit in invalidating it pre-emptively.

    It would certainly cause problems for code that intentionally
    uses the soi-disant "free" stack space in legal but unusual ways.


  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Wed May 13 07:02:03 2026
    From Newsgroup: comp.arch

    On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

    There are "architectures" like Power where "data memory" and
    "instruction memory" are not coherent, even when they are the same
    memory.

    Also the Motorola 68040.

    Upon updating instructions (e.g., from a JIT compiler), they require
    that the modifying thread(s) write the lines back from the data
    cache to a shared cache or main memory, and that the executing
    threads invalidate these cache lines and flush their pipeline. I
    think that that's a bad idea, not just because it exposes
    microarchitectural concepts like cache and pipeline to the
    architecture, and leads to unpredictable results in some usage
    scenarios (see my signature), but also because the requirements on
    the executing threads are extremely difficult to meet if the
    executing threads run independently of the modifying thread(s). Or,
    in short, IA-32 and AMD64 did the right architecture for that.

    One technique for implementing lexical binding and functions as
    first-class objects involves generating code at run-time. Some people
    would immediately gasp and say "self-modifying code" as soon as I
    mentioned this, even though the two are quite different things.

    I think it's quite desirable that an architecture guarantees that an
    (to coin a phrase) "instruction view" versus a "data view" of the
    same memory location will never show different values.
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Wed May 13 18:46:01 2026
    From Newsgroup: comp.arch

    On 5/11/26 2:39 AM, Anton Ertl wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    With respect to Stefan Monnier's seeing this as undefined
    behavior, I think this might be presented similarly to memory
    ordering with a weaker memory model.

    Is that supposed to be a defense of the discard instruction? It
    isn't. Descriptions of weak memory models are full of "undefined
    behaviour".

    Yes, or at least a statement that if a weaker memory model is
    accepted (which I think Mitch chose for My 66000) then such
    undefined behavior could be acceptable.

    Weak memory models are a bad idea like many supercomputer ideas (e.g.,
    division with wrong results, or imprecise exceptions), but unlike
    other bad ideas they have made it almost to general-purpose computing.
    And they are exactly bad ideas because:

    1) The unpredictable results if their restrictions are not heeded.

    Is this not also true for sequential consistency without locks
    (i.e., with race conditions)? If locks (or lock-free methods,
    which are often challenging to design) are acceptable
    "restrictions", then why not allow memory barriers?

    If locks do not implement barriers themselves in a given
    programming language/library (which has performance costs so a
    "zero abstraction cost" language like C++ might want to avoid
    such), this would presumably add significant correctness
    risk/complexity.

    2) The difficulty of heeding these restrictions by adding close to the
    minimum necessary strongifying instructions (e.g., memory barriers and
    atomic instructions). In particular, thanks to 1 there is no way to
    check the correctness of the placement of these instructions by
    testing.

    I think hardware could provide detection of at least some race
    conditions. This could also help with debugging improper lock
    use.

    Using the minimal memory barrier strengths seems like using
    lock-free programming; there might be some cases that are
    trivial and safe, but I would tend to push such toward "experts
    only" and the performance gains would typically not be that
    great if "moderate" effort is made in hardware.

    Not being a hardware person, I do not know how expensive (in
    power-performance-area or design complexity) sequential
    consistency is relative to a more relaxed model with significant
    optimization of barriers. I _suspect_ such depends on the
    communication between threads (cores/caches), perhaps especially
    the scale of the system.

    3) The extreme performance cost of the strongifying instructions, so
    when you use some simple scheme that guarantees correctness (e.g.,
    inserting a write barrier after every store and a read barrier before
    every load), the resulting program is extremely slow.

    This assumes that barriers are expensive. This is similar to
    assuming that context switches are expensive; it is a historical
    fact but not a technical necessity.

    *If* there is very little advantage to weaker consistency with
    optimized barriers, then using a simpler abstraction would
    probably be a better choice.

    In the case of weak memory models the hardware designers have the
    excuse that they are too lazy to implement a strong memory model
    efficiently (although they typically frame it by showing the
    inefficiency of some lazy implementation of a strong memory model),
    and that not that big parts of the software actually communicate with
    other threads.

    While not wanting to do hard and underappreciated work is
    understandable and humans will tend to justify their choices to
    themselves and others using less than perfect argumentation, I
    think the above statement is too hostile to hardware designers.

    There may well be a "laziness" in not researching what the costs
    are in current software, how much a weaker memory model
    discourages new useful software techniques, and how complex and
    effective hardware mechanisms for providing stronger consistency
    models are. But laziness is generally a combination of fear,
    fatigue, and lack of motivation; adding complexity increases
    schedule risk (fear) and without customers pushing for a feature
    (and a belief that the feature can be delivered within time and
    financial budgets) motivation will be lower. I would not be
    surprised that there is also a fatigue factor.

    Calling a person lazy seems generally ineffective in reducing
    fear or increasing motivation much less reducing fatigue. I
    suspect this strategy is also not very effective at an
    organization level.

    On a personal level, one can even know that fears are
    irrational, that most of the fatigue is psychologically induced,
    and that getting something done is usually good yet be unable to
    act.

    Incremental exposure tends to help with fears - how can hardware
    designers incrementally add complexity with respect to memory
    model [my guess would be by working on reducing the cost of
    barriers in many cases while still requiring them]? Fatigue can
    be countered by limiting analysis and receiving fresh
    perspectives; maybe at the end of the AI hype cycle there could be
    a burst of activity in multithreaded programming and maybe
    academia or "for profit" research organizations can ease the
    perception of risk that encourages overanalysis (cheap capital
    might also help, so again after the AI hype cycle). Research and
    development could also address motivation; while it has been
    known for decades that the cost of sequential consistency could
    be substantially reduced by speculative out-of-order execution,
    the tradeoffs change with time (both in hardware complexity and
    type of software encouraging hardware development).

    But I think that the chilling effects of difficulties in inter-thread
    communication have kept that back. But difficulties already exist
    with sequential consistency; transactional memory looked like it might
    come to the rescue, but after the hype from about 20 years ago is now
    in the valley of disappointment.

    Interestingly (to me at least) the hardware behind transactional
    memory can also be used to detect when a thread is accessing
    (possibly) lock-protected memory without acquiring a lock.

    Even with the end of Dennard scaling and other hardware factors
    (as well as theoretical ILP limits) reducing single-thread
    performance improvement, it is disappointing that
    multithreaded programs have not become more common (except that
    many uses have good enough performance, which is not
    disappointing).

    I suspect transactional memory was sold too much as a trivial
    solution and some issues were not well understood. I also
    suspect that a more gradual adoption with long-range plans for
    extension (which plans could be adjusted with gained experience)
    might have been more effective.

    (The issue I have with limited optimistic concurrency mechanisms
    like AMD's Advanced Synchronization Facility and My 66000's
    Exotic Synchronization Mechanism is not the initial limits but
    that there seems to be little presentation of an interface that
    can be extended. Just as such avoid requiring new instructions
    for every new simpler atomic operation, a broader interface
    conception might allow extension without adding new
    instructions. Of course, just as early broad software
    abstractions present the risk of choosing the wrong abstraction
    from lack of experience, having too many exceptional cases, and
    delaying release, an ISA can be designed with excessive
    flexibility that is not exploited much later and has immediate
    costs.)

    I.e., the result of a read
    would still return a previously held value, but the "version"
    might be unexpected. The result is not "undefined" but timing
    dependent.

    "Undefined behaviour" typically originally means something where the
    people specifying it have a good idea what can happen, but where it is
    too complex and has too little benefit to actually specify it. E.g.,
    an out-of-bounds access to an object resulted in actually accessing
    that memory, but what is there is not specified and is
    implementation-dependent, and it might be that the address is not
    accessible, and results in a trap (and it seems to me that everything
    that may result in a trap on some machines has been labeled
    "undefined behaviour" in C; note the fine differences between
    implementation-defined and undefined behaviour for shifts in C). Only
    later compiler shenanigans like "optimizing" a loop that performs an
    out-of-bounds access into an endless loop were introduced and
    justified with the specified undefined behaviour.

    There is also Hyrum's Law, that non-architectural behavior (like
    order of hash table elements) which is seemingly consistent
    within a set of implementations will be assumed. The non-
    architectural behavior can either become architectural (at least
    within the context of implementations used by the set of
    software depending on that behavior) or software will break
    (possibly quietly).

    (This also seems related to the Robustness Principle, "be
    conservative in what you send, be liberal in what you accept".
    While such allows communication to work more often by ignoring
    'inconsequential' errors, it also encourages misbehavior by not
    giving feedback.)

    For C, I got the impression that much of the undefined behavior
    was initially meant to allow support for things like hardware
    that generated exceptions on signed integer overflow. These
    should have been defined as target-dependent behavior. Some
    behaviors may need to be platform-dependent or even compiler-
    dependent (or flag dependent if the compiler developer wants to
    error on not specifying a behavior rather than having a default
    behavior that can optionally be overridden). I am not certain
    how array overruns could be handled; in some dynamic cases such
    would generate a protection exception, in some cases it could
    cause arbitrary code execution.
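
    As a rough illustration of the signed-overflow and shift cases mentioned above, the following C sketch checks the operands before the operation, so the undefined case is never executed. The helper names `checked_add` and `checked_shl` are made up for this example, and returning 0 for an out-of-range shift count is a choice made by the sketch, not something the C standard specifies.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Adds a and b without ever executing a signed overflow (undefined
 * behaviour in ISO C). Returns false and leaves *sum untouched when the
 * mathematical result does not fit in an int. */
bool checked_add(int a, int b, int *sum)
{
    assert(sum != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;           /* would overflow: report instead of computing */
    *sum = a + b;
    return true;
}

/* Left shift of an unsigned value with the count range-checked first.
 * A count >= the width of the type is undefined behaviour in C, so this
 * sketch (by its own choice) defines the result as 0 in that case. */
unsigned checked_shl(unsigned v, unsigned count)
{
    if (count >= sizeof(unsigned) * CHAR_BIT)
        return 0u;
    return v << count;
}
```

    A target that traps on overflow and one that wraps would both behave identically for callers of `checked_add`, which is roughly the portability that "target-dependent behavior" would otherwise have to be documented for.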

    If MY66000 ever becomes a popular architecture with many
    implementations, and if discard's effect has been specified as making
    any read access before a write access to any memory in the cache line "undefined behaviour", we may see implementations that implement
    discard with effects that do not reflect your expectations at all.

    I would expect Mitch to define the behavior within certain
    bounds, including, e.g., "returns a value previously present and
    accessible to the context". If the purpose is to support
    something like thread-local transactional memory (where aborted
    speculation can recover old values), there would have to be some
    constraints for such to be useful (e.g., to prevent a
    speculative value from being written back to the level of the
    memory hierarchy that is treated as authoritative).
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Wed May 13 20:18:22 2026
    From Newsgroup: comp.arch

    On 5/11/26 3:29 AM, Anton Ertl wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/10/26 8:33 PM, MitchAlsup wrote:
    [...]
    The OS zeros the physical page before assigning it to the new
    context (or more likely assigns a zero page and does copy on
    write, which is just zeroing the page).

    Assigning a zero page for reading is a good idea. Copying that page
    on writing appears inefficient to me, because it needs to read the
    zero page into cache and write it to a newly allocated page.

    Copying in soon-to-be-overwritten zero cache blocks would be
    inefficient. One straightforward way to reduce this would be to
    track dirty cache blocks within the new page in a structure
    associated with the L1 TLB. (This would require the other blocks
    to be zeroed when the PTE/block tracking storage is evicted.)
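
    A software model of that per-page block tracking might look like the sketch below. It assumes a 4096-byte page with 64-byte blocks (so one 64-bit mask covers the page); `struct fresh_page` and its functions are hypothetical names for illustration, not part of any real TLB design.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { PAGE_BYTES = 4096, BLOCK_BYTES = 64, BLOCKS = PAGE_BYTES / BLOCK_BYTES };

struct fresh_page {
    uint8_t *frame;    /* newly allocated physical frame, contents stale */
    uint64_t written;  /* bit i set: block i has been instantiated */
};

void fresh_page_init(struct fresh_page *p, uint8_t *frame)
{
    p->frame = frame;
    p->written = 0;
}

/* Zero a block the first time it is touched; no read of old data needed. */
static void instantiate(struct fresh_page *p, unsigned block)
{
    uint64_t bit = UINT64_C(1) << block;
    if (!(p->written & bit)) {
        memset(p->frame + block * BLOCK_BYTES, 0, BLOCK_BYTES);
        p->written |= bit;
    }
}

void fresh_page_store(struct fresh_page *p, unsigned off, uint8_t v)
{
    assert(off < PAGE_BYTES);
    instantiate(p, off / BLOCK_BYTES);
    p->frame[off] = v;
}

/* Untouched blocks read as zero without looking at the stale frame. */
uint8_t fresh_page_load(const struct fresh_page *p, unsigned off)
{
    assert(off < PAGE_BYTES);
    if (!(p->written & (UINT64_C(1) << (off / BLOCK_BYTES))))
        return 0;
    return p->frame[off];
}

/* When the tracking entry is evicted, the remaining blocks must be zeroed. */
void fresh_page_evict(struct fresh_page *p)
{
    for (unsigned b = 0; b < BLOCKS; b++)
        instantiate(p, b);
}
```

    The eviction step is where the deferred cost shows up: every block never written still has to be zeroed before the mask can be dropped.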

    It is also possible to use indirection in L2 cache blocks to
    support zero blocks with only tag storage. Such indirection can
    be useful for aligned copying, NUCA, and better replacement
    policies.

    Some forms of cache compression (perhaps especially those that
    emphasize run length encoding of the most significant bits)
    would also reduce the overhead of storing zeroed cache blocks.

    (In some sense, copying a zero page is only half copying since
    the read is "free".)

    A better approach is to do just the writes. I think that zeroing the
    page on demand is a good approach, because then it is already in the
    D-cache, but AFAIK Linux actually zeros physical pages ahead of time typically on a separate (otherwise idle) core, and just maps one of
    those pages to the virtual page that needs to be written to. I wonder
    why Linux does that.

    Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
    and the interconnect is designed to transport the page zero in one
    transaction.

    This is more flexible than having cache line and page clearing
    instructions.

    In what way is it more flexible? It is a page-clearing instruction.

    My 66000's memory set instruction is not limited to a page size
    defined when the instruction was generated. IBM's Data Cache
    Block Zero instruction had a compatibility problem when software
    written for early PowerPC caches was to be run on POWER (G5)
    with 128-byte cache blocks.

    If one architecturally defines cache block size and page size,
    one is stuck working around that if a different size is better.
    If these sizes are allowed to vary architecturally, instructions
    dependent on those sizes will change semantics.

    A memory set instruction could clear an arbitrarily aligned 64
    KiB chunk and exploit accidental cache block and page alignment.
    (Using the coherence of the buffers used for ESM, it might even
    be possible to make such atomic by changing the tags from
    cache-block-aligned addresses to page-aligned, at least for ordinary
    memory. I suspect such would not be worthwhile and it would add
    a little extra hardware to track the blocks that have been
    instantiated. Such would also delay the next ESM operation until
    all the pages have been updated [though hardware could use
    entries as they are released so a 2-cache block ESM could
    complete even if some entries are still handling a memory set
    operation; I can probably devise even more nearly pointless
    complexity].)

    (Yes, multiple page or cache block clearing instructions could
    be merged into a single hardware operation. A memory setting
    instruction front-loads the complexity with the intention [I
    guess] of having less legacy constraint and broader
    utilization.)
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Wed May 13 21:11:03 2026
    From Newsgroup: comp.arch

    On 5/10/26 5:03 PM, MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    Robert Finch <robfi680@gmail.com> posted:
    [snip cache control instruction inclusion question]

    I broke the instruction into 3 sub-groups::
    a) prefetch
    b) invalidate
    c) post-push

    Prefetch brings data closer to the CPU (caches) and provides a specifier
    to which cache {I$, D$, L2, L3} and whether one wants write permission
    (or not).

    Invalidate gets rid of cached data without writing back.

    Post-Push pushes modified data farther from the CPU caches.

    In thinking a little about pushing data, I wonder if "push to
    subscriber" might make sense. This would require a core/
    cache/thread to explicitly "subscribe" to memory addresses. (I
    am not certain if a failed ESM might use something like a
    subscription request to prefetch data.) It might not even be
    necessary for the producer to push the data, though that might
    reduce the overhead in having to check for subscriptions. (An
    ESM transaction that failed another could know which cache lines
    to push and where.)

    If subscriptions were persistent (like for single-reader
    buffers), declarations would be less frequent but that would
    also require more thread state. (Repeating the subscription
    every time the buffer rolled over may be an acceptable tradeoff
    between lost prefetch after a context switch, needing to keep
    extra per-thread state, and more frequent refreshing of the
    subscription.)

    For multiple (reader) subscribers, there may be choices between
    replication and latency.

    (Although it would require two tag probes to detect a
    "subscription", it would be possible to use a single tag per
    page (or other large granule) with data-less cache tags to mark
    such interest. Even with a modulo power-of-two cache indexing,
    conflict issues could be reduced by having the 'listening' cache
    block for a page vary according to the upper bits. Such extra
    tags might also be used for other prefetching. However,
    dedicated hardware would have some advantages.)

    I launched this topic because I can put as many as 32 instructions in
    this sub-group, and after months of thinking, I only found 19 to put there.
    {yes, this violates the notion that the R in RISC should mean reduced}

    Turn cache on/off (D$ only)

    Why would you want the cache turned off??

    Defective cache blocks without fine-grained disabling? Avoid
    allocation and writeback for a workload that has an access
    stream larger than the cache capacity? Prepare to power down the
    core? Prepare to use the cache as tightly coupled memory?
    Provide coherence when the cache is not coherent for specific
    agents?? Test the performance effect of the cache without
    performance counters?? Because it is there???

    [snip]

    Both the I$ and D$ caches can be invalidated with a single instruction.

    That may take a long time !

    If the valid bits have a shared write port separate from the
    other metadata, this can be done in a single cycle (for fairly
    large caches). If individual block invalidations inserted an
    invalid address as well as using a validity bit, a counter and a
    validity sense bit could be used to provide fast, less frequent
    invalidation with invalid addresses being gradually inserted to
    allow the sense to be reversed again. There may be other
    techniques to make this fast in the common cases.
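
    One of those other techniques can be sketched in C. Note that this is not the sense-bit scheme described above but a simpler, related one swapped in for illustration: each line records the epoch in which it was filled, a line is valid only if its epoch matches the current one, and a whole-cache invalidate is a single counter increment. The names and the 8-bit epoch width are assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { LINES = 256 };

struct cache_tags {
    uint8_t line_epoch[LINES]; /* epoch each line was filled in; 0 = never valid */
    uint8_t epoch;             /* current epoch, always in 1..255 */
};

void tags_init(struct cache_tags *t)
{
    memset(t->line_epoch, 0, sizeof t->line_epoch);
    t->epoch = 1;
}

void tags_fill(struct cache_tags *t, unsigned line)
{
    assert(line < LINES);
    t->line_epoch[line] = t->epoch;
}

bool tags_valid(const struct cache_tags *t, unsigned line)
{
    assert(line < LINES);
    return t->line_epoch[line] == t->epoch;
}

/* Single-line invalidate: epoch 0 never matches the current epoch. */
void tags_invalidate_line(struct cache_tags *t, unsigned line)
{
    assert(line < LINES);
    t->line_epoch[line] = 0;
}

/* Whole-cache invalidate: normally one increment. Only when the counter
 * wraps is a slow sweep needed, so stale epochs cannot alias new ones. */
void tags_invalidate_all(struct cache_tags *t)
{
    t->epoch = (uint8_t)(t->epoch + 1);
    if (t->epoch == 0) {
        memset(t->line_epoch, 0, sizeof t->line_epoch);
        t->epoch = 1;
    }
}
```

    The common case costs one counter update; the sweep on wraparound plays the role of the gradual background cleanup in the scheme described above.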
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 14 01:36:42 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 5/11/26 2:39 AM, Anton Ertl wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    With respect to Stefan Monnier's seeing this as undefined
    behavior, I think this might be presented similarly to memory
    ordering with a weaker memory model.

    Is that supposed to be a defense of the discard instruction? It
    isn't. Descriptions of weak memory models are full of "undefined behaviour".

    Yes, or at least a statement that if a weaker memory model is
    accepted (which I think Mitch chose for My 66000) then such
    undefined behavior could be acceptable.

    My 66000 memory model is rather strong and depends on memory type
    and whether an ATOMIC event is in progress.

    Configuration space access is strongly ordered
    MMI/O access is sequentially consistent
    ROM access is completely unordered
    Cacheable access with no ATOMIC is causally consistent
    Cacheable access with ATOMIC is sequentially consistent
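
    Read as a lookup table, the list above could be written down like this in C. The enum and function names are invented for the sketch; only the five mappings come from the list itself, and the enumerator order is not meant to claim any formal strength ranking.

```c
#include <assert.h>
#include <stdbool.h>

/* Memory types and ordering labels taken from the list above. */
enum mem_type { MEM_CONFIG_SPACE, MEM_MMIO, MEM_ROM, MEM_CACHEABLE };
enum ordering { ORD_UNORDERED, ORD_CAUSAL, ORD_SEQUENTIAL, ORD_STRONG };

/* Ordering that applies to an access, given its memory type and whether
 * an ATOMIC event is in progress (only matters for cacheable memory). */
enum ordering access_ordering(enum mem_type type, bool atomic_in_progress)
{
    switch (type) {
    case MEM_CONFIG_SPACE: return ORD_STRONG;
    case MEM_MMIO:         return ORD_SEQUENTIAL;
    case MEM_ROM:          return ORD_UNORDERED;
    case MEM_CACHEABLE:    return atomic_in_progress ? ORD_SEQUENTIAL
                                                     : ORD_CAUSAL;
    }
    assert(!"unknown memory type");
    return ORD_STRONG;          /* most conservative fallback */
}
```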

    Weak memory models are a bad idea like many supercomputer ideas (e.g., division with wrong results,

    You seem to be categorizing BGB ISA as a supercomputer.

    or imprecise exceptions),

    Or none at all {many RISCs}

    but unlike
    other bad ideas they have made it almost to general-purpose computing.
    And they are exactly bad ideas because:

    1) The unpredictable results if their restrictions are not heeded.

    And most programmers remain unaware of the restrictions, compiler writers remain unwilling to even read about restrictions, and ASM(ascii) remains rampant...

    Is this not also true for sequential consistency without locks
    (i.e., with race conditions). If locks (or often challenging to
    design lockfree methods) are acceptable "restrictions", then why
    not allow memory barriers.

    If locks do not implement barriers themselves in a given
    programming language/library (which has performance costs so a
    "zero abstraction cost" language like C++ might want to avoid
    such), this would presumably add significant correctness
    risk/complexity.

    2) The difficulty of heeding these restrictions by adding close to the minimum necessary strongifying instructions (e.g., memory barriers and atomic instructions). In particular, thanks to 1 there is no way to
    check the correctness of the placement of these instructions by
    testing.

    I think hardware could provide detection of at least some race
    conditions. This could also help with debugging improper lock
    use.

    My 66000 has an instruction that can inform SW that deleterious
    interference has been detected on a cache line currently in
    ATOMIC use.

    I found this to be a necessity (circa 2004)

    Using the minimal memory barrier strengths seems like using
    lock-free programming; there might be some cases that are
    trivial and safe, but I would tend to push such toward "experts
    only" and the performance gains would typically not be that
    great if "moderate" effort is made in hardware.

    Not being a hardware person, I do not know how expensive (in power-performance-area or design complexity) sequential
    consistency is relative to a more relaxed model with significant
    optimization of barriers.

    Consider a GBOoO machine under sequential consistency, a LD which
    can have its address calculated early cannot leave the CPU area
    until all older stores currently in flight have left the CPU area.
    This would dramatically add to L1 cache miss latency, and would
    add moderately to L1 cache hit latency.

    An example is how many cycles the latest IBM Z-series takes to
    access L1 cache.

    I _suspect_ such depends on the
    communication between threads (cores/caches), perhaps especially
    the scale of the system.

    System scale multiplies the exponent of <design and test> complexity.

    3) The extreme performance cost of the strongifying instructions, so
    when you use some simple scheme that guarantees correctness (e.g., inserting a write barrier after every store and a read barrier before
    every load), the resulting program is extremely slow.

    I have had to debug programs where after each printf() I had to insert
    a fflush() just so I could read the error messages before dereferencing
    a random NULL pointer.

    This assumes that barriers are expensive. This is similar to
    assuming that context switches are expensive; it is a historical
    fact but not a technical necessity.

    *If* there is very little advantage to optimized barrier weaker
    consistency, then using a simpler abstraction would probably be
    a better choice.

    In the case of weak memory models the hardware designers have the
    excuse that they are too lazy to implement a strong memory model efficiently (although they typically frame it by showing the
    inefficiency of some lazy implementation of a strong memory model),
    and that not that big parts of the software actually communicate with
    other threads.

    Or did not want the performance degradation.

    While not wanting to do hard and underappreciated work is
    understandable and humans will tend to justify their choices to
    themselves and others using less than perfect argumentation, I
    think the above statement is too hostile to hardware designers.

    The problem is that SW requires the illusion that the CPU is doing
    exactly 1 thing and then moves on to the next thing--while HW knows
    that 10-FUs are making progress this cycle, while 8 miss buffers are
    waiting for requests to return in random order, while watching for
    interrupts and exceptions (to say nothing of machine checks), while
    fetching from the ICache while various SNOOPs are hitting on those
    FETCHes.

    Consider the GBOoO machine <again> here we have an implementation
    that can process millions of instructions, leaving proper bit
    patterns in every register and in memory, and have done it
    in an unacceptable order ?!?!?

    There is still no model of multi-processing that is acceptable to
    verification people--so no reliable way to test the multi-process-
    ness of <various> chip <components>.

    There may well be a "laziness" in not researching what the costs
    are in current software,

    How many months of chip tape-out are you willing to accept to
    allow your HW designers to go and do this research ???

    how much a weaker memory model
    discourages new useful software techniques,

    I am reminded about how all the new nanny features of automobiles
    are loathed by the drivers/owners of those cars with those features.

    and how complex and
    effective hardware mechanisms for providing stronger consistency
    models are. But laziness is generally a combination of fear,
    fatigue, and lack of motivation; adding complexity increases
    schedule risk (fear) and without customers pushing for a feature
    (and a belief that the feature can be delivered within time and
    financial budgets) motivation will be lower. I would not be
    surprised that there is also a fatigue factor.

    ATOMICITY is a prime example:: because of the gap between what HW
    people understand and what the vonNeumann model provides, little
    progress has been made since the development of LL/SC in 1980 !!
    With entire highways littered with (at best) semi-useable ATOMIC
    support invented since then.

    Memory barriers have all the problems of ATOMICITY and those of
    scheduling of waiting threads in HW. The only machine that even
    did remotely well on that was the (now defunct) Denelcor HEP,
    where waiting threads would quietly sleep until the barrier was
    'done' and all could be awakened (1 cycle each).

    Calling a person lazy seems generally ineffective in reducing
    fear or increasing motivation much less reducing fatigue. I
    suspect this strategy is also not very effective at an
    organization level.

    Why is it that eXcel makes such a great MUNG programming language?
    It is precisely that it (eXcel) does all the looping and subroutine
    calling on behalf of spreadsheet users. ... And has great graphics
    support.

    I, personally, have eXcel spreadsheets with over 1,000,000 cells
    with equations in them leading to 1 VISUAL result.

    On a personal level, one can even know that fears are
    irrational, that most of the fatigue is psychologically induced,
    and that getting something done is usually good yet be unable to
    act.

    Incremental exposure tends to help with fears; how can hardware
    designers incrementally add complexity with respect to memory
    model [my guess would be by working on reducing the cost of
    barriers in many cases while still requiring them]? Fatigue can
    be countered by limiting analysis and receiving fresh
    perspectives; maybe at the end of the AI hype cycle there could be
    a burst of activity in multithreaded programming and maybe
    academia or "for profit" research organizations can ease the
    perception of risk that encourages overanalysis (cheap capital
    might also help, so again after the AI hype cycle). Research and
    development could also address motivation; while it has been
    known for decades that the cost of sequential consistency could
    be substantially reduced by speculative out-of-order execution,
    the tradeoffs change with time (both in hardware complexity and
    type of software encouraging hardware development).

    But I think that the chilling effects of difficulties in inter-thread communication have kept that back. But difficulties already exist
    with sequential consistency; transactional memory looked like it might
    come to the rescue, but after the hype from about 20 years ago is now
    in the valley of disappointment.

    Interestingly (to me at least) the hardware behind transactional
    memory can also be used to detect when a thread is accessing
    (possibly) lock-protected memory without acquiring a lock.

    I suspect that HW TM will never take hold of the CPU industry.

    Even with the end of Dennard scaling and other hardware factors (as
    well as theoretical ILP limits) reducing single-thread
    performance improvement, it is disappointing that
    multithreaded programs have not become more common (except that
    many uses have good enough performance, which is not
    disappointing).

    I took a shot at this above: SW has not developed a model which
    is sufficiently understandable to HW for HW to implement anything
    close enough to what SW really wants. {I know as I have tried over
    45 years now with little success--perhaps you are right and I am
    simply tired.}

    I suspect transactional memory was sold too much as a trivial
    solution and some issues were not well understood.

    My point exactly; SW created TM and then told HW guys what was
    needed (but only to the 90% level); HW guys built various TMs
    and none of them was "really" what SW wanted ?!?!? Telling
    HW guys how TM works but no model of corner cases is a way of
    steering the ship in a random direction hoping not to hit an
    iceberg.

    I also
    suspect that a more gradual adoption with long-range plans for
    extension (which plans could be adjusted with gained experience)
    might have been more effective.

    (The issue I have with limited optimistic concurrency mechanisms
    like AMD's Advanced Synchronization Facility and My 66000's
    Exotic Synchronization Mechanism is not the initial limits but
    that there seems to be little presentation of an interface that
    can be extended.

    For example:: what ??

    Just as such avoid requiring new instructions
    for every new simpler atomic operation, a broader interface
    conception might allow extension without adding new
    instructions.

    One can write in C every synchronization method ever conceived of
    in academia and industry <literature> using the ESM specification in
    My 66000 Software Principles; one can use it to help SW build a
    working TM using SW principles.

    Of course, just as early broad software
    abstractions present the risk of choosing the wrong abstraction
    from lack of experience, having too many exceptional cases, and
    delaying release, an ISA can be designed with excessive
    flexibility that is not exploited much later and has immediate
    costs.)

    That is the problem when you have only been working on it for 22 years----------------alone---------------without feedback

    I.e., the result of a read
    would still return a previously held value, but the "version"
    might be unexpected. The result is not "undefined" but timing
    dependent.

    "Undefined behaviour" typically originally means something where the
    people specifying it have a good idea what can happen, but where it is
    too complex and has too little benefit to actually specify it. E.g.,
    an out-of-bounds access to an object resulted in actually accessing
    that memory, but what is there is not specified and is implementation-dependent, and it might be that the address is not accessible, and results in a trap (and it seems to me that everything
    that may result in a trap on some machines has been labeled "undefined behaviour" in C; note the fine differences between
    implementation-defined and undefined behaviour for shifts in C). Only later compiler shenanigans like "optimizing" a loop that performs an out-of-bounds access into an endless loop were introduced and
    justified with the specified undefined behaviour.

    There is also Hyrum's Law, that non-architectural behavior (like
    order of hash table elements) which is seemingly consistent
    within a set of implementations will be assumed. The non-
    architectural behavior can either become architectural (at least
    within the context of implementations used by the set of
    software depending on that behavior) or software will break
    (possibly quietly).

    It never ceased to amaze me that Solaris would not boot without a
    real TLB in the simulator. Just referencing all the right memory
    where the tables were stored (using the CR holding said pointer)
    was not enough--you had to have a TLB with at least 5 FA entries.

    (This also seems related to the Robustness Principle, "be
    conservative in what you send, be liberal in what you accept".
    While such allows communication to work more often by ignoring 'inconsequential' errors, it also encourages misbehavior by not
    giving feedback.)

    For C, I got the impression that much of the undefined behavior
    was initially meant to allow support for things like hardware
    that generated exceptions on signed integer overflow. These
    should have been defined as target-dependent behavior. Some
    behaviors may need to be platform-dependent or even compiler-
    dependent (or flag dependent if the compiler developer wants to
    error on not specifying a behavior rather than having a default
    behavior that can optionally be overridden). I am not certain
    how array overruns could be handled; in some dynamic cases such
    would generate a protection exception, in some cases it could
    cause arbitrary code execution.

    If MY66000 ever becomes a popular architecture with many
    implementations, and if discard's effect has been specified as making
    any read access before a write access to any memory in the cache line "undefined behaviour", we may see implementations that implement
    discard with effects that do not reflect your expectations at all.

    I would expect Mitch to define the behavior within certain
    bounds, including, e.g., "returns a value previously present and
    accessible to the context". If the purpose is to support
    something like thread-local transactional memory (where aborted
    speculation can recover old values), there would have to be some
    constraints for such to be useful (e.g., to prevent a
    speculative value from being written back to the level of the
    memory hierarchy that is treated as authoritative).

    Mitch considers TM to be a SW problem and My 66000 ISA supports SW
    by allowing multiple lines to participate in a TM transaction,
    without over constraining how SW gets its job done, and with enough
    HW defined behavior that SW can make a robust system with it. Other
    than that TM is a SW problem.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu May 14 14:57:33 2026
    From Newsgroup: comp.arch

    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/11/26 3:29 AM, Anton Ertl wrote:

    A better approach is to do just the writes. I think that zeroing the
    page on demand is a good approach, because then it is already in the
    D-cache, but AFAIK Linux actually zeros physical pages ahead of time
    typically on a separate (otherwise idle) core, and just maps one of
    those pages to the virtual page that needs to be written to. I wonder
    why Linux does that.

    Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
    and the interconnect is designed to transport the page zero in one
    transaction.

    This is more flexible than having cache line and page clearing
    instructions.

    In what way is it more flexible? It is a page-clearing instruction.

    My 66000's memory set instruction is not limited to a page size
    defined when the instruction was generated. IBM's Data Cache
    Block Zero instruction had a compatibility problem when software
    written for early PowerPC caches was to be run on POWER (G5)
    with 128-byte cache blocks.

    ARM has a system register that software can access to determine
    the cache line size for the DC ZVA instruction.


    If one architecturally defines cache block size and page size,
    one is stuck working around that if a different size is better.

    Or, provide a mechanism for the software that performs
    the zeroing to determine both the cache and page sizes
    dynamically.
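
    For the page-size half of that, a minimal C sketch on a POSIX system could use `sysconf(_SC_PAGESIZE)` rather than a compile-time constant. The helper names `runtime_page_size` and `zero_pages` are made up here, and plain `memset` stands in for whatever zeroing instruction the target provides.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns the page size reported by the OS at run time, or 0 on failure. */
size_t runtime_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    return sz > 0 ? (size_t)sz : 0;
}

/* Zeroes n_pages whole pages starting at base, which must be page aligned.
 * Nothing here bakes a page size into the binary. */
void zero_pages(void *base, size_t n_pages)
{
    size_t page = runtime_page_size();
    assert(page != 0);
    assert((uintptr_t)base % page == 0);
    memset(base, 0, n_pages * page);
}
```

    The same binary then works unchanged on systems with 4 KiB, 16 KiB or 64 KiB pages, at the price of one query instead of a constant.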

  • From BGB@cr88192@gmail.com to comp.arch on Thu May 14 12:22:03 2026
    From Newsgroup: comp.arch

    On 5/13/2026 2:02 AM, Lawrence D'Oliveiro wrote:
    On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

    There are "architectures" like Power where "data memory" and
    "instruction memory" are not coherent, even when they are the same
    memory.

    Also the Motorola 68040.

    Upon updating instructions (e.g., from a JIT compiler), they require
    that the modifying thread(s) write the lines back from the data
    cache to a shared cache or main memory, and that the executing
    threads invalidate these cache lines and flush their pipeline. I
    think that that's a bad idea, not just because it exposes
    microarchitectural concepts like cache and pipeline to the
    architecture, and leads to unpredictable results in some usage
    scenarios (see my signature), but also because the requirements on
    the executing threads are extremely difficult to meet if the
    executing threads run independently of the modifying thread(s). Or,
    in short, IA-32 and AMD64 did the right architecture for that.

    One technique for implementing lexical binding and functions as
    first-class objects involves generating code at run-time. Some people
    would immediately gasp and say "self-modifying code" as soon as I
    mentioned this, even though the two are quite different things.

    I think it's quite desirable that an architecture guarantees that an
    (to coin a phrase) "instruction view" versus a "data view" of the same
    memory location will never show different values.

    Sometimes, there is a difference between "nice to have", vs "cost
    effective".

    It is nicer, say, to have I$/D$ coherence, and to not require explicit flushing and invalidation. Or, say, to have caches that are implicitly coherent between threads (Core A stores to a location, Core B loads from
    that location, Core B sees what Core A stored).


    The requirements to pull all this off in practice may add significant
    costs; and also in ways where the performance cost of the coherence
    mechanisms tend to scale upwards as core counts increase.

    Say, for example, if one has coherent caches, software that depends on
    the cache-coherent behavior, and much more than 2 or 4 cores, it is not difficult to imagine scenarios where waiting on cache-coherence
    mechanisms becomes a more significant cost than actual memory-transfer bandwidth on the bus.


    Say, typical scenario with incoherent caches:
    Core A Requests Line (for Write);
    Core B Requests Line (also for Write);
    L2 Cache sends a copy to A;
    L2 Cache sends a copy to B.
    A and B now have incoherent copies.

    Versus Say:
    Core A Requests Line (for Write);
    Core B Requests Line (also for Write);
    L2 Cache sends a copy to A;
    L2 Cache rejects B's Request;
    L2 Cache sends a request to A to write line back;
    Core A writes line back (flushing it locally);
    (Maybe) L2 signals to Core B that the line is now available.
    Core B Requests Line again (retry);
    L2 Cache sends a copy to B.


    In my approach, I went with incoherent caches, but with a special
    Volatile mechanism for some cases, say:
    Core A requests a line for Volatile Write;
    Core B Requests Line (also for Volatile Write);
    L2 Cache sends a copy to A;
    L2 Cache ignores B's Request (it can cycle the ring some more);
    L2 cache can track volatile lines and see that it is in-use.
    Core A writes back line and flushes local copy;
    L2 cache then marks the volatile access as complete.
    L2 Cache sends a copy to B
    Via the original request cycling around and hitting L2 again
    Core B writes back line and flushes local copy;
    L2 cache then marks the volatile access as complete.

    Because volatile accesses flush the cached dirty lines immediately, this
    means that there is a performance penalty, but these accesses can remain coherent (but without the impact of trying to make all memory coherent).
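
    A toy C model of the difference, with a cache line reduced to a single int and two "cores" each trying to increment it. This is only a sketch of the sequences listed above; `struct l2_line`, the function names, and the single-threaded stepping are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct l2_line {
    int  data;           /* line contents, reduced to one int */
    bool volatile_busy;  /* set while a core holds it for volatile write */
};

/* Core asks for the line for volatile write. Returns false when the L2
 * ignores the request because the line is in use (request cycles again). */
bool l2_request_volatile(struct l2_line *l2, int *copy)
{
    if (l2->volatile_busy)
        return false;
    l2->volatile_busy = true;
    *copy = l2->data;
    return true;
}

/* Core writes the line back and drops its copy; L2 marks access complete. */
void l2_writeback_volatile(struct l2_line *l2, int copy)
{
    assert(l2->volatile_busy);
    l2->data = copy;
    l2->volatile_busy = false;
}

/* Incoherent case: both cores get a copy before either writes back. */
int increment_twice_incoherent(void)
{
    int l2 = 0;
    int a = l2, b = l2;  /* L2 sends copies to A and to B */
    a++;
    b++;
    l2 = a;              /* A's write-back */
    l2 = b;              /* B's write-back overwrites it: one update lost */
    return l2;
}

/* Volatile case: B's request is ignored until A has written back. */
int increment_twice_volatile(void)
{
    struct l2_line l2 = { 0, false };
    int a = 0, b = 0;
    bool a_ok = l2_request_volatile(&l2, &a);  /* granted */
    bool b_ok = l2_request_volatile(&l2, &b);  /* ignored, line in use */
    assert(a_ok && !b_ok);
    a++;
    l2_writeback_volatile(&l2, a);
    b_ok = l2_request_volatile(&l2, &b);       /* request comes round again */
    assert(b_ok);
    b++;
    l2_writeback_volatile(&l2, b);
    return l2.data;
}
```

    In this model the incoherent sequence ends with 1 in the line and the volatile sequence ends with 2, which is the lost-update difference the volatile mechanism is meant to close.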


    For something like an inter-processor JIT, this would alas still require flushing the L1 caches in a way that is coordinated between threads.

    Normally, the mutex mechanism does not include I$ flushes, though one possibility could be to have, say, a separate JIT mutex lock, where if
    threads (upon trying to lock a mutex) see a JIT Sequence Number that
    does not match the expected value for that mutex on that processor core,
    it also triggers an I$ flush.

    Say:
    JIT Lock:
    Flush Caches;
    Lock Mutex;
    Increment JIT Sequence Number (JSN).
    Do stuff;
    Flush Caches;
    Unlock Mutex;
    Flush Caches;
    Set mutex to unlocked.
    Lock Mutex (Normal):
    Flush Caches;
    Lock Mutex;
    Check JSN against cores' current JSN;
    If mismatch, flush I$ and update core's JSN.
    Likely all via CPUID and a lookup table, not new arch.
    Do Stuff;
    Unlock Mutex:
    ...
    ...
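
The JSN idea above can be sketched in a few lines of Python. This is only a sketch under assumptions: `Core`, `JitAwareMutex`, `jit_lock`, and `seen_jsn` are hypothetical names, the mutex has no contention handling, and the data-cache flush steps from the outline are left out so that only the I$ bookkeeping is visible.

```python
class Core:
    """Toy core that remembers the last JIT Sequence Number it observed."""

    def __init__(self, name):
        self.name = name
        self.seen_jsn = 0
        self.icache_flushes = 0

    def flush_icache(self):
        self.icache_flushes += 1


class JitAwareMutex:
    """Toy mutex paired with a shared JIT Sequence Number (JSN)."""

    def __init__(self):
        self.locked = False
        self.jsn = 0

    def _acquire(self):
        assert not self.locked, "toy model: no contention handling"
        self.locked = True

    def jit_lock(self, core):
        # JIT lock: take the mutex and bump the JSN to announce new code.
        self._acquire()
        self.jsn += 1

    def lock(self, core):
        # Normal lock: compare the shared JSN with this core's last-seen
        # value and flush the I$ only on a mismatch.
        self._acquire()
        if core.seen_jsn != self.jsn:
            core.flush_icache()
            core.seen_jsn = self.jsn

    def unlock(self):
        self.locked = False


a, b = Core("A"), Core("B")
m = JitAwareMutex()
m.jit_lock(a); m.unlock()   # core A emits new code
m.lock(b); m.unlock()       # B sees the JSN change and flushes its I$
m.lock(b); m.unlock()       # nothing changed since, so no second flush
```

The point of the sequence number is that a core pays for an I$ flush only when code has actually changed since it last took a lock.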


  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu May 14 14:48:09 2026
    From Newsgroup: comp.arch

    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/11/26 2:39 AM, Anton Ertl wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    Weak memory models are a bad idea like many supercomputer ideas (e.g.,
    division with wrong results, or imprecise exceptions), but unlike
    other bad ideas they have made it almost to general-purpose computing.
    And they are exactly bad ideas because:

    1) The unpredictable results if their restrictions are not heeded.

    Is this not also true for sequential consistency without locks
    (i.e., with race conditions).

    There are cases where the result is predictable with sequential
    consistency but unpredictable with weak consistency. E.g., starting
    with x=0, y=0, if you have

    thread 0        thread 1
    x=1             while (!y)
    y=1               ;
                    print x

    with sequential consistency thread 1 is guaranteed to print 1, whereas
    weak consistency may print 0 or 1.
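
The claim can be checked with a tiny model, assuming thread 1 spins until it observes y=1 and then reads x. The Python sketch below is illustrative only: `printable_values` is a made-up helper, and "weak" is modelled solely as thread 0's two stores becoming visible in either order (real weak models also allow thread 1's loads to be reordered, which is not modelled here).

```python
def printable_values(store_orders):
    """Return the set of values thread 1 can print.

    store_orders lists the orders in which thread 0's stores may become
    visible to thread 1; sequential consistency allows only program order.
    """
    printed = set()
    for order in store_orders:
        mem = {"x": 0, "y": 0}
        for var in order:
            mem[var] = 1                # thread 0's next store becomes visible
            if mem["y"]:                # thread 1's spin loop can exit here...
                printed.add(mem["x"])   # ...and it then reads x
    return printed


sequential = printable_values([("x", "y")])
weak = printable_values([("x", "y"), ("y", "x")])
```

Because y never returns to 0, checking x in every state where y is set also covers the case where thread 1 leaves the loop early and reads x later.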

    But sure, atomic reads and writes with sequential consistency are not
    always enough.

    If locks (or often challenging to
    design lockfree methods) are acceptable "restrictions", then why
    not allow memory barriers.

    "Allow"? If you implement sequential consistency, memory barriers are
    noops. If you have enough encoding space, go ahead and add a bunch of
    noops. I would not know what they are good for, however.

    2) The difficulty of heeding these restrictions by adding close to the
    minimum necessary strongifying instructions (e.g., memory barriers and
    atomic instructions). In particular, thanks to 1 there is no way to
    check the correctness of the placement of these instructions by
    testing.

    I think hardware could provide detection of at least some race
    conditions.

    "Some" is not good enough.

    Concerning "race condition", while one might describe the result of
    the lack of the memory barriers necessary for making the example above
    work on weak memory models as "race condition", I guess the fans of
    weak memory models have more refined names for that. The code above
    certainly does not have a race condition on a sequentially consistent processor.

    Using the minimal memory barrier strengths seems like using
    lock-free programming; there might be some cases that are
    trivial and safe, but I would tend to push such toward "experts
    only"

    Exactly.

    and the performance gains would typically not be that
    great if "moderate" effort is made in hardware.

    I have no idea what you mean here. Performance gains of what over
    what, and what is '"moderate" effort'?

    Not being a hardware person, I do not know how expensive (in
    power-performance-area or design complexity) sequential
    consistency is relative to a more relaxed model with significant
    optimization of barriers.

    What do you mean with "significant optimization of barriers"?

    As for implementing sequential consistency, Daya et al. have described
    one way to do it. I expect that if the hardware designers put their
    minds to it, they will come up with better ways.

    @InProceedings{daya+14,
    author = {Bhavya K. Daya and Chia-Hsin Owen Chen and Suvinay
    Subramanian and Woo-Cheol Kwon and Sunghyun Park and
    Tushar Krishna and Jim Holt and Anantha
    P. Chandrakasan and Li-Shiuan Peh},
    title = {{SCORPIO}: A 36-Core Research-Chip Demonstrating
    Snoopy Coherence on a Scalable Mesh {NoC} with
    In-Network Ordering},
    crossref = {isca14},
    OPTpages = {},
    url = {http://projects.csail.mit.edu/wiki/pub/LSPgroup/PublicationList/scorpio_isca2014.pdf},
    annote = {The cores on the chip described in this paper access
    their shared memory in a sequentially consistent
    manner; what's more, the chip provides a significant
    speedup in comparison to the distributed directory
    and HyperTransport coherence protocols. The main
    idea is to deal with the ordering separately from
    the data, in a distributed way. The ordering
    messages are relatively small (one bit per core).
    For details see the paper.}
    }

    @Proceedings{isca14,
    title = "$41^\textit{st}$ Annual International Symposium on Computer Architecture",
    booktitle = "$41^\textit{st}$ Annual International Symposium on Computer Architecture",
    year = "2014",
    key = "ISCA 2014",
    }

    3) The extreme performance cost of the strongifying instructions, so
    when you use some simple scheme that guarantees correctness (e.g.,
    inserting a write barrier after every store and a read barrier before
    every load), the resulting program is extremely slow.

    This assumes that barriers are expensive.

    That's what I have read often, and the description of e.g. the Linux
    kernel mb() (from https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/memory-access-ordering-part-2---barriers-and-the-linux-kernel)
    sounds expensive:

    |A full system memory barrier. All memory operations before the mb() in
    |the instruction stream will be committed before any operations after
    |the mb() are committed. This ordering will be visible to all bus
    |masters in the system. It will also ensure the order in which accesses
    |from a single processor reaches slave devices.


    OTOH, https://kunalspathak.github.io/2020-07-25-ARM64-Memory-Barriers/
    describes that a loop with about 20 instructions, including 4 "dmb
    oshld" instructions, sees a 30% speedup (on unspecified ARM64
    hardware) by changing the code such that the memory barrier
    instructions are moved out of the loop, which is not as expensive as I
    would have expected.

    This is similar to
    assuming that context switches are expensive; it is a historical
    fact but not a technical necessity.

    Of course. Implement sequential consistency in your hardware, and
    memory barriers become noops.

    Another example is the trapb instruction on Alpha: For Alpha
    implementations (21064, 21164) with imprecise exceptions, it was
    costly, for implementations with precise exceptions (21264), it was a
    noop and cost almost nothing.

    In the case of weak memory models the hardware designers have the
    excuse that they are too lazy to implement a strong memory model
    efficiently (although they typically frame it by showing the
    inefficiency of some lazy implementation of a strong memory model),
    and that not that big parts of the software actually communicate with
    other threads.

    While not wanting to do hard and underappreciated work is
    understandable and humans will tend to justify their choices to
    themselves and others using less than perfect argumentation, I
    think the above statement is too hostile to hardware designers.

    It's well deserved, although simplistic, and the blame not only lies
    with hardware designers.

    A more detailed view is much longer: Shared-memory multiprocessors
    first appeared in supercomputers, mainframes and superminis.

    In mainframes and superminis they were not used for multi-threading,
    so only OS developers had to deal with that and they had only 2-4
    CPUs, so relatively simple ways to deal with multiprocessing were good
    enough, such as asymmetric multiprocessing or (in Linux up to 2011)
    SMP with a giant lock. The mainframes and superminis were used to run
    multiple processes, so user-level code did not talk to other CPUs
    except through the OS. And that was true for Unix systems as well for
    a long time, with POSIX threads appearing in 1995, and slow pickup by programmers. I guess that specialized software like database systems
    used multi-threading, but the mainstream was single-threaded for a
    long time, and much of it still is. So, for a long time, nearly all
    mainframe, mini and Unix users were not affected by memory models.

    On supercomputers the software crisis has not happened yet (hardware
    is still more expensive than software), so whatever hardware designers
    come up with in the name of speed (e.g., Cray's imprecise division),
    the software people have to cope with. (Counterevidence: The Fujitsu
    A64FX implements a strong memory model, although it implements the ARM
    A64 architecture, which requires only a weak memory model; and the ARM
    people are actively campaigning against Linux providing features that
    let software make use of the strong memory model on the A64FX and on
    Apple silicon).

    On the hardware side, the easiest way to do a shared-memory
    multiprocessing system is to do it without any coherence between
    processors. That's not usable, however, so the next laziest thing a
    hardware designer can do is to make it basically incoherent and
    provide facilities that let software tell the hardware when coherence
    is needed (e.g., memory barriers), and so weak memory models were
    born. Also, many of the early machines had no caches, so
    cache-coherence was not a thing.

    Given that in the beginning only little software (the locking
    implementations themselves, maybe some parts of databases and maybe
    some communicating parts of supercomputer software) actually needed
    any memory coherence, there were few complaints at the time, and this
    was deemed acceptable. Not to everyone, however: Intel provided a
    strong memory model for its multiprocessors and AMD followed suit, and
    SPARC provided several memory models, including the relatively strong
    TSO (but the fact that they also have the others indicates that they
    were not fully committed to efficient strong ordering).

    When multiprocessing hardware became more widespread and
    multi-threaded software became more widespread, the weak memory models
    became more problematic, so people wrote articles like that by Adve
    and Gharachorloo, which indoctrinates people about the greatness of
    weak memory models and how they should program them.

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    And as a consequence, we now have people who have invested a lot of
    time into learning how to work around the pitfalls of weak memory
    models and thus became advocates of weak memory models (a variant of
    the sunk-cost fallacy) and their supposed benefits.

    Many others in software either still program single-threaded
    applications and don't care, or just use locks and other higher-level
    ways to deal with concurrency, possibly after reading about weak
    memory models and deciding that their life is too short for this
    nonsense.

    So lazy hardware designers like those of ARM don't get enough feedback
    that would convince them to change course.

    There may well be a "laziness"

    Laziness is a virtue, except when it isn't. In this case it isn't:
    The cost of weak memory models to software and in particular the
    opportunity cost of people shying away from implementing lockless data
    structures is higher than the cost of implementing a strong memory
    model. The memory model is ideally sequential consistency, but
    whether the benefit/cost ratio of that compared to TSO is high enough
    to justify it is not yet clear, because the cost is not yet clear, for
    lack of trying.

    Calling a person lazy seems generally ineffective

    Given the state of affairs described above, calling a spade a spade
    can be useful to cut through the veil of advocacy for weak
    consistency. That advocacy gives many software people the impression
    that weak consistency may be hard to program, but that there are good
    reasons for that. The reason is that efficient strong consistency is
    more work for hardware implementors, and that not enough software
    people have made it clear that they won't put up with weak
    consistency. If many software people decided to use lockless only for
    strong memory models and use locks for weak memory models, resulting
    in slowness on ARM, ARM would support at least TSO ASAP.

    But I think that the chilling effects of difficulties in inter-thread
    communication have kept that back. But difficulties already exist
    with sequential consistency; transactional memory looked like it might
    come to the rescue, but after the hype from about 20 years ago is now
    in the valley of disappointment.
    ...
    Even with the end of Dennard scaling and other hardware factors (as
    well as theoretical ILP limits) reducing the single-thread
    performance improvement, it is disappointing that
    multithreaded programs have not become more common (except that
    many uses have good enough performance, which is not
    disappointing).

    This, and multithreaded programming is more work, especially if the
    idea is to make the program run faster.

    I suspect transaction memory was sold too much as a trivial
    solution and some issues were not well understood.

    It seems there has been relatively little pickup of the things that
    were implemented (why?), and people prefer to use the earlier atomic
    instructions instead (why? performance reasons?). The only
    application that is known for using TSX is SAP HANA.

    And eventually the fact that the TSX instructions provide a
    high-bandwidth speculative side channel led to their disabling across
    most of the line (apparently both HLE and RTM instructions, as they
    are both disabled in newer processors).

    So another benefit of implementing invisible speculation (to counter
    Spectre) would probably be to be able to enable TSX again.

    Here we have a Haswell that has HLE and RTM enabled. Maybe one day
    I'll find the time to play around with it.

    AMD's Advanced Synchronization Facility

    AMD apparently stopped working on that in 2009. Did they know things
    that Intel didn't?

    (This also seems related to the Robustness Principle, "be
    conservative in what you send, be liberal in what you accept".
    While such allows communication to work more often by ignoring
    'inconsequential' errors, it also encourages misbehavior by not
    giving feedback.)

    Well, the problem with C compilers and undefined behaviour is that
    they do not give feedback, but the next version of the compiler will
    silently compile your tested and working code into non-working code,
    or it will silently "optimize" away your bounds checks (do you have
    test cases for the bounds checks in your test suite?), etc.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 14 18:31:37 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 5/13/2026 2:02 AM, Lawrence D'Oliveiro wrote:
    On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

    There are "architectures" like Power where "data memory" and
    "instruction memory" are not coherent, even when they are the same
    memory.

    Also the Motorola 68040.

    Upon updating instructions (e.g., from a JIT compiler), they require
    that the modifying thread(s) write the lines back from the data
    cache to a shared cache or main memory, and that the executing
    threads invalidate these cache lines and flush their pipeline. I
    think that that's a bad idea, not just because it exposes
    microarchitectural concepts like cache and pipeline to the
    architecture, and leads to unpredictable results in some usage
    scenarios (see my signature), but also because the requirements on
    the executing threads are extremely difficult to meet if the
    executing threads run independently of the modifying thread(s). Or,
    in short, IA-32 and AMD64 did the right architecture for that.

    One technique for implementing lexical binding and functions as
    first-class objects involves generating code at run-time. Some people
    would immediately gasp and say "self-modifying code" as soon as I
    mentioned this, even though the two are quite different things.

    I think it's quite desirable that an architecture guarantees that an
    (to coin a phrase) "instruction view" versus a "data view" of the same
    memory location will never show different values.

    Sometimes, there is a difference between "nice to have", vs "cost effective".

    It is nicer, say, to have I$/D$ coherence, and to not require explicit flushing and invalidation. Or, say, to have caches that are implicitly coherent between threads (Core A stores to a location, Core B loads from that location, Core B sees what Core A stored).


    The requirements to pull all this off in practice may add significant
    costs; and also in ways where the performance cost of the coherence mechanisms tend to scale upwards as core counts increase.

    Say, for example, if one has coherent caches, software that depends on
    the cache-coherent behavior, and much more than 2 or 4 cores, it is not difficult to imagine scenarios where waiting on cache-coherence
    mechanisms becomes a more significant cost than actual memory-transfer bandwidth on the bus.


    Say, typical scenario with incoherent caches:
    Core A Requests Line (for Write);
    Core B Requests Line (also for Write);
    L2 Cache sends a copy to A;
    L2 Cache sends a copy to B.
    A and B now have incoherent copies.

    Try the above using an incoherent TLB model !!!

    Oh and BTW, the same arguments for cache coherence argue for
    TLB coherence. It is just easier for everyone.


    Versus Say:
    Core A Requests Line (for Write);
    Core B Requests Line (also for Write);
    L2 Cache sends a copy to A;
    L2 Cache rejects B's Request;
    L2 Cache sends a request to A to write line back;
    Core A writes line back (flushing it locally);
    (Maybe) L2 signals to Core B that the line is now available.
    Core B Requests Line again (retry);
    L2 Cache sends a copy to B.
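
The difference between the two quoted scenarios can be shown with a small Python sketch. It is a deliberately simplified model, not any real protocol: the function names are hypothetical, a "line" is a single integer, and each core just adds to its copy before evicting it.

```python
def incoherent(add_a, add_b):
    """Both cores get a private copy; whoever writes back last wins."""
    l2 = 0
    copy_a, copy_b = l2, l2   # L2 sends a copy to A and a copy to B
    copy_a += add_a
    copy_b += add_b
    l2 = copy_a               # A evicts its line
    l2 = copy_b               # B evicts its line, overwriting A's update
    return l2


def write_back_before_grant(add_a, add_b):
    """L2 makes A write the line back before B receives it."""
    l2 = 0
    copy_a = l2               # L2 sends a copy to A; B's request is rejected
    copy_a += add_a
    l2 = copy_a               # A writes the line back and drops its copy
    copy_b = l2               # B retries and now gets the current line
    copy_b += add_b
    l2 = copy_b
    return l2


lost = incoherent(3, 4)
kept = write_back_before_grant(3, 4)
```

In the first function A's contribution is silently lost. In the second both updates survive, at the cost of the extra round trips listed in the quoted sequence.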
  • From BGB@cr88192@gmail.com to comp.arch on Thu May 14 20:07:31 2026
    From Newsgroup: comp.arch

    On 5/14/2026 1:31 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/13/2026 2:02 AM, Lawrence D'Oliveiro wrote:
    On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

    There are "architectures" like Power where "data memory" and
    "instruction memory" are not coherent, even when they are the same
    memory.

    Also the Motorola 68040.

    Upon updating instructions (e.g., from a JIT compiler), they require
    that the modifying thread(s) write the lines back from the data
    cache to a shared cache or main memory, and that the executing
    threads invalidate these cache lines and flush their pipeline. I
    think that that's a bad idea, not just because it exposes
    microarchitectural concepts like cache and pipeline to the
    architecture, and leads to unpredictable results in some usage
    scenarios (see my signature), but also because the requirements on
    the executing threads are extremely difficult to meet if the
    executing threads run independently of the modifying thread(s). Or,
    in short, IA-32 and AMD64 did the right architecture for that.

    One technique for implementing lexical binding and functions as
    first-class objects involves generating code at run-time. Some people
    would immediately gasp and say "self-modifying code" as soon as I
    mentioned this, even though the two are quite different things.

    I think it's quite desirable that an architecture guarantees that an
    (to coin a phrase) "instruction view" versus a "data view" of the same
    memory location will never show different values.

    Sometimes, there is a difference between "nice to have", vs "cost
    effective".

    It is nicer, say, to have I$/D$ coherence, and to not require explicit
    flushing and invalidation. Or, say, to have caches that are implicitly
    coherent between threads (Core A stores to a location, Core B loads from
    that location, Core B sees what Core A stored).


    The requirements to pull all this off in practice may add significant
    costs; and also in ways where the performance cost of the coherence
    mechanisms tend to scale upwards as core counts increase.

    Say, for example, if one has coherent caches, software that depends on
    the cache-coherent behavior, and much more than 2 or 4 cores, it is not
    difficult to imagine scenarios where waiting on cache-coherence
    mechanisms becomes a more significant cost than actual memory-transfer
    bandwidth on the bus.


    Say, typical scenario with incoherent caches:
    Core A Requests Line (for Write);
    Core B Requests Line (also for Write);
    L2 Cache sends a copy to A;
    L2 Cache sends a copy to B.
    A and B now have incoherent copies.

    Try the above using an incoherent TLB model !!!

    Oh and BTW, the same arguments for cache coherence argue for
    TLB coherence. It is just easier for everyone.


    If you mean what happens if code running on Core A changes page mappings
    that apply to a thread running on Core B?...

    Could maybe add some mechanism to flag this case, such that (during the
    next context switch or system call), the processor in question knows it
    needs to flush the TLB.

    Could likely be handled in a similar way to the JIT-cache.
    Or, maybe the JIT flush handling and similar could be moved to the
    Syscall handler rather than Mutex handling (as code will hit the syscall handler often but mutex locking is much rarer).

    ...


    Versus Say:
    Core A Requests Line (for Write);
    Core B Requests Line (also for Write);
    L2 Cache sends a copy to A;
    L2 Cache rejects B's Request;
    L2 Cache sends a request to A to write line back;
    Core A writes line back (flushing it locally);
    (Maybe) L2 signals to Core B that the line is now available.
    Core B Requests Line again (retry);
    L2 Cache sends a copy to B.

  • From Andy Valencia@vandys@vsta.org to comp.arch on Thu May 14 18:53:55 2026
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    It is nicer, say, to have I$/D$ coherence, and to not require explicit flushing and invalidation. Or, say, to have caches that are implicitly coherent between threads (Core A stores to a location, Core B loads from that location, Core B sees what Core A stored).

    The requirements to pull all this off in practice may add significant
    costs; and also in ways where the performance cost of the coherence mechanisms tend to scale upwards as core counts increase.

    OTOH, again and again, I found that all the edge cases of explicit, software-driven coherence required pessimistic assumptions which were
    slower than leaning on hardware. Pick the tiny subset of SW behaviors
    which you'll support--and include enforcement--or else be prepared for
    the steep slope downward.

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    No AI was used in the composition of this message
  • From BGB@cr88192@gmail.com to comp.arch on Fri May 15 00:29:39 2026
    From Newsgroup: comp.arch

    On 5/14/2026 8:53 PM, Andy Valencia wrote:
    BGB <cr88192@gmail.com> writes:
    It is nicer, say, to have I$/D$ coherence, and to not require explicit
    flushing and invalidation. Or, say, to have caches that are implicitly
    coherent between threads (Core A stores to a location, Core B loads from
    that location, Core B sees what Core A stored).

    The requirements to pull all this off in practice may add significant
    costs; and also in ways where the performance cost of the coherence
    mechanisms tend to scale upwards as core counts increase.

    OTOH, again and again, I found that all the edge cases of explicit, software-driven coherence required pessimistic assumptions which were
    slower than leaning on hardware. Pick the tiny subset of SW behaviors
    which you'll support--and include enforcement--or else be prepared for
    the steep slope downward.


    Granted, software driven flushing isn't free either.

    Could require more looking into to figure out what is a better approach
    in terms of balancing performance against hardware cost and similar; and
    how tightly coherence actually needs to be maintained in practice.

    Well, and the tradeoff that some approaches to multi-threaded
    programming which work on a fully-coherent system will not work if
    threads are scheduled across cores in an incoherent model. So, say,
    absent knowing that a program is weak-model safe, it would be necessary
    to schedule all threads for a process on a single processor core.

    ...



  • From BGB@cr88192@gmail.com to comp.arch on Fri May 15 03:08:03 2026
    From Newsgroup: comp.arch

    On 5/13/2026 8:36 PM, MitchAlsup wrote:

    Paul Clayton <paaronclayton@gmail.com> posted:

    On 5/11/26 2:39 AM, Anton Ertl wrote:
    Paul Clayton <paaronclayton@gmail.com> writes:
    With respect to Stefan Monnier's seeing this as undefined
    behavior, I think this might be presented similarly to memory
    ordering with a weaker memory model.

    Is that supposed to be a defense of the discard instruction? It
    isn't. Descriptions of weak memory models are full of "undefined
    behaviour".

    Yes, or at least a statement that if a weaker memory model is
    accepted (which I think Mitch chose for My 66000) then such
    undefined behavior could be acceptable.

    My 66000 memory model is rather strong and depends on memory type
    and whether an ATOMIC event is in progress.

    Configuration space access is strongly ordered
    MMI/O access is sequentially consistent
    ROM access is completely unordered
    Cacheable access with no ATOMIC is causally consistent
    Cacheable access with ATOMIC is sequentially consistent


    Yeah, it is mostly my stuff where I am going with super-weak memory
    models...

    Could maybe make sense to allow hardware to support stronger models
    though, maybe pay a cost.

    For example, a page can be marked as volatile, and it will remain
    coherent, if slow (because every access makes a round trip to the L2
    cache or similar; where in my core the L2 cache is currently the sole
    bridge to main RAM).


    Weak memory models are a bad idea like many supercomputer ideas (e.g.,
    division with wrong results,

    You seem to be categorizing BGB ISA as a supercomputer.


    Yeah, it is not...


    Say:
    Originally, was fiddling with stuff
    Imagined use-case was mostly robots and CNC controller stuff;
    But, alas, this case better served by RV32 or similar;
    Or a plain microcontroller...
    Then running Doom and similar;
    Then running OpenGL and GLQuake
    Most of the RGB and SIMD stuff happened here;
    Basically, tweaking the ISA to allow OpenGL stuff to be faster.
    Also overlaps with stuff relevant for Neural-Nets.
    Now, for "gluing stuff onto RISC-V so it is less slow..."
    Because, in its pure form, RISC-V performance is lacking.
    And, no, RV-V will not make Doom, Quake, or GL stuff faster (*1).


    *1: And, no, I am not designing my ISA as an accelerator for LLMs, even
    if it could arguably be better at this task than plain RISC-V would.
    Ironically, running LLMs is probably something that could map OK to
    RV-V. But there is more to life than this.


    Granted, more debatable if there is a practical use-case for this stuff.

    Well, and possibly a lot of the RV people would be like "Who cares if
    you could add some features and make Doom 40% faster? We only care about SPEC..."



    or imprecise exceptions),

    Or none at all {many RISCs}


    For most trap-and-emulate cases, one does need to be able to fault on
    the correct instruction.

    So:
    TLB Miss Handling;
    FPU Emulation;
    ...
    All depending on being able to trap on the correct instruction.


    Well, or if one wants to cut corners on interrupt handling here, then
    they force themselves into the more expensive corner of supporting
    everything in hardware.

    Well, and because no ISA that I am aware of does software-managed
    address translation, but there is basically no way this could be done
    without effectively tanking performance.


    <snip>


  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri May 15 14:13:32 2026
    From Newsgroup: comp.arch

    On 5/14/2026 10:22 AM, BGB wrote:
    On 5/13/2026 2:02 AM, Lawrence D'Oliveiro wrote:
    On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

    There are "architectures" like Power where "data memory" and
    "instruction memory" are not coherent, even when they are the same
    memory.

    Also the Motorola 68040.

    Upon updating instructions (e.g., from a JIT compiler), they require
    that the modifying thread(s) write the lines back from the data
    cache to a shared cache or main memory, and that the executing
    threads invalidate these cache lines and flush their pipeline. I
    think that that's a bad idea, not just because it exposes
    microarchitectural concepts like cache and pipeline to the
    architecture, and leads to unpredictable results in some usage
    scenarios (see my signature), but also because the requirements on
    the executing threads are extremely difficult to meet if the
    executing threads run independently of the modifying thread(s). Or,
    in short, IA-32 and AMD64 did the right architecture for that.

    One technique for implementing lexical binding and functions as
    first-class objects involves generating code at run-time. Some people
    would immediately gasp and say "self-modifying code" as soon as I
    mentioned this, even though the two are quite different things.

    I think it's quite desirable that an architecture guarantees that an
    (to coin a phrase) "instruction view" versus a "data view" of the same
    memory location will never show different values.

    Sometimes, there is a difference between "nice to have" and "cost
    effective".

    It is nicer, say, to have I$/D$ coherence, and to not require explicit
    flushing and invalidation. Or, say, to have caches that are implicitly
    coherent between threads (Core A stores to a location, Core B loads
    from that location, Core B sees what Core A stored).


    The requirements to pull all this off in practice may add significant
    costs, and in ways where the performance cost of the coherence
    mechanisms tends to scale upwards as core counts increase.

    Say, for example, if one has coherent caches, software that depends on
    the cache-coherent behavior, and much more than 2 or 4 cores, it is
    not difficult to imagine scenarios where waiting on cache-coherence
    mechanisms becomes a more significant cost than actual memory-transfer
    bandwidth on the bus.


    Say, typical scenario with incoherent caches:
      Core A Requests Line (for Write);
      Core B Requests Line (also for Write);
      L2 Cache sends a copy to A;
      L2 Cache sends a copy to B.
    A and B now have incoherent copies.

    Versus Say:
      Core A Requests Line (for Write);
      Core B Requests Line (also for Write);
      L2 Cache sends a copy to A;
      L2 Cache rejects B's Request;
      L2 Cache sends a request to A to write line back;
      Core A writes line back (flushing it locally);
      (Maybe) L2 signals to Core B that the line is now available.
      Core B Requests Line again (retry);
      L2 Cache sends a copy to B.
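
    The two sequences above can be modeled with a toy simulation. This is
    only a sketch under assumptions that are not in the post: a single line
    holding a counter, each core incrementing it once, and a step count that
    simply mirrors the number of bullets in each list. The function names are
    made up for illustration.

```python
def incoherent(l2_value):
    """Both cores get a writable copy; the last write-back wins."""
    steps = 4                      # 2 requests + 2 copies sent, as listed above
    copy_a = l2_value              # L2 sends a copy to A
    copy_b = l2_value              # L2 sends a copy to B
    copy_a += 1                    # Core A writes its private copy
    copy_b += 1                    # Core B writes its private copy
    l2_value = copy_a              # A's eventual write-back
    l2_value = copy_b              # B's eventual write-back overwrites A's
    return l2_value, steps


def coherent(l2_value):
    """L2 rejects B until A has written the line back."""
    steps = 9                      # the nine bullets of the second sequence
    copy_a = l2_value              # L2 sends a copy to A and rejects B
    copy_a += 1                    # Core A writes
    l2_value = copy_a              # A writes back on L2's request
    copy_b = l2_value              # B retries and sees A's update
    copy_b += 1                    # Core B writes
    l2_value = copy_b              # B's eventual write-back
    return l2_value, steps


lost_value, cheap_steps = incoherent(0)
kept_value, costly_steps = coherent(0)
print(lost_value, cheap_steps)     # one increment is lost
print(kept_value, costly_steps)    # both increments survive, at more messages
```

    The point of the comparison is the trade: the incoherent case is cheaper
    in messages but silently drops an update when both cores write the line.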


    In my approach, I went with incoherent caches, but with a special
    Volatile mechanism for some cases, say:
      Core A requests a line for Volatile Write;
      Core B Requests Line (also for Volatile Write);
      L2 Cache sends a copy to A;
      L2 Cache ignores B's Request (it can cycle the ring some more);
        L2 cache can track volatile lines and see that it is in-use.
      Core A writes back line and flushes local copy;
        L2 cache then marks the volatile access as complete.
      L2 Cache sends a copy to B
        Via the original request cycling around and hitting L2 again
      Core B writes back line and flushes local copy;
        L2 cache then marks the volatile access as complete.

    Because volatile accesses flush the cached dirty lines immediately,
    there is a performance penalty, but these accesses can remain coherent
    (without the impact of trying to make all memory coherent).
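
    A rough model of the volatile mechanism, as a sketch only. The class and
    function names, the tick-based timing, and the `hold_ticks` parameter are
    assumptions made for illustration; the post only says that the L2 tracks
    in-use volatile lines and that an ignored request keeps cycling the ring.

```python
from collections import deque


class L2:
    def __init__(self):
        self.mem = {}        # line address -> value
        self.busy = set()    # lines currently out for a volatile access

    def request(self, addr):
        """Return the line value, or None if the line is in volatile use."""
        if addr in self.busy:
            return None
        self.busy.add(addr)
        return self.mem.get(addr, 0)

    def write_back(self, addr, value):
        self.mem[addr] = value
        self.busy.discard(addr)   # volatile access is now complete


def run(cores, hold_ticks=2, addr=0x100):
    l2 = L2()
    ring = deque(cores)      # pending requests circulating on the ring
    holder = None            # [core, value, ticks_left] for the current owner
    retries = 0
    while ring or holder:
        if ring:
            core = ring.popleft()
            value = l2.request(addr)
            if value is None:
                ring.append(core)        # ignored: goes around the ring again
                retries += 1
            else:
                holder = [core, value, hold_ticks]
        if holder:
            holder[2] -= 1
            if holder[2] == 0:           # owner writes back and flushes
                l2.write_back(addr, holder[1] + 1)
                holder = None
    return l2.mem[addr], retries


final_value, retries = run(["A", "B"])
print(final_value, retries)
```

    With two cores each doing one volatile increment, both updates survive,
    and the cost shows up as the second core's request going around the ring
    again while the line is in use.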


    For something like an inter-processor JIT, this would alas still
    require flushing the L1 caches in a way that is coordinated between
    threads.

    Normally, the mutex mechanism does not include I$ flushes, though one
    possibility could be to have, say, a separate JIT mutex lock, where if
    threads (upon trying to lock a mutex) see a JIT Sequence Number that
    does not match the expected value for that mutex on that processor
    core, it also triggers an I$ flush.

    Say:
      JIT Lock:
        Flush Caches;
        Lock Mutex;
        Increment JIT Sequence Number (JSN).
      Do stuff;
      Flush Caches;
      Unlock Mutex;
        Flush Caches;
        Set mutex to unlocked.





      Lock Mutex (Normal):
        Flush Caches;


    Huh? Mutex lock/unlock only need #LoadStore | #LoadLoad for acquire,
    and #LoadStore | #StoreStore for release. No #StoreLoad ordering.

        Lock Mutex;
        Check JSN against core's current JSN;
          If mismatch, flush I$ and update core's JSN.
          Likely all via CPUID and a lookup table, not new arch.
      Do Stuff;
      Unlock Mutex:
        ...
      ...
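
    The JSN idea can be sketched as a small model. Everything here is
    hypothetical naming (`Core`, `JitMutex`, `seen`), the data-cache flushes
    are left out, and treating the writer's "Flush Caches" on JIT unlock as
    also bringing the writer's own I$ up to date is an assumption, not
    something the post states.

```python
class Core:
    def __init__(self, name):
        self.name = name
        self.icache_flushes = 0

    def flush_icache(self):
        self.icache_flushes += 1


class JitMutex:
    def __init__(self):
        self.jsn = 0          # JIT Sequence Number (JSN) for this mutex
        self.seen = {}        # core name -> JSN that core last synchronized with
        self.owner = None

    def _acquire(self, core):
        assert self.owner is None, "toy model: no contention handling"
        self.owner = core

    def jit_lock(self, core):
        self._acquire(core)
        self.jsn += 1         # announce that code is about to change

    def jit_unlock(self, core):
        core.flush_icache()   # assumption: "Flush Caches" covers the writer's I$
        self.seen[core.name] = self.jsn
        self.owner = None

    def lock(self, core):
        self._acquire(core)
        if self.seen.get(core.name, 0) != self.jsn:
            core.flush_icache()            # stale code may be cached
            self.seen[core.name] = self.jsn

    def unlock(self, core):
        assert self.owner is core
        self.owner = None


a, b = Core("A"), Core("B")
m = JitMutex()
m.lock(b); m.unlock(b)          # nothing JITted yet: no I$ flush
m.jit_lock(a); m.jit_unlock(a)  # core A regenerates code
m.lock(b); m.unlock(b)          # JSN mismatch: B flushes its I$ once
m.lock(b); m.unlock(b)          # JSN matches again: no flush
print(a.icache_flushes, b.icache_flushes)
```

    The design choice being illustrated is that ordinary lock operations pay
    for an I$ flush only after some core has actually taken the JIT lock.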



  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri May 15 13:11:02 2026
    From Newsgroup: comp.arch

    Consider a GBOoO machine under sequential consistency, a LD which
    can have its address calculated early cannot leave the CPU area
    until all older stores currently in flight have left the CPU area.
    This would dramatically add to L1 cache miss latency, and would
    add moderately to L1 cache hit latency.

    Can't the GBOoO send the LD out early/speculatively, and do a kind of
    branch-recovery if that memory location is later modified in a way that
    changes what the LD should have received?

    Of course, that too comes with a cost (that of keeping track of all
    those memory accesses that may have to be re-done), but it's not obvious
    to me that it would necessarily be impractical.
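
    A toy model of that idea, as a sketch only: speculative loads are kept in
    program order until they retire, and an incoming invalidation that matches
    one of them squashes and replays that load and everything younger. The
    class name, the dictionary standing in for memory, and the replay counter
    are all illustrative assumptions.

```python
class SpecCore:
    def __init__(self, memory):
        self.memory = memory
        self.spec_loads = []     # [addr, value] in program order, not yet retired
        self.replays = 0

    def speculative_load(self, addr):
        self.spec_loads.append([addr, self.memory[addr]])

    def snoop_invalidate(self, addr):
        """A remote store hit a line that was read early: squash and replay."""
        for i, (load_addr, _) in enumerate(self.spec_loads):
            if load_addr == addr:
                squashed = self.spec_loads[i:]
                del self.spec_loads[i:]
                self.replays += 1
                for replay_addr, _ in squashed:
                    self.speculative_load(replay_addr)
                return

    def retire_all(self):
        values = [value for _, value in self.spec_loads]
        self.spec_loads.clear()
        return values


memory = {0x10: 1, 0x20: 2}
core = SpecCore(memory)
core.speculative_load(0x20)      # older load
core.speculative_load(0x10)      # younger load, sent out early
memory[0x10] = 99                # another core stores to the same location
core.snoop_invalidate(0x10)      # the coherence message reaches this core
retired = core.retire_all()
print(retired, core.replays)
```

    The bookkeeping cost mentioned above is the `spec_loads` list: every load
    that has executed but not retired has to stay visible to snoops.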


    === Stefan
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat May 16 05:57:47 2026
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Consider a GBOoO machine under sequential consistency, a LD which
    can have its address calculated early cannot leave the CPU area
    until all older stores currently in flight have left the CPU area.
    This would dramatically add to L1 cache miss latency, and would
    add moderately to L1 cache hit latency.

    Can't the GBOoO send the LD out early/speculatively, and do a kind of
    branch-recovery if that memory location is later modified in a way that
    changes what the LD should have received?

    Of course it can, although I would not call it "branch" recovery.

    The person you cited without attribution (to protect the guilty?)
    exhibits what I called the laziness of hardware designers: Instead of
    thinking how to implement sequential consistency efficiently, they
    think about rationalizations for not doing so.

    Of course, that too comes with a cost (that of keeping track of all
    those memory accesses that may have to be re-done), but it's not obvious
    to me that it would necessarily be impractical.

    Yes, the whole architectural state of the core would have to be reset.
    The major challenge for using the classical implementation of
    speculative execution (with register renaming, speculative store
    buffer, and reorder buffer) is the worst-case latency of inter-core
    communication. E.g., I see at
    <https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
    the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
    (multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
    newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
    range, and I expect that if an architecture provides sequential
    consistency, there are more incentives to bring that latency number
    down. OTOH, with multi-socket machines, the latency tends to be
    higher. Anyway, let's work with the 90ns number. That's about 500
    cycles at the higher Zen5 clock rates, and is 4000 potential
    instruction slots; the Zen5 ROB only has 448 entries, so one probably
    will not extend the ROB approach to deal with sequential consistency.
    A snapshot-and-recovery mechanism might work, based on epochs on the
    order of the maximum communication latency.
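
    The arithmetic behind those figures can be reproduced as follows. Only the
    90 ns latency and the 448-entry ROB come from the post; the 5.5 GHz clock
    and the 8 instruction slots per cycle are assumptions picked here because
    they match the quoted "about 500 cycles" and "4000 slots".

```python
latency_ns = 90          # inter-core latency used in the post
clock_ghz = 5.5          # assumed clock rate (cycles per nanosecond)
slots_per_cycle = 8      # assumed issue width
rob_entries = 448        # Zen5 ROB size quoted in the post

cycles = latency_ns * clock_ghz          # cycles spent waiting on one round trip
slots = cycles * slots_per_cycle         # instruction slots in that time
rob_multiple = slots / rob_entries       # how many ROBs' worth that is

print(cycles, slots, round(rob_multiple, 1))
```

    Under these assumptions the window that would have to be covered is
    between 8 and 9 times the ROB capacity, which is the gap the
    snapshot-and-recovery suggestion is meant to bridge.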

    Then we have to think about how to prevent (not mitigate) Spectre for
    such a mechanism; yes, hardware designers currently don't do anything
    about preventing Spectre, and they probably will not do anything if
    they ever implement sequential consistency, but I think they should,
    and so I also think that one needs a way to implement sequential
    consistency efficiently that can be combined with an efficient
    prevention of Spectre. Note how speculative side channel attacks were
    the final death sentence for TSX.

    Concerning performance costs, whenever a conflict is detected, one way
    of recovery would be to reset all cores to the architectural state of
    the last snapshot before the conflict happened. One can probably find
    less draconic ways to ensure consistency, but I consider them to be
    optimizations. One optimization might be to predict the conflict and
    hold back the corresponding load such that no conflict happens and no
    reset is necessary. Another might be to find out which cores
    communicate, and only reset those that have talked to each other since
    the snapshot.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 16 18:04:55 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Consider a GBOoO machine under sequential consistency, a LD which
    can have its address calculated early cannot leave the CPU area
    until all older stores currently in flight have left the CPU area.
    This would dramatically add to L1 cache miss latency, and would
    add moderately to L1 cache hit latency.

    Can't the GBOoO send the LD out early/speculatively, and do a kind of
    branch-recovery if that memory location is later modified in a way that
    changes what the LD should have received?

    Of course it can, although I would not call it "branch" recovery.

    The person you cited without attribution (to protect the guilty?)
    exhibits what I called the laziness of hardware designers: Instead of
    thinking how to implement sequential consistency efficiently, they
    think about rationalizations for not doing so.

    Of course, that too comes with a cost (that of keeping track of all
    those memory accesses that may have to be re-done), but it's not obvious
    to me that it would necessarily be impractical.

    Consider the case where the speculative LD is interfering with another
    CPU's ATOMIC LL/SC sequence, grabbing write permission, and sending the
    SC off as a failure. How does one recover that ??

    So, yes, you can recover this CPU's state, but no, you cannot precisely
    recover the other CPU's state.

    Yes, the whole architectural state of the core would have to be reset.
    The major challenge for using the classical implementation of
    speculative execution (with register renaming, speculative store
    buffer, and reorder buffer) is the worst-case latency of inter-core
    communication. E.g., I see at
    <https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
    the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
    (multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
    newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
    range, and I expect that if an architecture provides sequential
    consistency, there are more incentives to bring that latency number
    down. OTOH, with multi-socket machines, the latency tends to be
    higher. Anyway, let's work with the 90ns number. That's about 500
    cycles at the higher Zen5 clock rates, and is 4000 potential
    instruction slots; the Zen5 ROB only has 448 entries, so one probably
    will not extend the ROB approach to deal with sequential consistency.
    A snapshot-and-recovery mechanism might work, based on epochs on the
    order of the maximum communication latency.

    And that only recovers the state, not the intent of the state (above).

    Then we have to think about how to prevent (not mitigate) Spectre for
    such a mechanism; yes, hardware designers currently don't do anything
    about preventing Spectre, and they probably will not do anything if
    they ever implement sequential consistency, but I think they should,
    and so I also think that one needs a way to implement sequential
    consistency efficiently that can be combined with an efficient
    prevention of Spectre. Note how speculative side channel attacks were
    the final death sentence for TSX.

    Given that ST to LD ordering is an inherent part of SC, a SC machine
    will not be able to use as large an execution window as a Casually
    Consistent machine.

    Concerning performance costs, whenever a conflict is detected, one way
    of recovery would be to reset all cores to the architectural state of
    the last snapshot before the conflict happened.

    Just broadcasting that it needs to be recovered on a multi-chip
    multi-processor is going to take on the order of 1000 instructions
    (200 ns).

    One can probably find
    less draconic ways to ensure consistency, but I consider them to be
    optimizations. One optimization might be to predict the conflict and
    hold back the corresponding load such that no conflict happens and no
    reset is necessary.

    Hardwiring ST-to-LD 'pin' ordering is, in effect, Sequential Consistency.

    Another might be to find out which cores
    communicate, and only reset those that have talked to each other since
    the snapshot.

    Given 256 cores across 4-chips, this represents 256,000 instructions
    of recovery buffering. ... So, I doubt this is practicable even if
    feasible.
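
    One reading of those numbers, as a quick check. The assumption, which the
    post does not spell out, is that each core has to buffer roughly as many
    instructions as it executes during one 200 ns recovery broadcast.

```python
broadcast_ns = 200                   # broadcast time quoted in the post
instructions_per_broadcast = 1000    # instructions per core in that time (post)
cores = 256                          # 256 cores across 4 chips (post)

# Implied sustained rate per core, in instructions per nanosecond.
rate_per_ns = instructions_per_broadcast / broadcast_ns

# Total recovery buffering if every core must cover one broadcast interval.
total_buffered = cores * instructions_per_broadcast

print(rate_per_ns, total_buffered)
```

    That product is where the 256,000-instruction figure comes from under
    this reading.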


    - anton
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat May 16 20:50:49 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Consider a GBOoO machine under sequential consistency, a LD which
    can have its address calculated early cannot leave the CPU area
    until all older stores currently in flight have left the CPU area.
    This would dramatically add to L1 cache miss latency, and would
    add moderately to L1 cache hit latency.

    Can't the GBOoO send the LD out early/speculatively, and do a kind of
    branch-recovery if that memory location is later modified in a way that
    changes what the LD should have received?

    Of course it can, although I would not call it "branch" recovery.

    The person you cited without attribution (to protect the guilty?)
    exhibits what I called the laziness of hardware designers: Instead of
    thinking how to implement sequential consistency efficiently, they
    think about rationalizations for not doing so.

    Of course, that too comes with a cost (that of keeping track of all
    those memory accesses that may have to be re-done), but it's not obvious
    to me that it would necessarily be impractical.

    Consider the case where the speculative LD is interfering with another
    CPU's ATOMIC LL/SC sequence, grabbing write permission, and sending the
    SC off as a failure. How does one recover that ??

    So, yes, you can recover this CPU's state, but no, you cannot precisely
    recover the other CPU's state.

    Yes, the whole architectural state of the core would have to be reset.
    The major challenge for using the classical implementation of
    speculative execution (with register renaming, speculative store
    buffer, and reorder buffer) is the worst-case latency of inter-core
    communication. E.g., I see at
    <https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
    the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
    (multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
    newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
    range, and I expect that if an architecture provides sequential
    consistency, there are more incentives to bring that latency number
    down. OTOH, with multi-socket machines, the latency tends to be
    higher. Anyway, let's work with the 90ns number. That's about 500
    cycles at the higher Zen5 clock rates, and is 4000 potential
    instruction slots; the Zen5 ROB only has 448 entries, so one probably
    will not extend the ROB approach to deal with sequential consistency.
    A snapshot-and-recovery mechanism might work, based on epochs on the
    order of the maximum communication latency.

    And that only recovers the state, not the intent of the state (above).

    Then we have to think about how to prevent (not mitigate) Spectre for
    such a mechanism; yes, hardware designers currently don't do anything
    about preventing Spectre, and they probably will not do anything if
    they ever implement sequential consistency, but I think they should,
    and so I also think that one needs a way to implement sequential
    consistency efficiently that can be combined with an efficient
    prevention of Spectre. Note how speculative side channel attacks were
    the final death sentence for TSX.

    Given that ST to LD ordering is an inherent part of SC, a SC machine
    will not be able to use as large an execution window as a Casually
    Consistent machine.

    Casually -> Causally ?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 16 23:01:00 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Consider a GBOoO machine under sequential consistency, a LD which
    can have its address calculated early cannot leave the CPU area
    until all older stores currently in flight have left the CPU area.
    This would dramatically add to L1 cache miss latency, and would
    add moderately to L1 cache hit latency.

    Can't the GBOoO send the LD out early/speculatively, and do a kind of
    branch-recovery if that memory location is later modified in a way that
    changes what the LD should have received?

    Of course it can, although I would not call it "branch" recovery.

    The person you cited without attribution (to protect the guilty?)
    exhibits what I called the laziness of hardware designers: Instead of
    thinking how to implement sequential consistency efficiently, they
    think about rationalizations for not doing so.

    Of course, that too comes with a cost (that of keeping track of all
    those memory accesses that may have to be re-done), but it's not obvious
    to me that it would necessarily be impractical.

    Consider the case where the speculative LD is interfering with another
    CPU's ATOMIC LL/SC sequence, grabbing write permission, and sending the
    SC off as a failure. How does one recover that ??

    So, yes, you can recover this CPU's state, but no, you cannot precisely
    recover the other CPU's state.

    Yes, the whole architectural state of the core would have to be reset.
    The major challenge for using the classical implementation of
    speculative execution (with register renaming, speculative store
    buffer, and reorder buffer) is the worst-case latency of inter-core
    communication. E.g., I see at
    <https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
    the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
    (multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
    newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
    range, and I expect that if an architecture provides sequential
    consistency, there are more incentives to bring that latency number
    down. OTOH, with multi-socket machines, the latency tends to be
    higher. Anyway, let's work with the 90ns number. That's about 500
    cycles at the higher Zen5 clock rates, and is 4000 potential
    instruction slots; the Zen5 ROB only has 448 entries, so one probably
    will not extend the ROB approach to deal with sequential consistency.
    A snapshot-and-recovery mechanism might work, based on epochs on the
    order of the maximum communication latency.

    And that only recovers the state, not the intent of the state (above).

    Then we have to think about how to prevent (not mitigate) Spectre for
    such a mechanism; yes, hardware designers currently don't do anything
    about preventing Spectre, and they probably will not do anything if
    they ever implement sequential consistency, but I think they should,
    and so I also think that one needs a way to implement sequential
    consistency efficiently that can be combined with an efficient
    prevention of Spectre. Note how speculative side channel attacks were
    the final death sentence for TSX.

    Given that ST to LD ordering is an inherent part of SC, a SC machine
    will not be able to use as large an execution window as a Casually Consistent machine.

    Casually -> Causally ?

    Friggen spelling corrector.....

    Terje


  • From BGB@cr88192@gmail.com to comp.arch on Sun May 17 12:16:04 2026
    From Newsgroup: comp.arch

    On 5/15/2026 4:13 PM, Chris M. Thomasson wrote:
    On 5/14/2026 10:22 AM, BGB wrote:
    On 5/13/2026 2:02 AM, Lawrence D'Oliveiro wrote:
    On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

    There are "architectures" like Power where "data memory" and
    "instruction memory" are not coherent, even when they are the same
    memory.

    Also the Motorola 68040.

    Upon updating instructions (e.g., from a JIT compiler), they require
    that the modifying thread(s) write the lines back from the data
    cache to a shared cache or main memory, and that the executing
    threads invalidate these cache lines and flush their pipeline. I
    think that that's a bad idea, not just because it exposes
    microarchitectural concepts like cache and pipeline to the
    architecture, and leads to unpredictable results in some usage
    scenarios (see my signature), but also because the requirements on
    the executing threads are extremely difficult to meet if the
    executing threads run independently of the modifying thread(s). Or,
    in short, IA-32 and AMD64 did the right architecture for that.

    One technique for implementing lexical binding and functions as
    first-class objects involves generating code at run-time. Some people
    would immediately gasp and say "self-modifying code" as soon as I
    mentioned this, even though the two are quite different things.

    I think it's quite desirable that an architecture guarantees that an
    (to coin a phrase) "instruction view" versus a "data view" of the same
    memory location will never show different values.

    Sometimes, there is a difference between "nice to have", vs "cost
    effective".

    It is nicer, say, to have I$/D$ coherence, and to not require explicit
    flushing and invalidation. Or, say, to have caches that are implicitly
    coherent between threads (Core A stores to a location, Core B loads
    from that location, Core B sees what Core A stored).


    The requirements to pull all this off in practice may add significant
    costs; and also in ways where the performance cost of the coherence
    mechanisms tend to scale upwards as core counts increase.

    Say, for example, if one has coherent caches, software that depends on
    the cache-coherent behavior, and much more than 2 or 4 cores, it is
    not difficult to imagine scenarios where waiting on cache-coherence
    mechanisms becomes a more significant cost than actual memory-transfer
    bandwidth on the bus.


    Say, typical scenario with incoherent caches:
      Core A Requests Line (for Write);
      Core B Requests Line (also for Write);
      L2 Cache sends a copy to A;
      L2 Cache sends a copy to B.
    A and B now have incoherent copies.

    Versus Say:
      Core A Requests Line (for Write);
      Core B Requests Line (also for Write);
      L2 Cache sends a copy to A;
      L2 Cache rejects B's Request;
      L2 Cache sends a request to A to write line back;
      Core A writes line back (flushing it locally);
      (Maybe) L2 signals to Core B that the line is now available.
      Core B Requests Line again (retry);
      L2 Cache sends a copy to B.


    In my approach, I went with incoherent caches, but with a special
    Volatile mechanism for some cases, say:
      Core A requests a line for Volatile Write;
      Core B Requests Line (also for Volatile Write);
      L2 Cache sends a copy to A;
      L2 Cache ignores B's Request (it can cycle the ring some more);
        L2 cache can track volatile lines and see that it is in-use.
      Core A writes back line and flushes local copy;
        L2 cache then marks the volatile access as complete.
      L2 Cache sends a copy to B
        Via the original request cycling around and hitting L2 again
      Core B writes back line and flushes local copy;
        L2 cache then marks the volatile access as complete.

    Because volatile accesses flush the cached dirty lines immediately,
    this means that there is a performance penalty, but these accesses can
    remain coherent (but without the impact of trying to make all memory
    coherent).


    For something like an inter-processor JIT, this would alas still
    require flushing the L1 caches in a way that is coordinated between
    threads.

    Normally, the mutex mechanism does not include I$ flushes, though one
    possibility could be to have, say, a separate JIT mutex lock, where if
    threads (upon trying to lock a mutex) see a JIT Sequence Number that
    does not match the expected value for that mutex on that processor
    core, it also triggers an I$ flush.

    Say:
      JIT Lock:
        Flush Caches;
        Lock Mutex;
        Increment JIT Sequence Number (JSN).
      Do stuff;
      Flush Caches;
      Unlock Mutex;
        Flush Caches;
        Set mutex to unlocked.





      Lock Mutex (Normal):
        Flush Caches;


    Huh? Mutex lock/unlock only need #LoadStore | #LoadLoad for acquire,
    and #LoadStore | #StoreStore for release. No #StoreLoad ordering.


    Cache Flushing on Mutex Lock:
    Anything that was in-memory is now written back;
    Cache is ready to accept new (non-stale) data.

    Cache Flush on Mutex Unlock:
    Anything dirty in cache during time mutex was held is now written back;
    ...

    This causes mutex lock/unlock to become a sort of memory ordering event.


    It is sort of needed for a weak model to work for multi-core
    multi-threading and not just end up exploding (and some practices will
    still not work as they would on a core with stronger memory ordering and
    cache coherence).

    The flushing can be skipped, though, in cases where a mutex is only ever
    used from a single core (since memory is coherent within a core).
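
    The effect of flushing on lock and unlock can be shown with a small model
    of two cores with private, incoherent write-back caches. This is a sketch
    under stated assumptions: `WeakCore`, `locked_increment`, and the `flush`
    switch are illustrative names, the mutex itself is not modeled (the two
    critical sections simply run one after the other), and "flush" here means
    write back dirty lines and then drop everything.

```python
class WeakCore:
    def __init__(self, mem):
        self.mem = mem
        self.cache = {}      # addr -> cached value
        self.dirty = set()   # addrs modified since the last flush

    def load(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.mem.get(addr, 0)
        return self.cache[addr]

    def store(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def flush(self):
        for addr in self.dirty:          # write back what this core changed
            self.mem[addr] = self.cache[addr]
        self.dirty.clear()
        self.cache.clear()               # drop possibly stale lines


def locked_increment(core, addr, flush):
    if flush:
        core.flush()                     # on lock: start from fresh data
    core.store(addr, core.load(addr) + 1)
    if flush:
        core.flush()                     # on unlock: publish the update


def run(flush):
    mem = {0: 0}
    a, b = WeakCore(mem), WeakCore(mem)
    a.load(0)
    b.load(0)                            # both cores start with the line cached
    locked_increment(a, 0, flush)
    locked_increment(b, 0, flush)
    a.flush()
    b.flush()                            # final write-back so memory can be read
    return mem[0]


without_flush = run(flush=False)
with_flush = run(flush=True)
print(without_flush, with_flush)
```

    Without the flushes, the second core increments a stale copy and one
    update is lost even though the critical sections never overlap; with
    them, lock/unlock acts as the ordering event described above.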


        Lock Mutex;
        Check JSN against core's current JSN;
          If mismatch, flush I$ and update core's JSN.
          Likely all via CPUID and a lookup table, not new arch.
      Do Stuff;
      Unlock Mutex:
        ...
      ...




  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 17 22:08:08 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 5/15/2026 4:13 PM, Chris M. Thomasson wrote:
    On 5/14/2026 10:22 AM, BGB wrote:
    On 5/13/2026 2:02 AM, Lawrence D'Oliveiro wrote:
    On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:
    -------------------Why can't BGB clip unnecessary lines in the thread ???
    Cache Flushing on Mutex Lock:
    Anything that was in-memory is now written back;
    Cache is ready to accept new (non-stale) data.

    Cache Flush on Mutex Unlock:
    Anything dirty in cache during time mutex was held is now written back;
    ...

    This causes mutex lock/unlock to become a sort of memory ordering event.

    This is the means by which My 66000 presents all participating cache
    lines in the {as before or as after} state at a single instant to the
    rest of the system. Here, the trigger is the STL (equivalent to SC) and
    a check for no deleterious interference.

    It is sort of needed for a weak model to work for multi-core
    multi-threading and not just end up exploding (and some practices will
    still not work as they would on a core with stronger memory ordering and cache coherence).

    At the start of an ATOMIC event My 66000 reverts to sequential consistency.
    At the termination My 66000 reverts to causal consistency.
