• Accepting the Sense of Some of Mitch Alsup's Advice

    From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Dec 18 19:10:15 2025
    From Newsgroup: comp.arch

    Given the great popularity of the RISC architecture, I assumed that one of
    its characteristics, instructions that are all 32 bits in length, produced
    a great increase in efficiency over variable-length instructions.
    Therefore, I came up with the idea of using some opcode space for block headers which could contain information about the lengths of instructions,
    so as to make decoding variable-length instructions fully non-serialized,
    thus giving me the best of both worlds.
    However, this involved overhead, and the headers would themselves take
    time to decode. In any event, all the schemes I came up with were also elaborate and overly complicated.
    But I have finally realized what I think is the decisive reason why I had
    been mistaken.
    Before modern pipelined computers, which have multi-stage pipelines for instruction _execution_, a simple form of pipelining was very common -
    usually in the form of a three-stage fetch, decode, and execute pipeline.
    Since the decoding of instructions can be so neatly separated from their execution, and thus performed well in advance of it, any overhead
    associated with variable-length instructions becomes irrelevant because it essentially takes place very nearly completely in parallel to execution.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 18 21:29:00 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    Given the great popularity of the RISC architecture, I assumed that one of its characteristics, instructions that are all 32 bits in length, produced
    a great increase in efficiency over variable-length instructions.

    Yes, and no:: it depends on what you are attributing "efficient" to.

    Therefore, I came up with the idea of using some opcode space for block headers which could contain information about the lengths of instructions, so as to make decoding variable-length instructions fully non-serialized, thus giving me the best of both worlds.

    I put the VLE instruction bits in a place where I get the benefits of VLE without the detriments of VLE wrt parallel decode. With my current scheme
    I can parse 16+ instructions in a 16-gates-of-delay cycle.

    However, this involved overhead, and the headers would themselves take
    time to decode. In any event, all the schemes I came up with were also elaborate and overly complicated.

    As I warned...

    But I have finally realized what I think is the decisive reason why I had been mistaken.

    At Last ?!?

    Before modern pipelined computers, which have multi-stage pipelines for instruction _execution_, a simple form of pipelining was very common - usually in the form of a three-stage fetch, decode, and execute pipeline.

    Since the decoding of instructions can be so neatly separated from their execution, and thus performed well in advance of it, any overhead
    associated with variable-length instructions becomes irrelevant because it essentially takes place very nearly completely in parallel to execution.

    Or in other words, if you can decode K-instructions per cycle, you'd better
    be able to execute K-instructions per cycle--or you have a serious blockage
    in your pipeline.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Dec 18 22:25:08 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> writes:
    Given the great popularity of the RISC architecture, I assumed that one of
    its characteristics, instructions that are all 32 bits in length, produced
    a great increase in efficiency over variable-length instructions.

    Some RISCs have that, some RISCs have two instruction lengths: 16 bits
    and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
    but then eliminated in Power), one variant of Berkeley RISC, ARM T32,
    RISC-V with the C extension, and probably others.

    Before modern pipelined computers, which have multi-stage pipelines for
    instruction _execution_, a simple form of pipelining was very common -
    usually in the form of a three-stage fetch, decode, and execute pipeline.
    Since the decoding of instructions can be so neatly separated from their
    execution, and thus performed well in advance of it, any overhead
    associated with variable-length instructions becomes irrelevant because it
    essentially takes place very nearly completely in parallel to execution.

    It is certainly possible to decode potential instructions at every
    starting position in parallel, and later select the ones that actually correspond to the end of the previous instruction, but with 16-bit and
    32-bit instructions this potentially doubles the amount of instruction
    decoders necessary, plus the circuit for selecting the ones that are
    at actual instruction starts. I guess that this is the reason why ARM
    uses an uop cache in cores that can execute ARM T32. The fact that
    more recent ARM A64-only cores have often no uop cache while their
    A64+T32 predecessors have had one reinforces this idea.
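
    As a concrete illustration of decode-everywhere-then-select (a sketch
    in C only; the toy 16/32 length rule below, where low bits 11 mean a
    32-bit instruction, roughly follows the RISC-V C convention and is an
    assumption, not any particular core's logic):

        #include <stdbool.h>
        #include <stdint.h>

        #define PARCELS 8                /* 16-bit parcels per fetch block */

        typedef struct { int len; /* in parcels, plus decoded fields */ } Decoded;

        /* Speculatively decode as if an instruction started at parcel i. */
        static Decoded decode_at(const uint16_t p[], int i) {
            Decoded d;
            d.len = ((p[i] & 3) == 3) ? 2 : 1;   /* low bits 11 => 32-bit */
            return d;
        }

        static int select_starts(const uint16_t p[PARCELS],
                                 Decoded out[PARCELS], bool is_start[PARCELS]) {
            Decoded all[PARCELS];
            for (int i = 0; i < PARCELS; i++)    /* one decoder per position, */
                all[i] = decode_at(p, i);        /* all in parallel in HW     */
            for (int i = 0; i < PARCELS; i++)
                is_start[i] = false;
            int n = 0;
            for (int i = 0; i < PARCELS; ) {     /* select: serial in C, a    */
                is_start[i] = true;              /* short lookahead chain in  */
                out[n++] = all[i];               /* hardware                  */
                i += all[i].len;
            }
            return n;                            /* instructions found        */
        }

    The doubled decoder count shows up directly: decode_at runs at all 8
    parcel positions even though only 4 to 8 of them are real starts.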

    OTOH, on AMD64/IA-32 Intel's recent E-Cores do not use an uop cache
    either, but instead the most recent instances have 3 decoders each of
    which can decode 3 instructions per cycle (i.e., they attempt to
    decode at many more positions and then select 3 per cycle out of
    those); so apparently even byte-oriented variable-length encoding can
    be decoded quickly enough.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Dec 18 23:08:00 2025
    From Newsgroup: comp.arch

    On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:

    It is certainly possible to decode potential instructions at every
    starting position in parallel, and later select the ones that actually correspond to the end of the previous instruction,

    Oh, yes, I had always realized that, but dismissed it as far too wasteful.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Dec 18 23:18:45 2025
    From Newsgroup: comp.arch

    On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
    John Savard <quadibloc@invalid.invalid> posted:

    However, this involved overhead, and the headers would themselves take
    time to decode. In any event, all the schemes I came up with were also
    elaborate and overly complicated.

    As I warned...

    I wasn't blind - of course I knew that all along. But what I still
    failed to see was any good alternative.

    But I have finally realized what I think is the decisive reason why I
    had been mistaken.

    At Last ?!?

    On further reflection, I think I had realized before that decoding is done
    ahead of execution, and thus can be thought of as mostly done in parallel
    with it. But decoding still has to be done *first*, before execution can
    start. So I felt that in a design where super-aggressive pipelining or
    vectorization allows many instructions to be done in parallel, if decoding
    is necessarily serial, it could still become a bottleneck.

    Before modern pipelined computers, which have multi-stage pipelines for
    instruction _execution_, a simple form of pipelining was very common -
    usually in the form of a three-stage fetch, decode, and execute
    pipeline.

    Since the decoding of instructions can be so neatly separated from
    their execution, and thus performed well in advance of it, any overhead
    associated with variable-length instructions becomes irrelevant because
    it essentially takes place very nearly completely in parallel to
    execution.

    Or in other words, if you can decode K-instructions per cycle, you'd
    better be able to execute K-instructions per cycle--or you have a
    serious blockage in your pipeline.

    No.

    If you flipped "decode" and "execute" in that sentence above, I would 100% agree. And maybe this _is_ just a typo.

    But if you actually did mean that sentence exactly as written, I would disagree. This is why: I regard executing instructions as 'doing the
    actual work' and decoding instructions as... some unfortunate trivial
    overhead that can't be avoided.

    Hence, if I can decode instructions much faster than I can execute them... _possibly_ the decoder is overdesigned, but it's also perfectly possible
    that there isn't really a slower decoder design that would make sense.

    And maybe this perspective _explains_ why I dabbled in elaborate schemes
    to allow decoding in parallel. I absolutely refused to allow decoding to become a bottleneck, no matter how aggressively OoO the execution part is designed for breakneck speed at all costs.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 18 23:54:34 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Savard <quadibloc@invalid.invalid> writes:
    Given the great popularity of the RISC architecture, I assumed that one of
    its characteristics, instructions that are all 32 bits in length, produced
    a great increase in efficiency over variable-length instructions.

    Some RISCs have that, some RISCs have two instruction lengths: 16 bits
    and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
    but then eliminated in Power), one variant of Berkeley RISC, ARM T32,
    RISC-V with the C extension, and probably others.

    But a computer with 16-bit and 32-bit instructions HAS Variable
    Length Instructions !!! with all the difficulties involved, and
    without solving the part of the ISA that adds 15% to instruction
    count. And in particular, RISC-V has to decide if the instruction
    is 16-bits or 32-bits before it multiplexes out the register
    specifier fields to the RF decoder.
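
    To see the muxing problem concretely (a sketch, using the standard
    RISC-V field positions, with C.ADD as the 16-bit example):

        #include <stdint.h>

        /* The rs1 specifier sits at different bit positions in the two
           sizes, so the length decision gates the register-file index. */
        static int rs1_field(uint32_t insn) {
            if ((insn & 3) != 3)                /* 16-bit 'C' encoding  */
                return (insn >> 7) & 0x1F;      /* C.ADD: rd/rs1 [11:7] */
            return (insn >> 15) & 0x1F;         /* 32-bit: rs1 [19:15]  */
        }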

    Instruction Fusion is significantly harder than VLI parsing ...

    Before modern pipelined computers, which have multi-stage pipelines for
    instruction _execution_, a simple form of pipelining was very common -
    usually in the form of a three-stage fetch, decode, and execute pipeline.
    Since the decoding of instructions can be so neatly separated from their
    execution, and thus performed well in advance of it, any overhead
    associated with variable-length instructions becomes irrelevant because it
    essentially takes place very nearly completely in parallel to execution.

    It is certainly possible to decode potential instructions at every
    starting position in parallel, and later select the ones that actually correspond to the end of the previous instruction, but with 16-bit and
    32-bit instructions this potentially doubles the amount of instruction decoders necessary, plus the circuit for selecting the ones that are
    at actual instruction starts.

    None of the recent dies I have looked at has a decoder big enough to
    SEE !!--including x86s. Although the decoder is too small to see, it
    IS, however, a big enough power drain to cause liquid crystals to blow
    up like a fountain (volcano is another appropriate word). Fast wide
    x86 decode requires an 8-bit decoder on every instruction byte {mostly
    1:256 decoders with pattern recognizers at the output}

    Yes, one can see the predictors and the Instruction Queue (prior to decode),
    but the decoders themselves are minuscule.

    I guess that this is the reason why ARM
    uses an uop cache in cores that can execute ARM T32. The fact that
    more recent ARM A64-only cores have often no uop cache while their
    A64+T32 predecessors have had one reinforces this idea.

    OTOH, on AMD64/IA-32 Intel's recent E-Cores do not use an uop cache
    either, but instead the most recent instances have 3 decoders each of
    which can decode 3 instructions per cycle (i.e., they attempt to
    decode at many more positions and then select 3 per cycle out of
    those); so apparently even byte-oriented variable-length encoding can
    be decoded quickly enough.

    If you give the pipeline enough cycles, every means of decoding "treeifies"
    sooner or later--or one can architect the ISA so that it "treeifies"
    early in the decode pipeline and saves a few cycles.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Dec 19 03:30:26 2025
    From Newsgroup: comp.arch

    According to John Savard <quadibloc@invalid.invalid>:
    Therefore, I came up with the idea of using some opcode space for block
    headers which could contain information about the lengths of instructions,
    so as to make decoding variable-length instructions fully non-serialized,
    thus giving me the best of both worlds.

    Sounds like the first two bits of the opcode in S/360, which tell you the
    instruction format, which in turn tells you how long the instruction is.

    They've added lots of new instructions since then with somewhat
    different formats, but those bits still tell you how long the
    instruction is. The first byte tells you what the format is so you
    know what address calculations to do.
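
    For reference, the classic rule fits in a few lines (a sketch; in the
    base S/360 architecture, bits 0-1 of the first opcode byte fully
    determine the length):

        #include <stdint.h>

        /* S/360: 00 -> 2 bytes (RR), 01/10 -> 4 bytes (RX/RS/SI),
           11 -> 6 bytes (SS). */
        static int s360_insn_len(uint8_t opcode) {
            switch (opcode >> 6) {
            case 0:  return 2;
            case 1:
            case 2:  return 4;
            default: return 6;
            }
        }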
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Dec 19 12:36:07 2025
    From Newsgroup: comp.arch

    On Fri, 19 Dec 2025 03:30:26 -0000 (UTC)
    John Levine <johnl@taugh.com> wrote:

    According to John Savard <quadibloc@invalid.invalid>:
    Therefore, I came up with the idea of using some opcode space for
    block headers which could contain information about the lengths of
    instructions, so as to make decoding variable-length instructions
    fully non-serialized, thus giving me the best of both worlds.

    Sounds like the first two bits of the opcode in S/360, which tell you
    the instruction format, which in turn tells you how long the instruction
    is.

    They've added lots of new instructions since then with somewhat
    different formats, but those bits still tell you how long the
    instruction is. The first byte tells you what the format is so you
    know what address calculations to do.



    With the very long pipelines that IBM has been using starting from z10
    (17 years ago), it probably makes no difference.
    The fact that there are only 3 options for instruction length is
    important and simplifies things relative to the more than a dozen
    options in x86, but how many bits one has to access in order to
    determine the length of an instruction is irrelevant, or close to
    irrelevant, as long as they all reside near the beginning rather than
    anywhere, like in the VAX.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Dec 19 17:41:36 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> writes:
    On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
    Or in other words, if you can decode K-instructions per cycle, you'd
    better be able to execute K-instructions per cycle--or you have a
    serious blockage in your pipeline.

    No.

    If you flipped "decode" and "execute" in that sentence above, I would 100%
    agree. And maybe this _is_ just a typo.

    But if you actually did mean that sentence exactly as written, I would
    disagree. This is why: I regard executing instructions as 'doing the
    actual work' and decoding instructions as... some unfortunate trivial
    overhead that can't be avoided.

    It does not matter what "the actual work" is and what isn't. What
    matters is how expensive it is to make a particular part wider, and
    how paying that cost benefits the IPC. At every step you add width to
    the part with the best benefit/cost ratio.

    And looking at recent cores, we see that, e.g., Skymont can decode
    3x3=9 instructions per cycle, rename 8 per cycle, has 26 ports to
    functional units (i.e., can execute 26 uops in one cycle); I don't
    know how many instructions it can retire per cycle, but I expect that
    it is more than 8 per cycle.

    So the renamer is the bottleneck, and that's also the idea behind
    top-down microarchitecture analysis (TMA) for determining how software
    interacts with the microarchitecture. That idea comes out of
    Intel, but if Intel is finding it hard to make the renamer wider rather
    than other parts, I expect that the rest of the industry also
    finds that hard (especially for architectures where decoding is
    cheaper, and where, looking at ARM A64, instructions with more
    demands on the renamer exist).

    Concerning the question of what is doing "the actual work": it's
    obviously committing the instruction in the ROB. Up to that point,
    the instruction is speculative; only with the commit does it become
    architectural.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 19 18:53:18 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Fri, 19 Dec 2025 03:30:26 -0000 (UTC)
    John Levine <johnl@taugh.com> wrote:

    According to John Savard <quadibloc@invalid.invalid>:
    Therefore, I came up with the idea of using some opcode space for
    block headers which could contain information about the lengths of
    instructions, so as to make decoding variable-length instructions
    fully non-serialized, thus giving me the best of both worlds.

    Sounds like the first two bits of the opcode in S/360, which tell you
    the instruction format, which in turn tells you how long the instruction
    is.

    They've added lots of new instructions since then with somewhat
    different formats, but those bits still tell you how long the
    instruction is. The first byte tells you what the format is so you
    know what address calculations to do.



    With the very long pipelines that IBM has been using starting from z10
    (17 years ago), it probably makes no difference.
    The fact that there are only 3 options for instruction length is
    important and simplifies things relative to the more than a dozen
    options in x86, but how many bits one has to access in order to
    determine the length of an instruction is irrelevant, or close to
    irrelevant, as long as they all reside near the beginning rather than
    anywhere, like in the VAX.

    My VLI stuff is all encoded in the first word (32-bits) of the instruction.
    And (now) all extensions come in 32-bit quanta, and it's all encoded in
    4 bits.
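
    To illustrate why keeping the length bits in the first word permits
    wide, shallow parsing, here is a software model of log-depth boundary
    finding (a sketch only: the field position, the 1..5-word range, and
    the pointer-doubling formulation are illustrative assumptions, not the
    actual My 66000 encoding):

        #include <stdbool.h>
        #include <stdint.h>

        #define N 16                     /* 32-bit words per fetch block */

        /* Mark every word that begins an instruction, given that word 0
           does. next[i] = where the following instruction would start if
           one started at i; pointer doubling resolves all starts in
           log2(N) = 4 rounds instead of a serial walk -- the software
           analogue of a few gate levels. */
        void find_starts(const uint32_t w[N], bool start[N]) {
            int next[N + 1], next2[N + 1];
            bool reach[N + 1] = { false }, reach2[N + 1];
            for (int i = 0; i < N; i++) {
                int len = (w[i] >> 28) & 0xF;    /* assumed length field */
                if (len < 1 || len > 5) len = 1; /* sizes: 32..160 bits  */
                next[i] = (i + len > N) ? N : i + len;
            }
            next[N] = N;
            reach[0] = true;                     /* block starts on an insn */
            for (int round = 0; round < 4; round++) {
                for (int i = 0; i <= N; i++) {
                    reach2[i] = reach[i];
                    next2[i] = next[next[i]];
                }
                for (int i = 0; i <= N; i++)
                    if (reach[i]) reach2[next[i]] = true;
                for (int i = 0; i <= N; i++) {
                    reach[i] = reach2[i];
                    next[i] = next2[i];
                }
            }
            for (int i = 0; i < N; i++) start[i] = reach[i];
        }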
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Dec 19 15:07:38 2025
    From Newsgroup: comp.arch

    On 12/18/2025 4:25 PM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    Given the great popularity of the RISC architecture, I assumed that one of
    its characteristics, instructions that are all 32 bits in length, produced
    a great increase in efficiency over variable-length instructions.

    Some RISCs have that, some RISCs have two instruction lengths: 16 bits
    and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
    but then eliminated in Power), one variant of Berkeley RISC, ARM T32,
    RISC-V with the C extension, and probably others.


    I have come to realize that 32/64 is probably better than 16/32 here, primarily in terms of performance, but also helps with code-density (a
    pure 32/64 encoding scheme can beat 16/32 in terms of code-density
    despite only having larger instructions available).

    One could argue "But MOV is less space efficient", but one can note that it also
    makes sense to try to design the compiler to minimize the number of unnecessary MOV instructions and similar (and when using the minimal
    number of register moves, the lack of a small MOV encoding has less
    effect on code density).


    16/32/64 is also sensible, but the existence of 16-bit ops negatively
    affects encoding space (it is more of a strain to have both 16-bit ops
    and 6-bit register fields; but at least some code can benefit from
    having 64 GPRs).

    So, say:
    16/32: RV64GC (OK code density)
    16/32/64: RV64GC+JX: Better code density than RV64GC.
    32/64: RV64G+JX (seemingly slightly beats RV64GC)
    But, not as much as GC+JX.
    16/32/64/96: XG1 (still best for code density).
    32/64/96: XG2 and XG3;
    Also good for code density;
    Somehow XG3 loses to XG2 despite being nearly 1:1;
    Though, XG3 has mostly claimed the performance crown.

    Or, descending, code-density:
    XG1, RV64GC+JX, XG2, RV64G+JX, XG3, RV64GC, RV64G
    And, performance:
    XG3, XG2, RV64G+JX, XG1, RV64GC+JX, RV64G, RV64GC


    Where both the 16-bit ops and some lacking features (in RV64G and
    RV64GC) negatively affect things.

    Where, the main things that benefit JX here being:
    Jumbo prefixes, extending Imm12/Disp12 to 33 bits;
    Indexed Load/Store;
    Load/Store Pair;
    Re-adding ADDWU/SUBWU and similar.
    The Zba instructions also help,
    but Load/Store pair greatly reduces effect of Zba.


    It would be possible to get better code density than 'C' with some tweaks:
    Reducing many of the imm/disp fields by 1 bit;
    Would free up a lot of encoding space.
    Imm6/Disp6 eats too much encoding space here.
    Making most of the register fields 4 bits (X8..X23)
    Can improve hit-rate notably over Reg3.

    But:
    Main merit of 'C' is compatibility with binaries that use 'C';
    This merit would be lost by modifying or replacing 'C'.


    Had ended up leaving out 96-bit encodings for RV+JX, mostly
    because the encoding scheme kinda ruins it in this case (not really a
    good way to fix RISC-V's encodings to make it not suck).

    Tempting to consider collapsing the 80+ bit space into 96 or 96/128.

    So, say:
    xx-xxx-00: 16-bit
    xx-xxx-01: 16-bit
    xx-xxx-10: 16-bit
    xx-xxx-11: 32+ bits
    x0-111-11: 48 bits
    01-111-11: 64 bits
    11-111-11: 80+ bits (uses Func3 for 16-bit count)
    But, say:
    11-111-11: 96 bits
    Or, say:
    ...-xx0-nnnnn-11-111-11: 96 bits
    ...-001-nnnnn-11-111-11: 128 bits
    ...-011-nnnnn-11-111-11: 192 bits (eventual)
    ...-101-nnnnn-11-111-11: 256 bits (eventual)
    ...-111-nnnnn-11-111-11: 384 bits (eventual)

    Replacing: 80/96/112/128/144/160/176/192.
    Don't really need fine grained bucket sizes for large encodings.

    Where, having a few more bits for 96-bit ops makes them more usable.
    The main use for 96-bit here being for possible Imm64 ALU encodings.

    At least with my existing JX scheme, there is not enough encoding space
    to allow for Imm64 ALU encodings (which reduces the usefulness of having 96-bit encodings).




    Before modern pipelined computers, which have multi-stage pipelines for
    instruction _execution_, a simple form of pipelining was very common -
    usually in the form of a three-stage fetch, decode, and execute pipeline.
    Since the decoding of instructions can be so neatly separated from their
    execution, and thus performed well in advance of it, any overhead
    associated with variable-length instructions becomes irrelevant because it
    essentially takes place very nearly completely in parallel to execution.

    It is certainly possible to decode potential instructions at every
    starting position in parallel, and later select the ones that actually correspond to the end of the previous instruction, but with 16-bit and
    32-bit instructions this potentially doubles the amount of instruction decoders necessary, plus the circuit for selecting the ones that are
    at actual instruction starts. I guess that this is the reason why ARM
    uses an uop cache in cores that can execute ARM T32. The fact that
    more recent ARM A64-only cores have often no uop cache while their
    A64+T32 predecessors have had one reinforces this idea.


    I took the option of not bothering with parallel execution for 16-bit ops.

    This does leave both XG1 and RV64GC (when using Compressed encodings) at
    a performance disadvantage. But, dealing with superscalar decoding for
    16-bit ops would add too much cost here.

    For an ISA like RV64GC, it could be possible in theory (if the compiler
    knows which functions are in the hot and cold paths) to use 16-bit
    encodings in the cold path but then only 32-bit encodings in the hot
    path (which also need to be kept 32-bit aligned).


    Even if 16-bit ops could be superscalar though, the benefits would be
    small: Code patterns that favor 16-bit ops also tend to be lower in
    terms of available ILP.

    Or, the reverse:
    Patterns that maximize ILP (such as unrolling and modulo-scheduling
    loops) tend to be hostile to the constraints of 16-bit encoding schemes.


    Decoding at 2 or 3 wide seems to make the most sense:
    Gets a nice speedup over 1;
    Works with in-order.

    Here, 3 is slightly better than 2.
    But getting that much benefit from going any wider than this is likely
    to require some amount of "heavy lifting".

    So, while a 4 or 5 wide in-order design could be possible, pretty much
    no normal code is going to have enough ILP to make it worthwhile over 2
    or 3.

    Also 2 or 3 works reasonably well with a 96-bit fetch:
    Can do 1x: 32/64/96
    Can do 2x or 3x 32-bit;
    Could do (potentially) 32+64 or 64+32.
    64-bit ops being somewhat less common than 32 bit ops.
    96-bit ops are statistically infrequent.

    Or, at least, 96-bit ops are not frequent enough to make superscalar worthwhile (but may still be "moderate frequency" on the grand scale,
    mostly for Imm64 ops and similar).



    OTOH, on AMD64/IA-32 Intel's recent E-Cores do not use an uop cache
    either, but instead the most recent instances have 3 decoders each of
    which can decode 3 instructions per cycle (i.e., they attempt to
    decode at many more positions and then select 3 per cycle out of
    those); so apparently even byte-oriented variable-length encoding can
    be decoded quickly enough.


    It is possible with x86 / x86-64, just ugly and kinda expensive.

    Likely:
    Note lengths for a Mod/RM at each position (1..5);
    Note lengths for an opcode at each position (0..3);
    Note lengths for a prefix at each position (0..3).

    Then say:
    Prefix Length (Lp) at RIP (Potentially 0..5, usually 0/1);
    Opcode Length (Lo) at RIP+Lp (Usually 1 or 2);
    Mod/Rm Length (Lrm) at RIP+Lp+Lo (1..6);
    Add 4 or 8 if Opcode has an Immediate.
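
    A rough software skeleton of that split (a heavily simplified sketch:
    real x86 length decode needs full per-opcode tables for ModRM presence
    and immediate size, and the prefix set below is a toy subset):

        #include <stdint.h>

        /* Lp: count a few legacy prefixes, then an optional REX. */
        static int prefix_len(const uint8_t *b) {
            int n = 0;
            while (b[n] == 0x66 || b[n] == 0x67 ||
                   b[n] == 0xF2 || b[n] == 0xF3) n++;
            if ((b[n] & 0xF0) == 0x40) n++;       /* REX, 64-bit mode */
            return n;
        }

        /* Lo: one-byte opcodes vs the 0F two-byte map. */
        static int opcode_len(const uint8_t *b) {
            return (b[0] == 0x0F) ? 2 : 1;
        }

        /* Lrm: ModRM itself, optional SIB, optional displacement (1..6
           bytes); ignores the SIB base==5 disp32 case for brevity. */
        static int modrm_len(const uint8_t *b) {
            int mod = b[0] >> 6, rm = b[0] & 7, n = 1;
            if (mod != 3 && rm == 4) n++;         /* SIB present    */
            if (mod == 1) n += 1;                 /* disp8          */
            if (mod == 2) n += 4;                 /* disp32         */
            if (mod == 0 && rm == 5) n += 4;      /* RIP-rel disp32 */
            return n;
        }

    Total length is then Lp + Lo + Lrm (when the opcode has a ModRM), plus
    the opcode's immediate size, exactly as outlined above.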

    One trick here could be to precompute a lot of this when fetching cache
    lines, though a full instruction length could not be determined at fetch
    time if the instruction crosses a cache line unless we have also fetched
    the next cache line. Full instruction length could be determined in
    advance (at fetch time) if it always fetches both cache lines and then
    determines the lengths for one of them before writing to the cache
    (possibly, if the next line is fetched, its contents are not written to
    the cache, as lengths can't be fully determined yet).


    At this stage, I sorta have an idea how one could implement an x86 core,
    but am not particularly inclined to do so.

    Even if one decodes x86 efficiently, there are a few other drawbacks:
    2-register encodings for everything;
    excessive numbers of memory accesses (particularly to the stack);
    ...

    Would still be hard pressed to make the performance good absent ugly
    tricks like resorting to OoO.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 19 23:36:54 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 12/18/2025 4:25 PM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    Given the great popularity of the RISC architecture, I assumed that one of
    its characteristics, instructions that are all 32 bits in length, produced
    a great increase in efficiency over variable-length instructions.

    Some RISCs have that, some RISCs have two instruction lengths: 16 bits
    and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
    but then eliminated in Power), one variant of Berkeley RISC, ARM T32, RISC-V with the C extension, and probably others.


    I have come to realize that 32/64 is probably better than 16/32 here, primarily in terms of performance, but also helps with code-density (a
    pure 32/64 encoding scheme can beat 16/32 in terms of code-density
    despite only having larger instructions available).

    My 66000 does not even bother with 16-bit instructions--and still ends
    up requiring a lower instruction count than RISC-V. {32, 64, 96, 128, 160}
    are the instruction sizes; with no instructions ever requiring constants
    to be assembled.

    One could argue "But MOV is less space efficient", but one can note that it also makes sense to try to design the compiler to minimize the number of unnecessary MOV instructions and similar (and when using the minimal
    number of register moves, the lack of a small MOV encoding has less
    effect on code density).

    Most of the MOV instructions in My 66000 are found::
    a) before a call--moving values to argument positions,
    b) after a call--moving results to post-call positions,
    c) around loops --moving values for next loop iteration.

    16/32/64 is also sensible, but the existence of 16-bit ops negatively affects encoding space (it is more of a strain to have both 16-bit ops
    and 6-bit register fields; but at least some code can benefit from
    having 64 GPRs).

    I agree that RISC-V HAS too many 16-bit instructions, and that it gains
    too little in the code density department by having them.

    So, say:
    16/32: RV64GC (OK code density)
    16/32/64: RV64GC+JX: Better code density than RV64GC.
    32/64: RV64G+JX (seemingly slightly beats RV64GC)
    But, not as much as GC+JX.
    16/32/64/96: XG1 (still best for code density).
    32/64/96: XG2 and XG3;
    Also good for code density;
    Somehow XG3 loses to XG2 despite being nearly 1:1;
    Though, XG3 has mostly claimed the performance crown.

    Or, descending, code-density:
    XG1, RV64GC+JX, XG2, RV64G+JX, XG3, RV64GC, RV64G
    And, performance:
    XG3, XG2, RV64G+JX, XG1, RV64GC+JX, RV64G, RV64GC

    Rather than tracking code density--which measures cache performance,
    I have come to think that counting instructions themselves is the key.
    If the instruction is present then it has to be executed; if not, then
    it was free !! in all real senses.

    Where both the 16-bit ops and some lacking features (in RV64G and
    RV64GC) negatively affect things.

    Like a reasonable OpCode layout.

    Where, the main things that benefit JX here being:
    Jumbo prefixes, extending Imm12/Disp12 to 33 bits;
    I have no prefixes {well CARRY}
    -#Imm5, Imm16, Imm32, Imm64, Disp16, Disp32, Disp64
    Indexed Load/Store;
    check
    Load/Store Pair;
    LDM, STM, ENTER, EXIT, MM, MS
    Re-adding ADDWU/SUBWU and similar.
    {int,float}×{OpCode}×{Byte, Half, Word, DBLE}
    The Zba instructions also help,
    but Load/Store pair greatly reduces effect of Zba.


    It would be possible to get better code density than 'C' with some tweaks:
    Reducing many of the imm/disp fields by 1 bit;
    Would free up a lot of encoding space.
    Imm6/Disp6 eats too much encoding space here.
    Which is why -#imm5 works better.
    Making most of the register fields 4 bits (X8..X23)
    Can improve hit-rate notably over Reg3.

    But:
    Main merit of 'C' is compatibility with binaries that use 'C';
    This merit would be lost by modifying or replacing 'C'.

    I can still fit my entire ISA into the space vacated by C.
    ----------------------
    It is certainly possible to decode potential instructions at every
    starting position in parallel, and later select the ones that actually correspond to the end of the previous instruction, but with 16-bit and 32-bit instructions this potentially doubles the amount of instruction decoders necessary, plus the circuit for selecting the ones that are
    at actual instruction starts. I guess that this is the reason why ARM
    uses an uop cache in cores that can execute ARM T32. The fact that
    more recent ARM A64-only cores have often no uop cache while their
    A64+T32 predecessors have had one reinforces this idea.


    I took the option of not bothering with parallel execution for 16-bit ops.

    I took the option of not bothering with 16-bit Ops.
    -----------------------
    Even if 16-bit ops could be superscalar though, the benefits would be
    small: Code patterns that favor 16-bit ops also tend to be lower in
    terms of available ILP.

    I suspect that argument setup before and result take-down after call
    would have quite a bit of parallelism.
    I suspect that moving fields around for the next loop iteration would
    have significant parallelism.
    ------------------------------
    Decoding at 2 or 3 wide seems to make the most sense:
    Gets a nice speedup over 1;
    Works with in-order.

    Here, 3 is slightly better than 2.
    But getting that much benefit from going any wider than this is likely
    to require some amount of "heavy lifting".

    Probably not conducive to FPGA implementations due to LUT count and
    special memories {predictors, ..., TLBs, staging buffers, ...}

    So, while a 4 or 5 wide in-order design could be possible, pretty much
    no normal code is going to have enough ILP to make it worthwhile over 2
    or 3.

    1-wide 0.7 IPC
    2-wide 1.0 IPC gain of 50%
    3-wide 1.4 IPC gain of 40%
    6-wide 2.2 IPC gain of 50% from doubling the width
    10-wide 3.2 IPC gain of 50% from almost doubling the width

    Also 2 or 3 works reasonably well with a 96-bit fetch:

    But fetches are 128 bits wide !!! and the average instruction is 35 bits wide.
    ------------------------
    One trick here could be to precompute a lot of this when fetching cache lines, though a full instruction length could not be determined at fetch time if the instruction crosses a cache line unless we have also fetched
    the next cache line. Full instruction length could be determined in
    advance (at fetch time) if it always fetches both cache lines and then
    determines the lengths for one of them before writing to the cache
    (possibly, if the next line is fetched, its contents are not written to
    the cache, as lengths can't be fully determined yet).

    All of the above was solved in Athlon, and then made 3× smaller in Opteron
    at the cost of 1 pipe stage in DECODE.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Dec 19 23:24:01 2025
    From Newsgroup: comp.arch

    I have come to realize that 32/64 is probably better than 16/32 here,
    primarily in terms of performance, but also helps with code-density (a
    pure 32/64 encoding scheme can beat 16/32 in terms of code-density
    despite only having larger instructions available).

    My 66000 does not even bother with 16-bit instructions--and still ends
    up requiring a lower instruction count than RISC-V. {32, 64, 96, 128, 160}
    are the instruction sizes; with no instructions ever requiring constants
    to be assembled.

    Indeed, My 66000 aims for "fat" instructions so as to try and reduce
    instruction counts. That should hopefully result in an efficient ISA:
    fewer instructions should cost fewer runtime resources (as long as they
    don't get split into more µops).

    Most of the MOV instructions in My 66000 are found::
    a) before a call--moving values to argument positions,
    b) after a call--moving results to post-call positions,
    c) around loops --moving values for next loop iteration.
    [...]
    I suspect that argument setup before and result take-down after call
    would have quite a bit of parallelism. I suspect that moving fields
    around for the next loop iteration would have significant parallelism.

    Are you saying that you expect the efficiency of My 66000 could be
    improved by adding some way to express those moves in a better way?
    A key element of the Mill is/was its ability to "permute" its belt
    elements in a single cycle. I still don't fully understand how this is
    encoded in the ISA and implemented in hardware, but it sounds like
    you're hinting in the same direction: some kind of "parallel move"
    instruction with many inputs and many outputs.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 20 03:07:23 2025
    From Newsgroup: comp.arch

    On 2025-12-19 6:36 p.m., MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 12/18/2025 4:25 PM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    Given the great popularity of the RISC architecture, I assumed that one of
    its characteristics, instructions that are all 32 bits in length, produced
    a great increase in efficiency over variable-length instructions.

    Some RISCs have that, some RISCs have two instruction lengths: 16 bits
    and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
    but then eliminated in Power), one variant of Berkeley RISC, ARM T32,
    RISC-V with the C extension, and probably others.


    I have come to realize that 32/64 is probably better than 16/32 here,
    primarily in terms of performance, but also helps with code-density (a
    pure 32/64 encoding scheme can beat 16/32 in terms of code-density
    despite only having larger instructions available).

    My 66000 does not even bother with 16-bit instructions--and still ends
    up requiring a lower instruction count than RISC-V. {32, 64, 96, 128, 160}
    are the instruction sizes; with no instructions ever requiring constants
    to be assembled.

    One could argue "But MOV is less space efficient", but one can note that it also
    makes sense to try to design the compiler to minimize the number of
    unnecessary MOV instructions and similar (and when using the minimal
    number of register moves, the lack of a small MOV encoding has less
    effect on code density).

    Most of the MOV instructions in My 66000 are found::
    a) before a call--moving values to argument positions,
    b) after a call--moving results to post-call positions,
    c) around loops --moving values for next loop iteration.

    16/32/64 is also sensible, but the existence of 16-bit ops negatively
    affects encoding space (it is more of a strain to have both 16-bit ops
    and 6-bit register fields; but at least some code can benefit from
    having 64 GPRs).

    I agree that RISC-V HAS too many 16-bit instructions, and that it gains
    too little in the code density department by having them.

    So, say:
    16/32: RV64GC (OK code density)
    16/32/64: RV64GC+JX: Better code density than RV64GC.
    32/64: RV64G+JX (seemingly slightly beats RV64GC)
    But, not as much as GC+JX.
    16/32/64/96: XG1 (still best for code density).
    32/64/96: XG2 and XG3;
    Also good for code density;
    Somehow XG3 loses to XG2 despite being nearly 1:1;
    Though, XG3 has mostly claimed the performance crown.

    Or, descending, code-density:
    XG1, RV64GC+JX, XG2, RV64G+JX, XG3, RV64GC, RV64G
    And, performance:
    XG3, XG2, RV64G+JX, XG1, RV64GC+JX, RV64G, RV64GC

    Rather than tracking code density--which measures cache performance,
    I have come to think that counting instructions themselves is the key.
    If the instruction is present then it has to be executed; if not, then
    it was free !! in all real senses.

    Where both the 16-bit ops and some lacking features (in RV64G and
    RV64GC) negatively affect things.

    Like a reasonable OpCode layout.

    Where, the main things that benefit JX here being:
    Jumbo prefixes, extending Imm12/Disp12 to 33 bits;
    I have no prefixes {well CARRY}
    -#Imm5, Imm16, Imm32, Imm64, Disp16, Disp32, Disp64
    Indexed Load/Store;
    check
    Load/Store Pair;
    LDM, STM, ENTER, EXIT, MM, MS
    Re-adding ADDWU/SUBWU and similar.
    {int,float}×{OpCode}×{Byte, Half, Word, DBLE}
    The Zba instructions also help,
    but Load/Store pair greatly reduces effect of Zba.


    It would be possible to get better code density than 'C' with some tweaks:
    Reducing many of the imm/disp fields by 1 bit;
    Would free up a lot of encoding space.
    Imm6/Disp6 eats too much encoding space here.
    Which is why -#imm5 works better.
    Making most of the register fields 4 bits (X8..X23)
    Can improve hit-rate notably over Reg3.

    But:
    Main merit of 'C' is compatibility with binaries that use 'C';
    This merit would be lost by modifying or replacing 'C'.

    I can still fit my entire ISA into the space vacated by C.
    ----------------------
    It is certainly possible to decode potential instructions at every
    starting position in parallel, and later select the ones that actually
    correspond to the end of the previous instruction, but with 16-bit and
    32-bit instructions this potentially doubles the amount of instruction
    decoders necessary, plus the circuit for selecting the ones that are
    at actual instruction starts. I guess that this is the reason why ARM
    uses an uop cache in cores that can execute ARM T32. The fact that
    more recent ARM A64-only cores have often no uop cache while their
    A64+T32 predecessors have had one reinforces this idea.


    I took the option of not bothering with parallel execution for 16-bit ops.

    I took the option of not bothering with 16-bit Ops.
    -----------------------
    Even if 16-bit ops could be superscalar though, the benefits would be
    small: Code patterns that favor 16-bit ops also tend to be lower in
    terms of available ILP.

    I suspect that argument setup before and result take-down after call
    would have quite a bit of parallelism.
    I suspect that moving fields around for the next loop iteration would
    have significant parallelism.
    ------------------------------
    Decoding at 2 or 3 wide seems to make the most sense:
    Gets a nice speedup over 1;
    Works with in-order.

    Here, 3 is slightly better than 2.
    But getting that much benefit from going any wider than this is likely
    to require some amount of "heavy lifting".

    Probably not conducive to FPGA implementations due to LUT count and
    special memories {predictors, ..., TLBs, staging buffers, ...}

    So, while a 4 or 5 wide in-order design could be possible, pretty much
    no normal code is going to have enough ILP to make it worthwhile over 2
    or 3.

    1-wide 0.7 IPC
    2-wide 1.0 IPC gain of 50%
    3-wide 1.4 IPC gain of 40%
    6-wide 2.2 IPC gain of 50% from doubling the width
    10-wide 3.2 IPC gain of 50% from almost doubling the width

    Also 2 or 3 works reasonably well with a 96-bit fetch:

    But fetches are 128 bits wide !!! and the average instruction is 35 bits wide.

    Could the average instruction size be an argument for the use of wider
    (40-bit) instructions? One would think that the instruction should be a
    bit wider than average. Been trying to shrink Qupls4 instructions down to
    40 bits for Qupls5. The odd size is not that great an issue if variable
    lengths are supported.

    ------------------------
    One trick here could be to precompute a lot of this when fetching cache
    lines, though a full instruction length could not be determined at fetch
    time if the instruction crosses a cache line unless we have also fetched
    the next cache line. Full instruction length could be determined in
    advance (at fetch time) if it always fetches both cache lines and then
    determines the lengths for one of them before writing to the cache
    (possibly, if the next line is fetched, its contents are not written to
    the cache, as lengths can't be fully determined yet).

    All of the above was solved in Athlon, and then made 3× smaller in Opteron at the cost of 1 pipe stage in DECODE.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Dec 20 02:09:24 2025
    From Newsgroup: comp.arch

    On 12/19/2025 5:36 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 12/18/2025 4:25 PM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    Given the great popularity of the RISC architecture, I assumed that one of
    its characteristics, instructions that are all 32 bits in length, produced
    a great increase in efficiency over variable-length instructions.

    Some RISCs have that, some RISCs have two instruction lengths: 16 bits
    and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
    but then eliminated in Power), one variant of Berkeley RISC, ARM T32,
    RISC-V with the C extension, and probably others.


    I have come to realize that 32/64 is probably better than 16/32 here,
    primarily in terms of performance, but also helps with code-density (a
    pure 32/64 encoding scheme can beat 16/32 in terms of code-density
    despite only having larger instructions available).

    My 66000 does not even bother with 16-bit instructions--and still ends
    up requiring a lower instruction count than RISC-V. {32, 64, 96, 128, 160}
    are the instruction sizes; with no instructions ever requiring constants
    to be assembled.


    This is one area where my JX thing saves:
    Greatly reduces cases that need constants to be assembled.

    Doesn't eliminate it, partly because of an issue where JX runs out of
    encoding bits before being able to encode a 64-bit constant, and the
    space for constants (falling just short of 64 bits) is far less useful (because there is a big gulf of very few constants between 33 and 62 bits).

    XG3 does allow full 64-bit constants though.

    Have come up with a possible encoding scheme to add 64-bit constants to
    RISC-V using 96-bit encodings (using a different encoding scheme from my original JX design).


    Have started floating ideas for how to go further within the context of
    the RISC-V encoding space:
    96-bit encodings to add Imm64 forms;
    And some Disp64/Abs64 encodings.


    My CPU core wouldn't necessarily actually support Disp64, but Abs64
    could have some use-cases.

    I could have reason to consider adding an Abs64 special case in both
    RV64G+JX and XG3. Most likely, the Disp64 encoding would be decoded, but
    would likely just misbehave if the displacement is larger than +/- 4GB
    (unless the optional feature to allow Disp48 addressing is enabled).


    The bigger incentive ATM would be to allow for Abs64 branches, as this
    could encode direct calls and branches between RV64GC+JX mode and XG3.


    Granted, would still not address the deficiencies in standard RV64G and RV64GC.


    One could argue "But MOV is less space efficient", but one can note that it also
    makes sense to try to design the compiler to minimize the number of
    unnecessary MOV instructions and similar (and when using the minimal
    number of register moves, the lack of a small MOV encoding has less
    effect on code density).

    Most of the MOV instructions in My 66000 are found::
    a) before a call--moving values to argument positions,
    b) after a call--moving results to post-call positions,
    c) around loops --moving values for next loop iteration.


    Kinda similar.

    Some of these can be reduced in the register allocator, but some amount
    are unavoidable.

    Either way, when 2R MOV stops being one of the most common instructions,
    the incentive for reducing its size goes down, as do the clock cycles
    spent moving values from one register to another.


    16/32/64 is also sensible, but the existence of 16-bit ops negatively
    affects encoding space (it is more of a strain to have both 16-bit ops
    and 6-bit register fields; but at least some code can benefit from
    having 64 GPRs).

    I agree that RISC-V HAS too many 16-bit instructions, and that it gains
    too little in the code density department by having them.


    Pretty much.


    When used, XG1's 16-bit encoding can see a higher percentage of 16-bit
    ops than RISC-V's, and is less dog-chewed.

    Technically, its listing is bigger than RV-C, but this is partly because
    the 16-bit ops in what I now call XG1 were the original form of this
    ISA (it started out 16-bit, then became 16/32, and then started to
    evolve away from the use of 16-bit encodings).

    Well, part of this was because earlier on in the ISA's development, I
    ended up with the 16-bit ISA, which then grew a bunch of 32-bit
    encodings. I ended up doing a test comparing 16-bit only, 16/32, and
    32-bit only variants.

    Result was, basically:
    16/32: Best code density, good performance;
    32 Only: Worse code density, best performance;
    16 Only: Medium code density, worst performance.


    16/32 won this round, but 16-bit only basically died. There was no
    reason to have an ISA variant that was only meh for code density and
    sucked for performance. This model was programmed in basically the same
    way as its SuperH ancestors, but the benefit of having instructions that
    were half the size was reduced when (on average) one needs around 60-70%
    more of them to do the same amount of work.


    I realized that if a fixed-length subset were used, a 32-bit-only subset
    was a much better bet. At the time, it still only had 32 registers, but
    a 32-bit only subset could much more aggressively use these registers (whereas, 16/32 needed to use R16..R31 sparingly; and 16-bit only mostly
    only had the first 16 registers).


    But, that said, something more like the 16-bit instructions from SuperH
    would still (on average) beat the crap out of the RISC-V 'C' encodings.


    A lot could be left out though. One of the major drawbacks of RV-C isn't
    so much what instructions it has or doesn't have, but that it is
    effectively crippled by the fact that most of the instructions it does
    have use 3-bit register fields. And Reg3 is lacking in callee-saved
    registers, meaning many of the instructions are effectively only really
    usable in leaf functions or similar.


    Well, and then from there, later felt a need to expand the register space:
    I considered two ideas at first:
    The first idea, initially rejected, was to drop the 16-bit encodings and expand everything to 64 GPRs;
    Second idea was to mirror part of the 32-bit encoding space but
    expanded to 64 GPRs (keeping the original ISA as is).

    The first idea was later revived under the name of XG2, and then the
    original form was later renamed to XG1 (ironically, as a backronym from
    XG2, which originally meant XGPR2, or the second design attempt at
    expanding the ISA to 64 GPRs).

    Though, annoyingly, XG2 didn't fully replace XG1, much as XG3 hasn't
    fully replaced either XG1 or XG2.


    Well, and now I am left with a situation:
    XG1 still has the best code density (smallest binaries);
    XG3 currently has the best performance, but worst code density.

    As noted, Doom ".text" sizes (BGBCC, re-running some of them):
    XG1 : 276K
    XG2 : 296K
    RV64GC+JX: 302K
    XG3 : 320K
    RV64G+JX : 340K
    RV64GC : 370K
    RV64G : 440K


    Though, everything is pretty close together here (sizes and rankings
    tend to jostle around some).



    Vs:
    GCC/ELF, RV64GC: 1166K (*)
    GCC ELF, x86-64: 480K
    MSVC EXE, X64 : 770K

    (*): The GCC ELF binary seems to contain significant amounts of
    metadata. More space burned on symbol tables than on the code itself.


    So, say:
    16/32: RV64GC (OK code density)
    16/32/64: RV64GC+JX: Better code density than RV64GC.
    32/64: RV64G+JX (seemingly slightly beats RV64GC)
    But, not as much as GC+JX.
    16/32/64/96: XG1 (still best for code density).
    32/64/96: XG2 and XG3;
    Also good for code density;
    Somehow XG3 loses to XG2 despite being nearly 1:1;
    Though, XG3 has mostly claimed the performance crown.

    Or, descending, code-density:
    XG1, RV64GC+JX, XG2, RV64G+JX, XG3, RV64GC, RV64G
    And, performance:
    XG3, XG2, RV64G+JX, XG1, RV64GC+JX, RV64G, RV64GC

    Rather than tracking code density--which measures cache performance,
    I have come to think that counting instructions themselves is the key.
    If the instruction is present then it has to be executed; if not, then
    it was free !! in all real senses.

    Where both the 16-bit ops and some lacking features (in RV64G and
    RV64GC) negatively affect things.

    Like a reasonable OpCode layout.


    I suspect even without it, they likely still would have turned Imm/Disp
    fields into confetti.


    Does make me half wonder how an ISA would look on average if the design approach consisted of drawing cards and rolling a d20 to assign bits
    ("roll d20 for where the register field goes" or "roll d20 for which
    immediate bit comes next").



    Where, the main things that benefit JX here being:
    Jumbo prefixes, extending Imm12/Disp12 to 33 bits;
    I have no prefixes {well CARRY}
    -#Imm5, Imm16, Imm32, Imm64, Disp16, Disp32, Disp64

    I used prefix encodings both for my own ISA and extending RISC-V.


    Indexed Load/Store;
    check

    I will probably at some point need to switch to the Zilx encodings.

    Probably once it gets official approval as an extension, I will go over
    to it. Though, the original proposal was reduced to only having
    Load-Indexed as the RISC-V ARC people really dislike Indexed-Store, and
    also Indexed-Store has a lower usage frequency than Indexed-Load.


    Load/Store Pair;
    LDM, STM, ENTER, EXIT, MM, MS

    For JX, only had Load/Store Pair...


    Re-adding ADDWU/SUBWU and similar.
    {int,float}×{OpCode}×{Byte, Half, Word, DBLE}

    ADDWU/SUBWU are limited in scope.

    In my own ISA, I had the same functionality as ADDU.L and SUBU.L, ...

    They existed in early BitManip but were dropped in favor of further
    canonizing the use of ADDW (for sign-extending unsigned int) and the
    mess that is the ".UW" instructions (which are multiplying, much to my annoyance).

    Better IMO to fix things in ways that are "not stupid" vs just throwing
    more cruft at the problem.

    In my compiler, I was like, "Yeah, no, I am going to do this stuff in a
    way that isn't stupid".

    But, the rest of the RISC-V community is bent on pushing forward down
    this path, which means ".UW" cruft (instructions which selectively
    zero-extend Rs2).



    The Zba instructions also help,
    but Load/Store pair greatly reduces effect of Zba.


    It would be possible to get better code density than 'C' with some tweaks:
    Reducing many of the imm/disp fields by 1 bit;
    Would free up a lot of encoding space.
    Imm6/Disp6 eats too much encoding space here.
    Which is why -#imm5 works better.

    Yes.

    For 16 bit instructions, having a bunch of instructions with 6 bit
    immediate and displacement values eats too much encoding space.


    Making most of the register fields 4 bits (X8..X23)
    Can improve hit-rate notably over Reg3.

    But:
    Main merit of 'C' is compatibility with binaries that use 'C';
    This merit would be lost by modifying or replacing 'C'.

    I can still fit my entire ISA into the space vacated by C.


    Likewise...

    This is where XG3 came from:
    Drop RISC-V C extension;
    Awkwardly shove nearly the entirety of XG2 into the hole that was left over; ...

    Well, granted, I couldn't fit the *entirety* of XG2 into the hole:
    It lost WEX and a few misc features in the process;
    So, XG3 goes over to using a RISC superscalar approach rather than LIW,
    but, more or less...

    It kept predication; had I not kept predication, XG3 would have used
    around 1/3 of the encoding space that it currently uses.



    ----------------------
    It is certainly possible to decode potential instructions at every
    starting position in parallel, and later select the ones that actually
    correspond to the end of the previous instruction, but with 16-bit and
    32-bit instructions this potentially doubles the amount of instruction
    decoders necessary, plus the circuit for selecting the ones that are
    at actual instruction starts. I guess that this is the reason why ARM
    uses an uop cache in cores that can execute ARM T32. The fact that
    more recent ARM A64-only cores have often no uop cache while their
    A64+T32 predecessors have had one reinforces this idea.
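
    A toy model of that select step (assuming the RISC-V convention that a
    16-bit parcel whose low two bits are 11 starts a 32-bit instruction; the
    per-position decodes happen in parallel, the start chain is the serial
    part):

        #include <stdbool.h>
        #include <stdint.h>

        /* Length in parcels of the instruction starting at this parcel. */
        static int insn_parcels(uint16_t parcel) {
            return ((parcel & 3) == 3) ? 2 : 1;
        }

        /* Decode speculatively at every 16-bit position; then mark which
           positions really begin an instruction. */
        static void find_starts(const uint16_t *p, int n, bool *is_start) {
            for (int i = 0; i < n; i++) is_start[i] = false;
            int pos = 0;               /* assume the block starts an insn */
            while (pos < n) {
                is_start[pos] = true;
                pos += insn_parcels(p[pos]);
            }
        }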


    I took the option of not bothering with parallel execution for 16-bit ops.

    I took the option of not bothering with 16-bit Ops.

    Well, as noted, if I were doing it again, I wouldn't recreate XG1 as it is.

    As for RISC-V's C extension, the only reason I had added it was that it seemed basically unavoidable if I wanted binary compatibility with Linux binaries (and GCC).

    Seemingly, even if configured for RV64G, "GNU glibc" still goes and
    manages to throw 'C' instructions into the mix.

    With the apparent rise of the RVA23 Profile, I am probably going to need
    to deal somehow with the V extension as well, but my current plan is to
    try to deal with this via traps and hot patching rather than actually supporting V in HW.




    -----------------------
    Even if 16-bit ops could be superscalar though, the benefits would be
    small: Code patterns that favor 16-bit ops also tend to be lower in
    terms of available ILP.

    I suspect that argument setup before and result take-down after call
    would have quite a bit of parallelism.
    I suspect that moving fields around for the next loop iteration would
    have significant parallelism.

    Blobs of Loads and Stores are not ILP that my CPU core can use...

    It is mostly limited to ILP involving ALU ops and similar.


    A lot of areas dominated by RV-C ops tend to be heavy in RAW
    dependencies and instruction chains depending on the prior instruction.
    This isn't the sort of stuff that does well on a superscalar.

    Stuff that gets better ILP tends to look more like:
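    // lifting butterflies: 3 stages, 4 independent 2-op chains per stage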
    s1=i0-(i1>>1); s0=s1+i1;
    s3=i2-(i3>>1); s2=s3+i3;
    s5=i4-(i5>>1); s4=s5+i5;
    s7=i6-(i7>>1); s6=s7+i7;
    t1=s0-(s2>>1); t0=t1+s2;
    t3=s1-(s3>>1); t2=t3+s3;
    t5=s4-(s6>>1); t4=t5+s6;
    t7=s5-(s7>>1); t6=t7+s7;
    u1=t0-(t4>>1); u0=u1+t4;
    u3=t1-(t5>>1); u2=u3+t5;
    u5=t2-(t6>>1); u4=u5+t6;
    u7=t3-(t7>>1); u6=u7+t7;
    But, this sort of code tends not to map over to RV-C instructions all
    that well.

    So, one ends up with separate domains of:
    Parallel stuff where the superscalar nails it, but almost entirely
    32-bit ops;
    Code with a lot of 16-bit ops that wouldn't do so well even if the
    superscalar worked on RV-C ops.

    ------------------------------
    Decoding at 2 or 3 wide seems to make the most sense:
    Gets a nice speedup over 1;
    Works with in-order.

    Here, 3 is slightly better than 2.
    But, getting that much benefit from going any wider than this, is likely
    to require some amount of "heavy lifting".

    Probably not conducive to FPGA implementations due to LUT count and
    special memories {predictors, ..., TLBs, staging buffers, ...}


    Yeah.

    The hardware in my case is still pretty dumb.
    More in the areas of using lots of registers and round-robin allocating
    them so that ILP is good; because RAW dependencies can really mess up ILP.

    Round-robin register allocation eats a lot of registers though.



    So, while a 4 or 5 wide in-order design could be possible, pretty much
    no normal code is going to have enough ILP to make it worthwhile over 2
    or 3.

    1-wide 0.7 IPC
    2-wide 1.0 IPC gain of 50%
    3-wide 1.4 IPC gain of 40%
    6-wide 2.2 IPC gain of 50% from doubling the width
    10-wide 3.2 IPC gain of 50% from almost doubling width



    Depends a lot on the code, but yeah, I have seen enough gains that 3
    benefits over 2, but 4 or 5 hits a bottleneck.

    Would need to have wackiness like register renaming and similar.



    Also 2 or 3 works reasonably well with a 96-bit fetch:

    But Fetches are 128-bits wide !!! and the average instruction is 35-bits wide.
    ------------------------

    In x86, yes, maybe.


    For my CPU core, fetch is 96 bits.
    If you fetch a 96-bit instruction, it only fetches 1 instruction;
    So, currently superscalar only happens with narrower instructions.


    For XG3, I am coming out with an average-case instruction size of 32.2 bits.

    For RV64G+JX: 33.92 bits.

    Mostly because RV64G+JX seems to have a larger proportion of Jumbo prefixes.


    However, superscalar only on 32-bit encodings works out OK as jumbo
    prefixed instructions are a relative minority of the total here.


    It is more that, with RV-C, ~30% or so of the instructions become 16-bit,
    and roughly half the time the 32-bit instructions are misaligned, so the superscalar gets wrecked.

    Granted, one can argue that superscalar that doesn't deal with 16-bit
    ops, misaligned ops, or instruction sequences crossing cache-line
    boundaries, is maybe kinda lame...



    One trick here could be to precompute a lot of this when fetching cache
    lines, though a full instruction length could not be determined at fetch
    time if the instruction crosses a cache line unless we have also fetched
    the next cache line. Full instruction length could be determined in
    advance (at fetch time) if it always fetches both cache-lines and then
    determines the lengths for one of them before writing to the cache
    (possibly, if the next line is fetched, its contents are not written to
    the cache as lengths can't be fully determined yet).
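
    A sketch of the per-line predecode this implies (field names and sizes
    are hypothetical; the length rule assumed is again the RISC-V low-bits-11
    one, with a 64-byte line = 32 parcels):

        #include <stdint.h>

        struct predecode {
            uint32_t start_mask; /* bit i set: parcel i begins an insn */
            uint8_t  spill;      /* parcels of the last insn owed to the
                                    next line (crosses the boundary)    */
        };

        /* carry_in: parcels at the head of this line still belonging to
           an instruction that started in the previous line. */
        static struct predecode predecode_line(const uint16_t p[32],
                                               int carry_in) {
            struct predecode pd = { 0, 0 };
            int pos = carry_in;
            while (pos < 32) {
                pd.start_mask |= UINT32_C(1) << pos;
                pos += ((p[pos] & 3) == 3) ? 2 : 1;
            }
            pd.spill = (uint8_t)(pos - 32);
            return pd;
        }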

    All of the above was solved in Athlon, and then made 3× smaller in Opteron at the cost of 1 pipe stage in DECODE.

    Maybe, but I don't know how they implemented it. I am just sorta
    guessing how it could be implemented assuming I were interested in implementing an x86 core...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Dec 20 04:35:12 2025
    From Newsgroup: comp.arch

    On 12/20/2025 2:07 AM, Robert Finch wrote:
    On 2025-12-19 6:36 p.m., MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:


    ...


    So, while a 4 or 5 wide in-order design could be possible, pretty much
    no normal code is going to have enough ILP to make it worthwhile over 2
    or 3.

    1-wide 0.7 IPC
    2-wide 1.0 IPC gain of 50%
    3-wide 1.4 IPC gain of 40%
    6-wide 2.2 IPC gain of 50% from doubling the width
    10-wide 3.2 IPC gain of 50% from almost doubling width
    Also 2 or 3 works reasonably well with a 96-bit fetch:

    But Fetches are 128-bits wide !!! and the average instruction is 35-
    bits wide.

    Could the average instruction size be an argument for the use of wider (40-bits) instructions? One would think that the instruction should be a
    bit wider than average. Been trying to shrink Qupls4 instructions down to 40-bits for Qupls5. The odd size is not that great an issue if variable lengths are supported.


    Doesn't match with my stats, as noted:
    XG2 and XG3: Roughly 32.2 bits
    RV64G+JX: 33.92

    So, 40-bits would on-average be worse, unless there were enough jumbo
    prefixes to push it closer to a 40 bits average (not currently true) and
    if the non-power-of-2-ness of 40 bits were less of an issue.



    Looks like in both cases, the number of total instructions is similar (a little over 80k), but RV64G+JX has a lot more jumbo prefixes.


    Though, there are a few things XG3 has that RV64G+JX lacks:
    ADD Imm17s, Rn //used some (lesser case)
    MOV Imm17s, Rn //also used a lot, but was added in JX
    MOV.{L/Q} (GBR/GP, Disp16*4/8), Rn //global variables (used a lot)


    The latter apparently saves roughly 5000 jumbo prefixes, or around 20K.
    Which roughly matches up with the binary-size delta between XG3 and
    RV64G+JX.


    Apart from the "MOV Imm17s, Rn" case (having been added by my JX
    extension), there would be roughly 6000 more jumbo prefixes (and/or 6000 LUI+ADDI pairs; adding roughly another 24K, for cases which fit into
    Imm17s but which don't fit into Imm12s), ...


    Which, kinda makes sense in a way:
    BGBCC essentially treats XG3 as a special case of the RISC-V mode, so effectively tries to drive both of them with more or less the same
    logical instruction sequence (albeit with some differences in ABI rules
    and register usage).

    So, it would appear the main difference, at least in terms of binary
    size, is that for XG3 less space is spent loading/storing global variables.


    As for registers:
    RV64G+JX:
    GPRs and FPRs are split.
    Jumbo prefixes can join them, but this costs jumbo prefixes.
    So, the register allocator partitions them.
    For plain RV64G/RV64GC, it is a hard divide.
    XG3:
    There are 64 GPRs.

    ABI rules are mostly similar between RV64G and XG3:
    X0..X4: ZERO, LR/RA, SP, GBR/GP, TP
    X5..X7: Stomp/Scratch
    X8/X9: Callee-Save
    X10..X17: Arg1..Arg8
    X18..X27: Callee-Save
    X28..X31: Scratch
    F0..F3: Scratch
    F4..F7: Scratch (RV-ABI) | Callee-Save (XG3 ABI)
    F8/F9: Callee-Save
    F10..F17: Scratch (RV-ABI) | Arg9..Arg16 (XG3 ABI)
    F18..F27: Callee-Save
    F28..F31: Scratch

    Technically, either ISA can be compiled with either set of ABI rules
    though (but on RV there is less benefit from the rule tweaks).

    The RV-ABI doesn't exactly match either LP64 or LP64D ABIs though, as it
    is sort of a hybrid.

    The LP64D ABI splits up integer and FPU arguments; BGBCC doesn't do this
    (and doing so would be actively detrimental in some areas; even if,
    arguably, for FPU-heavy code it might make sense to pass FPU
    values in FPU registers...).


    There is some uncertainty about the handling of passing and returning
    structs by value, more related to seeming version inconsistencies in the
    ABI docs here.

    For BGBCC, rules are mostly:
    Passing/returning, basic:
    1-8 bytes: 1 register
    9-16 bytes: 2 registers (even pairs only)
    17+ bytes: by reference
    struct return by reference:
    hidden magic argument passed in by caller.

    Some docs had implied rules more like those in the SysV AMD64 ABI, but
    no, not gonna do this...

    Seems like my approach mostly matches up with at least some versions of
    the RISC-V ABI here.

    Though, BGBCC only uses even pairs, so if one tries to pass a 16-byte
    struct or similar and it is on an odd register number, it will bump it
    up to the next even register. This differs as it appears the RV ABI
    would also use odd numbered pairs.
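
    Concretely, under the rules above (register names are illustrative,
    following the RV64G+JX ABI listed earlier):

        /* 16 bytes: passed/returned in an even-aligned register pair. */
        typedef struct { double x, y; } vec2;

        /* 24 bytes: passed by reference; returned via a hidden pointer
           argument that the caller supplies. */
        typedef struct { double x, y, z; } vec3;

        vec2 f(vec2 a);   /* 'a' in e.g. X10:X11, result back in a pair */
        vec3 g(vec3 a);   /* 'a' by reference + hidden return pointer   */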


    This is very different from the ABI used for XG2 though.
    Different register assignments, different register balance, etc.
    Args:
    R4..R7, R20..R23, R36..R39, R52..R55
    Scratch:
    R2/R3, R16..R19, R32..R35, R48..R51
    Callee-Save:
    R8..R14, R24..R31, R40..R47, R56..R63
    SPR:
    R0: DLR (Stomp)
    R1: DHR (Stomp / Scratch / Alt Link Register)
    R15: SP (Stack)

    Well, also:
    XG1/XG2 ABI:
    Had a register spill space (64 or 128 bytes);
    RV/XG3 ABI:
    No Spill Space

    Neither ABI uses a frame pointer (and BGBCC uses exclusively
    constant-sized stack frames).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Dec 20 10:47:35 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:

    Could the average instruction size be an argument for the use of wider (40-bits) instructions? One would think that the instruction should be a
    bit wider than average. Been trying to shrink Qupls4 instructions down to 40-bits for Qupls5. The odd size is not that great an issue if variable lengths are supported.

    Can you show a few examples of what these wide instructions are used
    for? Seems like a lot of bits to me...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 20 08:47:44 2025
    From Newsgroup: comp.arch

    On 2025-12-20 5:47 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

    Could the average instruction size be an argument for the use of wider
    (40-bits) instructions? One would think that the instruction should be a
    bit wider than average. Been trying to shrink Qupls4 instructions down to
    40-bits for Qupls5. The odd size is not that great an issue if variable
    lengths are supported.

    Can you show a few examples of what these wide instructions are used
    for? Seems like a lot of bits to me...

    The average instruction size being about 35 bits most likely comes
    from including immediate constant bits. 32-bits works for most things,
    if one is willing to increase the dynamic instruction count.
    But extra bits can be used for selecting small immediates or vector
    registers.

    7 opcode
    6 dest reg + sign control
    6 src1 reg + sign control
    6 src2 reg + sign control
    7 func code
    -------
    32

    Qupls4 has the additional bits for
    6 src3 reg + sign control
    4 vector register select
    3 small immediate select
    3 second ALU op
    ---
    16

    For immediates Qupls4 has

    7 opcode
    6 dst reg
    6 src1 reg
    2 precision control
    27 bit immediate constant

    Having three source registers allows: fused multiply add, bit field ops,
    three input adds, multiplex, and a few others. There are some
    instructions with four source registers (fused dot product) or two
    destinations (carry outputs / fp status).
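
    For illustration, one way the 32-bit base format above could be packed
    (the post gives the field widths but not their positions, so the order
    here is an assumption):

        #include <stdint.h>

        /* 7 opcode | 6 dest | 6 src1 | 6 src2 | 7 func = 32 bits. */
        static uint32_t pack32(unsigned op, unsigned rd,
                               unsigned rs1, unsigned rs2, unsigned fn) {
            return  (op  & 0x7Fu)
                 | ((rd  & 0x3Fu) <<  7)
                 | ((rs1 & 0x3Fu) << 13)
                 | ((rs2 & 0x3Fu) << 19)
                 | ((fn  & 0x7Fu) << 25);
        }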



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 20 09:23:53 2025
    From Newsgroup: comp.arch

    On 2025-12-20 8:47 a.m., Robert Finch wrote:
    On 2025-12-20 5:47 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

    Could the average instruction size be an argument for the use of wider
    (40-bits) instructions? One would think that the instruction should be a
    bit wider than average. Been trying to shrink Qupls4 instructions down to
    40-bits for Qupls5. The odd size is not that great an issue if variable
    lengths are supported.

    Can you show a few examples of what these wide instructions are used
    for? Seems like a lot of bits to me...

    The average instruction size being about 35 bits most likely comes
    from including immediate constant bits. 32-bits works for most things,
    if one is willing to increase the dynamic instruction count.
    But extra bits can be used for selecting small immediates or vector
    registers.

    7 opcode
    6 dest reg + sign control
    6 src1 reg + sign control
    6 src2 reg + sign control
    7 func code
    -------
    32

    Qupls4 has the additional bits for
    6 src3 reg + sign control
    4 vector register select
    3 small immediate select
    3 second ALU op
    ---
    16

    For immediates Qupls4 has

     7 opcode
     6 dst reg
     6 src1 reg
     2 precision control
    27 bit immediate constant

    Having three source registers allows: fused multiply add, bit field ops,
    three input adds, multiplex, and a few others. There are some
    instructions with four source registers (fused dot product) or two
    destinations (carry outputs / fp status).



    I am not thrilled at the use of 48-bit instructions; I suspect Qupls4
    would be power-hungry, not really competitive. I have been using it more
    as a learning tool. Should be able to pack things into fewer bits.
    32-bits is a decent size, with prefixes/postfixes for things that will
    not fit.

    I made a 36-bit ISA a while ago on the notion of average instruction
    size (18-bit compressed instructions). Writing the assembler for it was something special. Took about 10x as much effort as a byte-oriented one.
    So, I am not keen on non-byte sizes. But maybe another 36-bitter. Takes
    a bit to get used to addressing for the instruction pointer.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 20 20:19:53 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-12-19 6:36 p.m., MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:
    --------------------------------------------

    Probably not conducive to FPGA implementations due to LUT count and
    special memories {predictors, ..., TLBs, staging buffers, ...}

    So, while a 4 or 5 wide in-order design could be possible, pretty much
    no normal code is going to have enough ILP to make it worthwhile over 2
    or 3.

    1-wide 0.7 IPC
    2-wide 1.0 IPC gain of 50%
    3-wide 1.4 IPC gain of 40%
    6-wide 2.2 IPC gain of 50% from doubling the width
    10-wide 3.2 IPC gain of 50% from almost doubling width

    Also 2 or 3 works reasonably well with a 96-bit fetch:

    But Fetches are 128-bits wide !!! and the average instruction is 35-bits wide.

    Could the average instruction size be an argument for the use of wider (40-bits) instructions?

    I think that is an independent variable, especially if you have variable
    length instructions. Also note: My CARRY instruction concatenates 2-bits
    on up to 8 subsequent instructions--but it is used seldom enough to avoid disturbing the long term average bit count.

    One would think that the instruction should be a
    bit wider than average. Been trying to shrink Qupls4 instructions down to 40-bits for Qupls5. The odd size is not that great an issue if variable lengths are supported.

    ------------------------
    One trick here could be to precompute a lot of this when fetching cache
    lines, though a full instruction length could not be determined at fetch
    time if the instruction crosses a cache line unless we have also fetched
    the next cache line. Full instruction length could be determined in
    advance (at fetch time) if it always fetches both cache-lines and then
    determines the lengths for one of them before writing to the cache
    (possibly, if the next line is fetched, its contents are not written to
    the cache as lengths can't be fully determined yet).

    All of the above was solved in Athlon, and then made 3× smaller in Opteron at the cost of 1 pipe stage in DECODE.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 20 20:15:51 2025
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    I have come to realize that 32/64 is probably better than 16/32 here,
    primarily in terms of performance, but also helps with code-density (a
    pure 32/64 encoding scheme can beat 16/32 in terms of code-density
    despite only having larger instructions available).

    My 66000 does not even bother with 16-bit instructions--and still ends
    up requiring a lower instruction count than RISC-V. {32, 64, 96, 128, 160}
    are the instruction sizes; with no instructions ever requiring constants
    to be assembled.

    Indeed, My 66000 aims for "fat" instructions so as to try and reduce instruction counts. That should hopefully result in an efficient ISA:
    fewer instructions should cost less runtime resources (as long as they
    don't get split into more µops).

    Most of the MOV instructions in My 66000 are found::
    a) before a call--moving values to argument positions,
    b) after a call--moving results to post-call positions,
    c) around loops --moving values for next loop iteration.
    [...]
    I suspect that argument setup before and result take-down after call
    would have quite a bit of parallelism. I suspect that moving fields
    around for the next loop iteration would have significant parallelism.

    Are you saying that you expect the efficiency of My 66000 could be
    improved by adding some way to express those moves in a better way?

    Probably, yes; I just never found a way to do it (yet).

    A key element of the Mill is/was its ability to "permute" its belt
    elements in a single cycle. I still don't fully understand how this is encoded in the ISA and implemented in hardware, but it sounds like
    you're hinting in the same direction: some kind of "parallel move" instruction with many inputs and many outputs.
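
    Whatever the encoding, the semantics would have to be of the
    read-all-then-write-all kind; a software model of that (not a claim
    about how the Mill does it):

        #include <stdint.h>

        /* All sources are read before any destination is written, so
           moving {R2,R1} into {R1,R2} swaps instead of duplicating. */
        static void parallel_move(uint64_t regs[],
                                  const int dst[], const int src[], int n) {
            uint64_t tmp[8];                  /* up to 8 moves, say */
            for (int i = 0; i < n; i++) tmp[i] = regs[src[i]];
            for (int i = 0; i < n; i++) regs[dst[i]] = tmp[i];
        }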

    For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}

    For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 20 20:45:35 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-12-20 5:47 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

    Could the average instruction size be an argument for the use of wider
    (40-bits) instructions? One would think that the instruction should be a >> bit wider than average. Bin trying to shrink Qupls4 instructions down to >> 40-bits for Qupls5. The odd size is not that great an issue if variable
    lengths are supported.

    Can you show a few examples of what these wide instructions are used
    for? Seems like a lot of bits to me...

    The average instruction size being about 35 bits most likely comes
    from including immediate constant bits. 32-bits works for most things,
    if one is willing to increase the dynamic instruction count.
    But extra bits can be used for selecting small immediates, vector
    register selection.

    7 opcode
    6 dest reg + sign control
    6 src1 reg + sign control
    6 src2 reg + sign control
    7 func code
    -------
    32

    Just recently, I dropped sign control on the result; the compiler can almost
    always use the sign control on its next use as an operand.

    I stayed with 32 registers.
    I only needed 6-bits of Major OpCode::
    a) I still have 23 slots unassigned out of 64.
    GROUP                    Major
    b) all VLE stuff         001xxx
    c) all weird constant    000xxx
    d) all branches          011xxx
    e) all 3 register stuff  010xxx
    f) all disp16 stuff      10xxxx
    g) all imm16 stuff       11xxxx

    Major OpCodes permanently reserved::

    OpCode   corresponds
    000000   positive    +0..2^26
    001111   positive FP +1/32
    010000   positive FP +128
    101111   negative FP -1/32
    110000   negative FP -128
    111111   negative    -0..2^26

    Qupls4 has the addional bits for
    6 src3 reg + sign control
    4 vector register select
    3 small immediate select
    3 second ALU op
    ---
    16

    For immediates Qupls4 has

    7 opcode
    6 dst reg
    6 src1 reg
    2 precision control
    27 bit immediate constant

    Having three source registers allows: fused multiply add, bit field ops, three input adds, multiplex, and a few others.

    With the advent of IEEE 754-2008 there is no longer ANY REASON not
    to have 3-operand instructions--FMAC, Insert, CMOV. I get 3-input
    adds from CARRY, along with double width shifts, inserts, and exact
    floating points.
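
    In C terms, the 3-operand shape that 754-2008 blessed is just fma()
    (standard C99 <math.h>; the CARRY/INSert forms are My 66000 specifics
    not modeled here):

        #include <math.h>

        /* d = a*b + c with a single rounding: the FMAC shape,
           3 operands in, 1 result out. */
        double fmac(double a, double b, double c) {
            return fma(a, b, c);
        }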

    There are some
    instructions with four source registers (fused dot product) or two destinations (carry outputs / fp status).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sat Dec 20 15:55:08 2025
    From Newsgroup: comp.arch

    For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}

    In terms of encoding, these are fairly easy and could each fit within
    a 32bit instruction.

    For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    IIUC these could have any number of registers and the destination and
    source regs can be "anything", so the encoding would take up more space. Arguably it might be possible in many/most cases to arrange for
    {Rm,Rn,Rj} to be {R1..Rn}, so it might be able to use the same
    instruction as the call-setup.

    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.

    Hmm... One would hope this can be handled entirely in the renamer
    without touching the actual data path, but ... sorry: if you don't know
    how to do it, I sure don't either.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 20 23:14:58 2025
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}

    In terms of encoding, these are fairly easy and could each fit within
    a 32bit instruction.

    You are going to put 6×5-bit fields in a single 32-bit instruction with
    a 6-bit Major OpCode ?!?! I would like to see it done. Remember: all
    specifiers are in the first 32-bits of the "instruction"; only constants
    are used as Variable Length.

    For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    IIUC these could have any number of registers and the destination and
    source regs can be "anything", so the encoding would take up more space. Arguably it might be possible in many/most cases to arrange for
    {Rm,Rn,Rj} to be {R1..Rn}, so it might be able to use the same
    instruction as the call-setup.

    In principle I buy this argument:: in practice I can't see it happening.

    I can see an encoding that would provide a "bunch of MOVs/Renames"
    but only if I disobey a principal tenet of ISA encoding {one that RISC-V
    threw away on day 1}, namely that the register specification fields are
    at fixed locations. It is this tenet that removed some <arguably thin>
    logic before multiplexing the specifiers into the RF decoder. The fixed-position arrangement has neither the logic nor the multiplexer; RF specifiers
    are wired directly to the RF/Renamer decoder ports.

    I just can't see how to make these run reasonably fast within the constraints of the GBOoO Data Path.

    Hmm... One would hope this can be handled entirely in the renamer
    without touching the actual data path, but ... sorry: if you don't know
    how to do it, I sure don't either.

    Once one goes beyond the 3-operand 1-result property, all sorts of little things start to break--like multiplexing the RF specifiers. The Data-Path
    and the Register/Renamer ports are all designed to this FMAC requirement, giving us CMOV and INSert instructions with reasonable encodings.

    Right now, there are no register specifiers in the variable length part
    of ISA--just constants.

    It is also not exactly clear how one "makes" an instruction with {2,3,4,5,6,7}
    writes traverse the pipeline smoothly. I gave serious consideration to finding
    a smooth solution to even {2} results, and for this I built an accumulator
    attached to the 3-operand+1-result function units, where the added operand is
    read once (if needed) and written once (if needed), often not requiring ANY
    RF activity in support of the CARRY variable itself.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 20 19:41:26 2025
    From Newsgroup: comp.arch

    On 2025-12-20 6:14 p.m., MitchAlsup wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}

    In terms of encoding, these are fairly easy and could each fit within
    a 32bit instruction.

    You are going to put 6×5-bit fields in a single 32-bit instruction with
    a 6-bit Major OpCode ?!?! I would like to see it done. Remember: all
    specifiers are in the first 32-bits of the "instruction"; only constants
    are used as Variable Length.

    For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    IIUC these could have any number of registers and the destination and
    source regs can be "anything", so the encoding would take up more space.
    Arguably it might be possible in many/most cases to arrange for
    {Rm,Rn,Rj} to be {R1..Rn}, so it might be able to use the same
    instruction as the call-setup.

    In principle I buy this argument:: in practice I can't see it happening.

    I can see an encoding that would provide a "bunch of MOVs/Renames"
    but only if I disobey a principal tenet of ISA encoding {one that RISC-V
    threw away on day 1}, namely that the register specification fields are
    at fixed locations. It is this tenet that removed some <arguably thin>
    logic before multiplexing the specifiers into the RF decoder. The fixed-position arrangement has neither the logic nor the multiplexer; RF specifiers
    are wired directly to the RF/Renamer decoder ports.

    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.

    Hmm... One would hope this can be handled entirely in the renamer
    without touching the actual data path, but ... sorry: if you don't know
    how to do it, I sure don't either.

    Once one goes beyond the 3-operand 1-result property, all sorts of little things start to break--like multiplexing the RF specifiers. The Data-Path
    and the Register/Renamer ports are all designed to this FMAC requirement, giving us CMOV and INSert instructions with reasonable encodings.

    Right now, there are no register specifiers in the variable length part
    of ISA--just constants.

    It is also not exactly clear how one "makes" an instruction with {2,3,4,5,6,7}
    writes traverse the pipeline smoothly. I gave serious consideration to finding a smooth solution to even {2} results, and for this I built an accumulator attached to the 3-operand+1-result function units, where the added operand is read once (if needed) and written once (if needed), often not requiring ANY
    RF activity in support of the CARRY variable itself.


    Stefan

    I tentatively added such an instruction, MOVE {Ra,Rb,Rc},{Rx,Ry,Rz},
    using the micro-op translator. (Qupls4 has 48-bits to work with). But it
    may be too slow; I have to see what shows up on the timing path. It is
    broken into zero to three micro-ops, so not any faster, but it is more
    code dense.

    I am relying on the micro-ops to have a consistent format fed to the
    renamer. The ISA instructions may differ slightly.

    Having fun with the dispatcher. I had it as an out-of-order unit, when
    it should really be part of the in-order pipeline to reduce the size.
    Handling the dispatch OoO was easier as there may not be enough units available to dispatch to. OoO dispatch did not need to worry about
    stalling the pipeline. Switching it to in-order cut the size down to ½
    what it was though.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Dec 21 04:26:57 2025
    From Newsgroup: comp.arch

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}

    For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken
    here...

    When I read this, I thought that there was a standard technique for doing
    stuff like that in a GBOoO machine: just break down all the fancy
    instructions into RISC-style pseudo-ops. But apparently, since you would
    know all about that, there must be a reason why it doesn't apply in these cases.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 21 10:14:20 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:

    [...]
    I made a 36-bit ISA a while ago on the notion of average instruction
    size (18-bit compressed instructions). Writing the assembler for it was something special. Took about 10x as much effort as a byte-oriented one.
    So, I am not keen on non-byte sizes. But maybe another 36-bitter. Takes
    a bit to get used to addressing for the instruction pointer.

    You'd probably need 128-bit bundles with a granularity of 18 bits.
    This is certainly doable (see Itanium) but very probably not fun
    to write. The instruction pointer would then be 18-bit aligned,
    with one out of eight addresses invalid.
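
    A quick sketch of that addressing arithmetic (assuming the low three IP
    bits select one of seven 18-bit slots in a 128-bit bundle; slot 7 is the
    invalid one):

        #include <stdint.h>

        /* 128-bit bundle = 7 x 18-bit slots + 2 spare bits. */
        static uint64_t ip_bundle(uint64_t ip) { return ip >> 3; }
        static unsigned ip_slot(uint64_t ip)   { return (unsigned)(ip & 7); }

        /* Bit offset of a slot within its bundle; ip_slot() == 7 would
           trap as an invalid address. */
        static unsigned slot_bit_offset(unsigned slot) { return slot * 18; }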

    Hm... if there were instructions which could span several bundles,
    the two left-over bits could be used for synchronization, to trap
    branches which land in the middle of instructions, or for something
    else.

    If your architecture has 64 registers, it could probably use
    the extra encoding space vs. 32.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 21 18:12:14 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}

    For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken here...

    when I read this, I thought that there was a standard technique for doing stuff like that in a GBOoO machine.

    There is::: it is called "load 'em up, pass 'em through". That is no
    different than any other calculation, except that no mangling of the
    bits is going on.

    Just break down all the fancy instructions into RISC-style pseudo-ops. But apparently, since you would know all about that, there must be a reason why it doesn't apply in these cases.

    x86 has short/small MOV instructions; not so with RISCs.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 21 12:32:03 2025
    From Newsgroup: comp.arch

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}

    For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken
    here...

    when I read this, I thought that there was a standard technique for doing
    stuff like that in a GBOoO machine.

    There is::: it is called "load 'em up, pass 'em through". That is no different than any other calculation, except that no mangling of the
    bits is going on.

    Just break down all the fancy
    instructions into RISC-style pseudo-ops. But apparently, since you would
    know all about that, there must be a reason why it doesn't apply in these
    cases.

    x86 has short/small MOV instructions; not so with RISCs.

    Does your EMS use a so-called LOCK MOV? For some damn reason I remember something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 21 20:32:44 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
    John Savard <quadibloc@invalid.invalid> posted:

    However, this involved overhead, and the headers would themselves take
    time to decode. In any event, all the schemes I came up with were also
    elaborate and overly complicated.

    As I warned...

    I wasn't blind - of course I knew that all along. But what I still
    failed to see was any good alternative.

    But I have finally realized what I think is the decisive reason why I
    had been mistaken.

    At Last ?!?

    On further reflection, I think I had realized that decoding is done ahead
    of execution, and thus can be thought of as mostly done in parallel with
    it, before. But decoding still has to be done *first* before execution can start. So I felt that in a design where super-aggressive pipelining or vectorization allows many instructions to be done in parallel, if decoding is necessarily serial, it could still become a bottleneck.

    Before modern pipelined computers, which have multi-stage pipelines for
    instruction _execution_, a simple form of pipelining was very common -
    usually in the form of a three-stage fetch, decode, and execute
    pipeline.

    Since the decoding of instructions can be so neatly separated from
    their execution, and thus performed well in advance of it, any overhead
    associated with variable-length instructions becomes irrelevant because
    it essentially takes place very nearly completely in parallel to
    execution.

    Or in other words, if you can decode K-instructions per cycle, you'd
    better be able to execute K-instructions per cycle--or you have a
    serious blockage in your pipeline.

    No.

    If you flipped "decode" and "execute" in that sentence above, I would 100% agree. And maybe this _is_ just a typo.

    Not a typo--the part of the pipeline which is <dynamically> narrowest
    is the part that limits performance. I suggest strongly that you should
    not make/allow the decoder to play that part. It is <relatively> easy to
    widen the execution path--just throw function units at the problem--we
    used to call that "Catch up bandwidth". If you get behind in execution
    you can "catch up" if you have enough FUs and write ports.

    But if you actually did mean that sentence exactly as written, I would disagree. This is why: I regard executing instructions as 'doing the
    actual work' and decoding instructions as... some unfortunate trivial overhead that can't be avoided.

    Partially agree--as long as you do not make it a performance limiter.

    Hence, if I can decode instructions much faster than I can execute them... _possibly_ the decoder is overdesigned, but it's also perfectly possible that there isn't really a slower decoder design that would make sense.

    Many modern wide decoders take {3,4,5} cycles, whereas most instructions
    (60%ish) {integer, logical, shift, store, branch} take only 1 cycle to
    execute. All those cycles are present so that the decoder can run at
    optimal throughput, and still present a von Neumann execution model to SW.

    And maybe this perspective _explains_ why I dabbled in elaborate schemes
    to allow decoding in parallel. I absolutely refused to allow decoding to become a bottleneck, no matter how aggressively OoO the execution part is designed for breakneck speed at all costs.

    How many headers can you decode in a single cycle ??? And decoding all
    those headers in a cycle allows you to decode how many instructions ?

    Without headers, I can PARSE 16+ instructions in a single cycle, where
    parse means routing all of the needed bits to each unary-instruction
    DECODER to follow.


    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 21 21:21:57 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}

    For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken
    here...

    when I read this, I thought that there was a standard technique for doing >> stuff like that in a GBOoO machine.

    There is::: it is called "load 'em up, pass 'em through". That is no different than any other calculation, except that no mangling of the
    bits is going on.

    Just break down all the fancy
    instructions into RISC-style pseudo-ops. But apparently, since you would >> know all about that, there must be a reason why it doesn't apply in these >> cases.

    x86 has short/small MOV instructions; not so with RISCs.

    Does your EMS use a so-called LOCK MOV? For some damn reason I remember something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc...

    The 2-operand+displacement LD/STs have a lock bit in the instruction--that
    is, it is not a prefix. MOV in My 66000 is reg-reg or reg-constant.

    Oh, and its ESM not EMS. Exotic Synchronization Method.

    In order to get ATOMIC-ADD-to-Memory, I will need an Instruction-Modifier {A.K.A. a prefix}.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun Dec 21 17:31:40 2025
    From Newsgroup: comp.arch

    For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
    In terms of encoding, these are fairly easy and could each fit within
    a 32bit instruction.
    You are going to put 6×5-bit fields in a single 32-bit instruction with
    a 6-bit Major OpCode ?!?!

    AFAICT we need "only" 5x 5bit (if we hardcode the destination to be
    R1..R5 in one instruction and if we hardcode R1..R5 as the source
    registers in the other instruction).

    I would like to see it done.

    It's definitely tight (we still need some way to indicate how many
    registers we want to move), but it seems within the realm of possible.
    Whether it "pays for its encoding cost" is a separate question.

    I can see an encoding that would provide a "bunch of MOVs/Renames"
    but only if I disobey a principal tenet of ISA encoding {one that RISC-V
    threw away on day 1}, namely that the register specification fields are
    at fixed locations. It is this tenet that removed some <arguably thin>
    logic before multiplexing the specifiers into the RF decoder. The fixed-position arrangement has neither the logic nor the multiplexer; RF specifiers
    are wired directly to the RF/Renamer decoder ports.

    Clearly, if we want a "multi-move" instruction, it has to break such assumptions. I was hoping that maybe it's OK to break it because
    [ assuming the execution doesn't really exist because it's all done in
    the renamer, ] the "execution" of this instruction doesn't actually need
    that data.

    It is also not exactly clear how one "makes" an instruction with {2,3,4,5,6,7} writes traverse the pipeline smoothly. I gave serious consideration to finding a smooth solution to even {2} results, and for
    this I built an accumulator attached to the 3-operand+1-result
    function units where the added operand is read once (if needed) and
    written once (if needed) often not requiring ANY RF activity in
    support of the CARRY variable itself.

    I'm not really surprised.
    The only way I can see it work is if we can arrange for such
    a multi-move to *not* have multiple outputs, and instead be treated as
    "just a renaming".

    The way I see it, the problem is that after

    MULTIMOVE {R1,R2} <= {R6,R8}

    the preceding instruction which generated a result into R6 now needs to
    put the result into both R6 and R1. Maybe a way to avoid that problem
    is to make the renaming architectural. I.e. add a "register renaming
    table" (RRT), and introduce the instruction RENAME which changes
    that RRT. Whenever an instruction wants to read register Rn, the actual architectural register we'll read is obtained by passing `n` through RRT.
    [ Any write to a register would presumably reset that register's entry in
    RRT, to avoid too many headaches for ASM programmers. ]
    All problems can be solved by adding a level of indirection, they say.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Dec 22 08:24:29 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    The way I see it, the problem is that after

    MULTIMOVE {R1,R2} <= {R6,R8}

    the preceding instruction which generated a result into R6 now needs to
    put the result into both R6 and R1.

    Not necessarily. It can also mean that in the register map both the
    entry for R6 and R1 point to the same physical register (where the
    result of the preceding instruction that wrote into R6 landed). That
    would make it necessary to reference-count the physical registers.

    Given that the register renamer in Lion Cove manages to perform 7.02
    dependent (and 7.25 independent) moves per cycle, I expect that it
    uses a technique like I outlined above.
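
    A minimal software model of that kind of move elimination (table sizes
    and names invented):

        #define NARCH 32
        #define NPHYS 128

        static int rat[NARCH];     /* arch reg -> phys reg            */
        static int refcnt[NPHYS];  /* arch regs mapped to each phys   */

        /* A MOV handled entirely in the renamer: no data movement;
           dst and src now name the same physical register. A phys
           reg is freed when its count drops to zero. */
        static void rename_mov(int dst, int src) {
            refcnt[rat[dst]]--;
            refcnt[rat[src]]++;
            rat[dst] = rat[src];
        }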

    Maybe a way to avoid that problem
    is to make the renaming architectural. I.e. add a "register renaming
    table" (RRT), and introduce the instruction RENAME which changes
    that RRT. Whenever an instruction wants to read register Rn, the actual
    architectural register we'll read is obtained by passing `n` through RRT.

    All of that happens with microarchitectural renaming (your RRT is
    called RAT (register alias table), however). Your "RENAME"
    instruction is called "MOV". Why make the RAT architectural?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Mon Dec 22 20:00:06 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> wrote:
    On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:

    It is certainly possible to decode potential instructions at every
    starting position in parallel, and later select the ones that actually
    correspond to the end of the previous instruction,

    Oh, yes, I had always realized that, but dismissed it as far too wasteful.

    Well, Mitch claims average 35 bits per instruction, that means
    about 90% utilization of decoders, so not bad. Probably more
    waste is due to muxes needed to shift instructions into right
    positions, but since you allow variant encodings you need
    muxes too.

    Also, consider that alternative to variable length instructions
    is to use longer instructions or more of them. In case of constants
    classic RISC needs more instructions to assemble a constant than
    extra words in variable length encoding. So classic RISC is
    going to need more "decode events" than machine using variable
    length encoding with 32-bit units, even though all "decode events"
    are "useful" on RISC and some "decode events" on variable length
    machine are discarded.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 22 13:49:16 2025
    From Newsgroup: comp.arch

    On 12/21/2025 1:21 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}

    For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken >>>> here...

    when I read this, I thought that there was a standard technique for doing >>>> stuff like that in a GBOoO machine.

    There is::: it is called "load 'em up, pass 'em through". That is no
    different than any other calculation, except that no mangling of the
    bits is going on.

    Just break down all the fancy
    instructions into RISC-style pseudo-ops. But apparently, since you would >>>> know all about that, there must be a reason why it doesn't apply in these >>>> cases.

    x86 has short/small MOV instructions; not so with RISCs.

    Does your EMS use a so-called LOCK MOV? For some damn reason I remember
    something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc...

    The 2-operand+displacement LD/STs have a lock bit in the instruction--that
    is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.

    Oh, and its ESM not EMS. Exotic Synchronization Method.

    In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier {A.K.A. a prefix}.

    Thanks for the clarification.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 22 22:38:54 2025
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    John Savard <quadibloc@invalid.invalid> wrote:
    On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:

    It is certainly possible to decode potential instructions at every
    starting position in parallel, and later select the ones that actually
    correspond to the end of the previous instruction,

    Oh, yes, I had always realized that, but dismissed it as far too wasteful.

    Well, Mitch claims average 35 bits per instruction, that means
    about 90% utilization of decoders, so not bad. Probably more
    waste is due to muxes needed to shift instructions into right
    positions, but since you allow variant encodings you need
    muxes too.

    Also note:: One can use these Muxes to route the parsed instruction
    to the DECODER of the Function Unit which will calculate the result--
    so those Muxes will be there ANYWAY !! and after decode one can directly
    route the instruction to its station entry.

    Also, consider that alternative to variable length instructions
    is to use longer instructions or more of them. In case of constants
    classic RISC needs more instructions to assemble a constant than
    extra words in variable length encoding. So classic RISC is
    going to need more "decode events" than machine using variable
    length encoding with 32-bit units, even though all "decode events"
    are "useful" on RISC and some "decode events" on variable length
    machine are discarded.

    About 7 My 66000 instructions do the work of 10 RISC-V instructions.
    {{And from what little I have seen, a similar ratio applies to ARM}}

    Given that a lot of GBOoO stuff is quadratic in width, a 7-wide My
    66000 should have only about 1/2 the cost of the 10-wide RISC-V
    (7^2/10^2 ~= 0.49). This includes {Rename ports, Reservation Station
    ports, Result bus ports, Reorder Buffer ports}, and sometimes Predictor
    ports, because wider means more branches per DECODE cycle. Function
    units are more linear; caches are about x^1.3-1.4 per port.

    Also note: The instructions used to paste constants together add
    latency to the calculation, making something that should start
    executing immediately take 4-6 cycles of delay--causing the
    execution window to NEED to be bigger--which costs the Renamer
    and Reorder Buffer, and station entries.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Dec 26 12:12:05 2025
    From Newsgroup: comp.arch

    Maybe a way to avoid that problem
    is to make the renaming architectural. I.e. add a "register renaming
    table" (RRT), and introduce the instruction RENAME which changes
    that RRT. Whenever an instruction wants to read register Rn, the actual
    architectural register we'll read is obtained by passing `n` through RRT.

    All of that happens with microarchitectural renaming (your RRT is
    called RAT (register alias table), however). Your "RENAME"
    instruction is called "MOV". Why make the RAT architectural?

    Good question. I was just reacting to Mitch, who seemed to say that one
    of the main problems with a multi-move instruction is that it has too
    many outputs and that doesn't fit into the general design; so by making
    the RRT/RAT architectural it makes the instruction single-output.
    I don't know if in practice it would make any difference.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 26 18:25:18 2025
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    Maybe a way to avoid that problem
    is to make the renaming architectural. I.e. add a "register renaming
    table" (RRT), and introduce the instruction RENAME which changes
    that RRT. Whenever an instruction wants to read register Rn, the actual
    architectural register we'll read is obtained by passing `n` through RRT.

    All of that happens with microarchitectural renaming (your RRT is
    called RAT (register alias table), however). Your "RENAME"
    instruction is called "MOV". Why make the RAT architectural?

    Good question. I was just reacting to Mitch who seemed to say that one
    of the main problems with a multi-move instruction is that it has too
    many output and that doesn't fit into the general design, so by making
    the RRT/RAT architectural it makes the instruction single-output.
    I don't know if in practice it would make any difference.

    There is a whole "bunch of things" that are being conflated here.
    a) single cycle renamers do not do 0-cycle MOVs;
    b) whereas once the renamer takes 3 cycles, you are pretty much required
    to perform 0-cycle moves in order to make up for the renaming latency.
    c) getting register specifiers to the rename ports becomes harder
    as the number of writes per instruction goes up.
    d) LDM instructions are a special case because the architectural registers
    are all sequential, so one can special-case architectural delays while
    renaming fairly easily.
    e) the data-path can perform as many MOVs per cycle as it has Function
    Units--so, you are not <typically> buying calculation "slots" when
    doing 0-cycle MOVs.

    And a lot of this comes down to HOW one renames registers in your implementation.

    A physical register file is essentially the logical outcome when the
    Reorder Buffer becomes big enough that <almost> no RF reads come from
    the RF and <essentially> all come from the ROB. Here you add the
    architectural RF-names to the ROB and avoid data movement at retire.
    Mc 88120 had such an organization.

    Architectural registers were read by CAM, and each CAM had a valid bit.
    There is always a CAM with the Architectural Register Number in a valid
    state. When matched, the CAM selected the register and it was read out,
    in addition, there was a 3-state "state" read out, and the Physical
    Register Name and which Function unit would deliver this pending result.

All of this would be dumped into the Reservation Station (Mc88120 would
write it there directly). If the value was in the pending state it would
be forwarded into the RS. If the RS was not launching an instruction,
the just-Decoded instruction would be launched into Execution (just in
case, and checked later).
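
A rough sketch of one entry of such a CAM-indexed register file, as I
read the description above (illustrative only, not Mc88120
documentation):

    #include <stdint.h>

    enum reg_state { R_VALID, R_PENDING, R_EMPTY };  /* 3-state "state" */

    typedef struct {
        uint8_t  valid;      /* CAM valid bit                            */
        uint8_t  arch_num;   /* architectural register number (CAM key)  */
        uint8_t  state;      /* one of reg_state                         */
        uint8_t  phys_name;  /* physical register name                   */
        uint8_t  fu;         /* function unit delivering a pending value */
        uint64_t value;      /* register contents, when state == R_VALID */
    } cam_entry;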

Each cycle, the valid bits of the CAM were transferred into the History
Buffer. Two bits were transferred: the valid bits if the Decode was
backed up (for any reason), and the valid bits if the Decode was
successful. A layer of logic between entries in the History Buffer
amalgamated the register status, so that one could retire all Decode
cycles in a single cycle (catch-up BW).

When a branch instruction was Decoded, it was associated with an index
into the History Buffer as a checkpoint. Mc88120 read the "instruction
cache" twice per cycle: once on the predicted direction and once on the
backup direction. The backup direction was placed in the recovery buffer
with the index of the branch in its RS.

When a branch instruction was launched, it provided an index into the
History Buffer, and if the branch had to be backed up, the History
Buffer could provide the valid bits for the subsequent Decode cycle with
0 delay. In order for this to work (0-cycle recovery), the recovery
buffer was read as the branch was launched, so if the branch was
mispredicted we already had instructions from the non-predicted path to
feed into Decode.
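
A minimal sketch of that 0-cycle recovery, as I read it (structure and
sizes are illustrative, not Mc88120's): each Decode cycle deposits two
sets of valid bits into the History Buffer, and a mispredicted branch
restores them with no repair loop.

    #include <stdint.h>

    #define HB_DEPTH 16

    typedef struct {
        uint32_t valid_if_backup;  /* CAM valid bits if Decode backs up */
        uint32_t valid_if_commit;  /* CAM valid bits if Decode succeeds */
    } hb_entry;

    static hb_entry history[HB_DEPTH];
    static uint32_t cam_valid;     /* live CAM valid bits */

    /* Mispredict on the branch holding checkpoint index c: the valid
       bits for the next Decode cycle come straight from the History
       Buffer entry saved when the branch was Decoded. */
    static void recover(int c) {
        cam_valid = history[c].valid_if_backup;
    }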

So, here, DECODE, RF read, Rename, Checkpointing, and data-flow
forwarding were all integrated into a single "resource" with a single
coordinating sequencer.

Rename: you could say it took 1 cycle, but it was commensurate with Decode.
Recovery: you could say it took 1 cycle, but it was commensurate with Branch resolution.
RF Backup: you could say it took 1 cycle, but it was commensurate with Branch resolution.

    All of which are a far cry from {3,4,5} cycle Decode+Rename.
    And {2,3,4} cycle Backup.

Also note: due to handling all of this in unary form, we could back up a
branch AND retire 1 or more Groups in the same cycle with a single OR
gate per renamable register.

    Given the 1-cycle Decode->RS and the 0-cycle Branch mispredict recovery
    we found no "particular" benefit to 0-cycle MOVs.

All of this was 1991-92. We were getting 2.2 IPC out of SPECint XLISP,
averaging 3.1 IPC across SPECint, and getting 5.97 IPC from MATRIX300,
without an L2 cache, running Mc 88100 ISA code compiled for the 88100
without modification, with a 4KB GShare branch predictor and a 16KB
direct-mapped 4-banked DCache.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Dec 28 16:22:06 2025
    From Newsgroup: comp.arch

    On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
    John Savard <quadibloc@invalid.invalid> wrote:
    On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:

    It is certainly possible to decode potential instructions at every
    starting position in parallel, and later select the ones that actually
    correspond to the end of the previous instruction,

    Oh, yes, I had always realized that, but dismissed it as far too
    wasteful.

Well, Mitch claims an average of 35 bits per instruction; that means about
    90% utilization of decoders, so not bad.

    His minimum instruction size is 32 bits, but I was going for 16 bits.

Also, consider that the alternative to variable-length instructions is to
    use longer instructions or more of them.

What I did instead was use variable-length instructions, but add a prefix
at the beginning of any 256-bit block of instructions that contained
them, which directly showed where each instruction began.

My intent was to avoid the disadvantages you identify for fixed-length
instructions while also avoiding the disadvantages of variable-length
instructions.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun Dec 28 12:39:19 2025
    From Newsgroup: comp.arch

Well, Mitch claims an average of 35 bits per instruction; that means about
    90% utilization of decoders, so not bad.
    His minimum instruction size is 32 bits, but I was going for 16 bits.

BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16-bit instructions, it means you
support instructions which presumably don't do very much work, because
it's hard to express a lot of "work to do" in 16 bits.

    Instead, the My 66000 ISA tries to make instructions fatter, so as to
    reduce the number of instructions rather than the size of each instruction.
    And the idea is that this applies both to static and to dynamic counts.

That's why Mitch includes negation and sign-extension directly inside
every arithmetic instruction. The hope is that they don't increase the
critical path (in the combinational logic of a single cycle), or that
they increase it less than the corresponding decrease in the other
critical path (the one in the dataflow graph of instructions).

Another way to look at it: for the execution of any specific
instruction, we spend N1 gate-delays on useful work, N2 gate-delays
waiting for the end of the cycle (because the duration of the cycle is
based on the maximum of all possible N1s), and N3 gate-delays on
latching. Fatter instructions are a way to try to reduce N2 and the
number of times we pay N3.
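
A toy numeric illustration of that accounting (the numbers are invented,
not measurements): with a fixed cycle of max(N1) + N3 gate delays, every
instruction whose useful work N1 falls short of the max pays the
difference as N2.

    #include <stdio.h>

    int main(void) {
        int n1[] = {16, 12, 9};      /* useful gate-delays of three ops */
        int max_n1 = 16;             /* slowest single-cycle operation  */
        int n3 = 5;                  /* latch overhead per cycle        */
        int cycle = max_n1 + n3;     /* cycle length, in gate delays    */

        for (int i = 0; i < 3; i++) {
            int n2 = max_n1 - n1[i]; /* dead time waiting for the clock */
            printf("op %d: N1=%2d N2=%d N3=%d of a %d-gate cycle\n",
                   i, n1[i], n2, n3, cycle);
        }
        /* Fusing two dependent ops into one fatter instruction pays N3
           once instead of twice, which is the reduction described. */
        return 0;
    }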

I wish I knew how to make an ISA where the single-cycle instructions
can perform even more work, like two or more dependent additions.
    [ I mean, I know of ways to do it, but they all tend to increase N2
    much too much on average. ]


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 17:59:02 2025
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

Well, Mitch claims an average of 35 bits per instruction; that means about
    90% utilization of decoders, so not bad.
    His minimum instruction size is 32 bits, but I was going for 16 bits.

    BTW, my understanding of Mitch's design is that this is related to instruction complexity: if you support 16bit instructions, it means you support instructions which presumably don't do very much work because
    it's hard to express a lot of "work to do" in 16bit.

    Instead, the My 66000 ISA tries to make instructions fatter, so as to
    reduce the number of instructions rather than the size of each instruction. And the idea is that this applies both to static and to dynamic counts.

    AND more importantly--LATENCY !

    That's why Mitch includes negation and sign-extension directly inside
    every arithmetic instruction. The hope is that they don't increase the critical path (in the combinatory logic of a single cycle), or they
    increase it less than the corresponding decrease in the other critical
    path (the one in the dataflow graph of instructions).

    If your adder has a carry in, my adder has no more gates of delay.

    Another way to look at it: For the execution of any specific
    instruction, we spend N1 gate-delays on useful work, N2 gate-delays
    waiting for the end of the cycle (because the duration of cycle is based
    on the maximum of all possible N1s), and N3 gate-delays on latching.
    Fatter instructions are a way to try and reduce N2 and the number of
    times we pay N3.

    Pretty spot on.

    I wish I knew how to make an ISA where the single cycle instructions
    can perform even more work like two or more dependent additions.

    Data-General Nova. However, with a modern RISC-like ISA, there are not
    enough small-shifts to amortize--except in the memory addressing arena
    where scaled indexing saves instructions and cycles.

    [ I mean, I know of ways to do it, but they all tend to increase N2
    much too much on average. ]

If you understand the My 66000 ISA implementation, you will find that
each stage in the pipeline has 1 more gate of delay than a typical
RISC-V pipeline. Given a 16-gate-delay logical pipeline (21 gates
per clock) I lose 5% while gaining 40%. I consider this a good
trade-off. With modern wire versus gate delays, I am losing less than
5% while still gaining that 40%. {Hint: 1/70% = 140%}
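
The arithmetic behind those percentages, as I read them (a sketch, not
Mitch's own derivation):

    #include <stdio.h>

    int main(void) {
        double freq_loss = 1.0 / 21.0;       /* 1 extra gate of 21: ~4.8% */
        double insn_gain = 1.0 / 0.70 - 1.0; /* ~70% as many instructions */
        printf("frequency lost ~%.1f%%, work gained ~%.1f%%\n",
               100 * freq_loss, 100 * insn_gain);
        return 0;
    }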


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun Dec 28 13:50:38 2025
    From Newsgroup: comp.arch

    MitchAlsup [2025-12-28 17:59:02] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> posted:
    I wish I knew how to make an ISA where the single cycle instructions
    can perform even more work like two or more dependent additions.
    Data-General Nova. However, with a modern RISC-like ISA, there are not
    enough small-shifts to amortize--except in the memory addressing arena
    where scaled indexing saves instructions and cycles.

    My thoughts were something along the lines of having fat instructions
    like a 3-in 2-out 2-op instruction that does:

    Rd1 <= Rs1 OP1 Rs2;
    Rd2 <= Rs3 OP2 Rd1;

    so your datapath has two ALUs back to back in a single cycle. And the
    problem is that it's often hard to find something useful to do in that
OP2. To increase the use of OP2 you need to allow as many combinations
of OP1 and OP2 as possible, and that quickly bumps into the constraint
that OP1+OP2 are done in a single cycle, so neither OP1 nor OP2 can
usefully be memory-access or control-flow operations.

Those 2 ALUs would likely lengthen the cycle by significantly more than
your single gate of delay, so it's important for OP2 to do useful work
most of the time; otherwise we have just increased the average N2.
    [ And then there's the impact of 3-in 2-out on the pipeline, and the fact
    that such a multi-op instruction doesn't fit in 32bit, of course. ]
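
A minimal functional sketch (mine; no real ISA's semantics) of the
3-in 2-out pairing being described, with the second ALU consuming the
first one's result in the same "cycle":

    #include <stdint.h>

    typedef uint64_t (*alu_op)(uint64_t, uint64_t);

    static uint64_t op_add(uint64_t a, uint64_t b) { return a + b; }
    static uint64_t op_xor(uint64_t a, uint64_t b) { return a ^ b; }

    /* Rd1 <= Rs1 OP1 Rs2;  Rd2 <= Rs3 OP2 Rd1
       e.g. fused2(r, op_add, op_xor, 1, 2, 3, 4, 5) */
    static void fused2(uint64_t r[], alu_op op1, alu_op op2,
                       int d1, int d2, int s1, int s2, int s3) {
        r[d1] = op1(r[s1], r[s2]);
        r[d2] = op2(r[s3], r[d1]);  /* back-to-back dependent ALU ops */
    }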


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 19:05:16 2025
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    MitchAlsup [2025-12-28 17:59:02] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> posted:
    I wish I knew how to make an ISA where the single cycle instructions
    can perform even more work like two or more dependent additions.
    Data-General Nova. However, with a modern RISC-like ISA, there are not enough small-shifts to amortize--except in the memory addressing arena where scaled indexing saves instructions and cycles.

    My thoughts were something along the lines of having fat instructions
    like a 3-in 2-out 2-op instruction that does:

    Rd1 <= Rs1 OP1 Rs2;
    Rd2 <= Rs3 OP2 Rd1;

    so your datapath has two ALUs back to back in a single cycle.

SuperSPARC tried this; it does not work "all that well".

    And the problem is that it's often hard to find something useful to do in that
    OP2.

    As SuperSPARC found out.

To increase the use of OP2 you need to allow as many combinations
of OP1 and OP2 as possible, and that quickly bumps into the constraint
that OP1+OP2 are done in a single cycle, so neither OP1 nor OP2 can
usefully be memory-access or control-flow operations.

Those 2 ALUs would likely lengthen the cycle by significantly more than
your single gate of delay, so it's important for OP2 to do useful work
most of the time; otherwise we have just increased the average N2.

One might notice that none of the SPARC generations were anywhere close to
the frequency of the more typical RISCs.

    [ And then there's the impact of 3-in 2-out on the pipeline, and the fact
    that such a multi-op instruction doesn't fit in 32bit, of course. ]

    Instruction Fusing.

    But don't do this to your ISA.....


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Dec 28 11:15:36 2025
    From Newsgroup: comp.arch

    On 12/28/2025 8:22 AM, John Savard wrote:
    On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
    John Savard <quadibloc@invalid.invalid> wrote:
    On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:

    It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,

    Oh, yes, I had always realized that, but dismissed it as far too
    wasteful.

Well, Mitch claims an average of 35 bits per instruction; that means about
    90% utilization of decoders, so not bad.

    His minimum instruction size is 32 bits, but I was going for 16 bits.

Also, consider that the alternative to variable-length instructions is to
    use longer instructions or more of them.

    What I did instead was use variable-length instructions, but add a prefix
at the beginning of any 256-bit block of instructions that contained
them, which directly showed where each instruction began.

My intent was to avoid the disadvantages you identify for fixed-length
instructions while also avoiding the disadvantages of variable-length
instructions.

    I understand your goal, however . . .

    How many bits do you "waste" on the prefix?

Since I think any branch must target the beginning of a block, and in
general a routine will not end on a block boundary, there will be
"wasted" bits at the end of the last block before a "label". Have you
determined, for a "typical" program, how many bits are wasted due to this?

    The point I am making is that you will "cancel" at least some of the
    savings of 16 bit instructions. You should take this into account
    before committing to your plan.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Dec 28 22:20:52 2025
    From Newsgroup: comp.arch

    On Sun, 28 Dec 2025 19:05:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


One might notice that none of the SPARC generations were anywhere
    close to the frequency of the more typical RISCs.


    What "more typical RISC" do you have in mind?

Wikipedia says that UltraSPARC was launched in Nov 1995 and had
a frequency of up to 200 MHz.

For comparison, MIPS R8000 reached 90 MHz in mid 1995. MIPS R10000 was
launched in Jan 1996 at a maximum frequency of 195 MHz.

PA-8000 was introduced in Nov 1995 and shipped in 1996 at a maximum
frequency of 180 MHz.

    20 years later (Oct 2015) SPARC M7 was launched at 4.13 GHz,
    pretty close to contemporary POWER.
    22 years later (Sep 2017) SPARC T8-1 was launched at 5.0 GHz, faster
    than contemporary POWER.

And I am omitting Fujitsu, which also sometimes went for relatively
high clock speeds.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 28 12:37:47 2025
    From Newsgroup: comp.arch

    On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
    On 12/21/2025 1:21 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.

Since you actually worked at AMD, presumably you know why I'm mistaken here...

when I read this, I thought that there was a standard technique for doing
    stuff like that in a GBOoO machine.

    There is::: it is called "load 'em up, pass 'em through". That is no
    different than any other calculation, except that no mangling of the
    bits is going on.

Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you would
know all about that, there must be a reason why it doesn't apply in these
cases.

x86 has short/small MOV instructions; not so with RISCs.

Does your EMS use a so-called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc.

The 2-operand+displacement LD/STs have a lock bit in the instruction--that
is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.

Oh, and it's ESM, not EMS. Exotic Synchronization Method.

    In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
    {A.K.A. a prefix}.

    Thanks for the clarification.

    On x86/x64 LOCK XADD is a loopless wait free operation.

    I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
impl. If we're on another system and that LOCK XADD is some sort of LL/SC
    "style" loop, well, that causes damage to my loopless claim... ;^o

    So, can your system get wait free semantics for RMW atomics?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 21:32:38 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/28/2025 8:22 AM, John Savard wrote:
    On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
    John Savard <quadibloc@invalid.invalid> wrote:
    On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:

    It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,

    Oh, yes, I had always realized that, but dismissed it as far too
    wasteful.

Well, Mitch claims an average of 35 bits per instruction; that means about
    90% utilization of decoders, so not bad.

    His minimum instruction size is 32 bits, but I was going for 16 bits.

Also, consider that the alternative to variable-length instructions is to
    use longer instructions or more of them.

What I did instead was use variable-length instructions, but add a prefix at the beginning of any 256-bit block of instructions that contained them, which directly showed where each instruction began.

My intent was to avoid the disadvantages you identify for fixed-length instructions while also avoiding the disadvantages of variable-length instructions.

    I understand your goal, however . . .

    How many bits do you "waste" on the prefix?

    For code like:: (undoctored compiler output)

.LBB1_34:
        mov     r1,#0
        br      .LBB1_35
.LBB1_39:
        mov     r1,#28
        exit    r16,r0,0,32
.LBB1_36:
        mov     r1,#0
        exit    r16,r0,0,32
.LBB1_37:
        call    processRestart
        beq0    r1,.LBB1_38
.LBB1_35:
        exit    r16,r0,0,32
.LBB1_38:
        lduh    r1,[ip,gRestartsLeft]
        br      .LBB1_2
.Lfunc_end1:

You are going to eat a lot of headers for 2-instruction BBs.

    Also:: Can you return to the middle of a block ??
    Can you make multiple CALLs from a single block ??
    Can you fit an entire loop in a single block ??
    Can you fit a loop with calls in a single block ??
    Does each unique label of a switch(i) require its own block ??

    Since I think any branch must target the beginning of a block, and in general, a routine will not end on a block boundary, there will be
    "wasted" bits at the end of the last block before a "label". Have you determined for a "typical" program, how many bits are wasted due to this?

    The point I am making is that you will "cancel" at least some of the
    savings of 16 bit instructions. You should take this into account
    before committing to your plan.

    I suspect the block-boundaries will consume more space than the 16-bit instructions can possibly save.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 22:04:45 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

[...]

    On x86/x64 LOCK XADD is a loopless wait free operation.

I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless impl. If we're on another system and that LOCK XADD is some sort of LL/SC "style" loop, well, that causes damage to my loopless claim... ;^o

    So, can your system get wait free semantics for RMW atomics?

    A::

    ATOMIC-to-Memory-size [address]
    ADD Rd,--,#1

Will attempt an ATOMIC add in the L1 cache. If the line is writeable, the
ADD is performed and the line updated. Otherwise, the Add-to-memory of #1
is shipped out over the memory hierarchy. When the operation runs into a
cache containing [address] in the writeable state, the add is performed
and the previous value returned. If [address] is not writeable, the cache
line is invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified}, which is typical.}

When [address] reaches the Memory-Controller it is scheduled in arrival
order; other caches system-wide will receive CI, and modified lines
will be pushed back to the DRAM-Controller. When the CI is "performed",
the MC/DRC will perform the add of #1 to [address] and the previous
value is returned as its result.

    {{That is the ADD is performed where the data is found in the
    memory hierarchy, and the previous value is returned as result;
    with all cache-effects and coherence considered.}}

    A HW guy would not call this wait free--since the CPU is waiting
    until all the nuances get sorted out, but SW will consider this
    wait free since SW does not see the waiting time unless it uses
    a high precision timer to measure delay.
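
For what the software side of that looks like: a single loopless
fetch-and-add in portable C11 (my sketch; on My 66000 the compiler would
presumably emit the ATOMIC-modified ADD described above, but this exact
code sequence is my assumption, not an official one):

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t counter;

    uint64_t bump(void) {
        /* One RMW operation, no retry loop: returns the previous value,
           with the add performed wherever the line is found in the
           memory hierarchy. */
        return atomic_fetch_add_explicit(&counter, 1,
                                         memory_order_seq_cst);
    }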
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 28 14:17:49 2025
    From Newsgroup: comp.arch

    On 12/28/2025 2:04 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

[...]

    On x86/x64 LOCK XADD is a loopless wait free operation.

    I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
impl. If we're on another system and that LOCK XADD is some sort of LL/SC
    "style" loop, well, that causes damage to my loopless claim... ;^o

    So, can your system get wait free semantics for RMW atomics?

    A::

    ATOMIC-to-Memory-size [address]
    ADD Rd,--,#1

Will attempt an ATOMIC add in the L1 cache. If the line is writeable, the
ADD is performed and the line updated. Otherwise, the Add-to-memory of #1
is shipped out over the memory hierarchy. When the operation runs into a
cache containing [address] in the writeable state, the add is performed
and the previous value returned. If [address] is not writeable, the cache
line is invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified}, which is typical.}

When [address] reaches the Memory-Controller it is scheduled in arrival
order; other caches system-wide will receive CI, and modified lines
will be pushed back to the DRAM-Controller. When the CI is "performed",
the MC/DRC will perform the add of #1 to [address] and the previous
value is returned as its result.

    {{That is the ADD is performed where the data is found in the
    memory hierarchy, and the previous value is returned as result;
    with all cache-effects and coherence considered.}}

    A HW guy would not call this wait free--since the CPU is waiting
    until all the nuances get sorted out, but SW will consider this
    wait free since SW does not see the waiting time unless it uses
    a high precision timer to measure delay.

    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 28 14:20:44 2025
    From Newsgroup: comp.arch

    On 12/28/2025 2:17 PM, Chris M. Thomasson wrote:
    On 12/28/2025 2:04 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
    On 12/21/2025 1:21 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side)-a-a needs MOV {Rm,Rn,Rj}, >>>>>>>>> {R1..R3}

    For loop iterations-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a needs MOV {Rm,Rn,Rj},
    {Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the >>>>>>>>> constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm >>>>>>>> mistaken
    here...

    when I read this, I thought that there was a standard technique for >>>>>>>> doing
    stuff like that in a GBOoO machine.

    There is::: it is called "load 'em up, pass 'em through". That is no >>>>>>> different than any other calculation, except that no mangling of the >>>>>>> bits is going on.

    -a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a Just break down all the fancy
    instructions into RISC-style pseudo-ops. But apparently, since you >>>>>>>> would
    know all about that, there must be a reason why it doesn't apply in >>>>>>>> these
    cases.

    x86 has short/small MOV instructions, Not so with RISCs.

    Does your EMS use a so called LOCK MOV? For some damn reason I
    remember
    something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect.. >>>>>
    The 2-operand+displacement LD/STs have a lock bit in the instruction-- >>>>> that
    is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.

    Oh, and its ESM not EMS. Exotic Synchronization Method.

    In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-
    Modifier
    {A.K.A. a prefix}.

    Thanks for the clarification.

    On x86/x64 LOCK XADD is a loopless wait free operation.

    I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless >>> impl. If we on another system and that LOCK XADD is some sort of LL/SC
    "style" loop, well, that causes damage to my loopless claim... ;^o

    So, can your system get wait free semantics for RMW atomics?

    A::

    -a-a-a-a-a ATOMIC-to-Memory-size-a [address]
    -a-a-a-a-a ADD-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a Rd,--,#1

    Will attempt a ATOMIC add to L1 cache. If line is writeable, ADD is
    performed and line updated. Otherwise, the Add-to-memory #1 is shipped
    out over the memory hierarchy. When the operation runs into a cache
    containing [address] in the writeable-state the add is performed and
    the previous value returned. If [address] is not writeable the cache
    line in invalidated and the search continues outward. {This protocol
    depends on writeable implying {exclusive or modified} which is typical.}

    When [address] reached Memory-Controller it is scheduled in arrival
    order, other caches system wide will receive CI, and modified lines
    will be pushed back to DRAM-Controller. When CI is "performed" MC/
    DRC will perform add #1 to [address] and previous value is returned
    as its result.

    {{That is the ADD is performed where the data is found in the
    memory hierarchy, and the previous value is returned as result;
    with all cache-effects and coherence considered.}}

    A HW guy would not call this wait free--since the CPU is waiting
    until all the nuances get sorted out, but SW will consider this
    wait free since SW does not see the waiting time unless it uses
    a high precision timer to measure delay.

    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)

    Fwiw, I noticed that a certain compiler was implementing LOCK XADD with
    a LOCK CMPXCHG loop and got a little pissed. Had to tell them about it:

    read all when you get some free time to burn:

    https://forum.pellesc.de/index.php?topic=7167.msg27217#msg27217
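
The difference being complained about, sketched in C11 (my illustration,
not the compiler's actual output): a conforming x86-64 compiler can emit
a single LOCK XADD for the first form, while the second is the LOCK
CMPXCHG retry loop he found.

    #include <stdatomic.h>

    static _Atomic long n;

    long fadd_direct(void) {        /* one LOCK XADD, loopless */
        return atomic_fetch_add(&n, 1);
    }

    long fadd_cas_loop(void) {      /* LOCK CMPXCHG retry loop */
        long old = atomic_load(&n);
        while (!atomic_compare_exchange_weak(&n, &old, old + 1))
            ;                       /* old is refreshed on each failure */
        return old;
    }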
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Dec 28 17:48:07 2025
    From Newsgroup: comp.arch

    John Savard wrote:
    On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
    John Savard <quadibloc@invalid.invalid> wrote:
    On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:

    It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,
    Oh, yes, I had always realized that, but dismissed it as far too
    wasteful.
Well, Mitch claims an average of 35 bits per instruction; that means about
    90% utilization of decoders, so not bad.

    His minimum instruction size is 32 bits, but I was going for 16 bits.

Also, consider that the alternative to variable-length instructions is to
    use longer instructions or more of them.

    What I did instead was use variable-length instructions, but add a prefix
at the beginning of any 256-bit block of instructions that contained
them, which directly showed where each instruction began.

My intent was to avoid the disadvantages you identify for fixed-length instructions while also avoiding the disadvantages of variable-length instructions.

    John Savard

    I would not find that block prefix to be helpful.

My TTL risc-ish VAX design converged on 16-bit Instruction Granules (IG-16)
for its variable-length instructions because that allowed:
(a) instructions from 1 to 6 IGs, where the longest instruction has
    a 4-byte operation specifier (opspec) with an 8-byte FP64 immediate;
(b) sustained fetch/parse and decode of 1 instruction per 200 ns clock;
(c) fitting the Fetch/Parse unit on a single VAX-sized PCB (15" x 15")
    and the Decode unit on a second board.

The ISA has a 32-bit instruction pointer, 16 32-bit integer registers,
16 64-bit floating-point registers, and a 32-bit virtual address space.
The 4-bit register specifiers are necessary to fit into 16-bit granules.
There is currently no flags register.

My requirement for parsing the instruction length of 1-6 granules is only
that it is *encoded someplace in the first 12 bits of the first IG*.
But this doesn't require a full 12:4096 decoder, as it is only certain
specific patterns that it must detect to extract the granule count.
This length parse can be done by a PLA with a small amount of glue logic.

This leaves the maximum number of IG-1 encodings available for
instructions that can fit into the smallest size. And the instructions
I put into IG-1 formats are all of the highest frequency of occurrence.

    IG-1 instructions have 3 operation code (opcode) formats:
    - oc16 16-bit opcode
    - oc12-1R 12-bit opcode 1 register
    - oc8-2R 8-bit opcode 2 registers

    Some oc8-2R instructions are used for what I call "accumulate" style
    where one register is both source and destination,
    and these 2-register operations cover about 40% of the instruction usage:
    ADDA Rsd1 = Rsd1 + Rs2
    SUBA Rsd1 = Rsd1 - Rs2
    also for integer MULS, MULU, DIVS, DIVU AND, OR, XOR,
    and floating point FADDS, FADDD, FSUBS, FSUBD, ...

    Also certain loads and stores for specific high frequency data types
    which are signed byte, 32-bit word, fp32 single and fp64 double,
    and which use direct *ptr addressing can fit into oc8-2R format:

    LDB Rd1,[Rs2] Load Byte sign extend
    LDW Rd1,[Rs2] Load Word (32-bit)
    FLDS Fd1,[Rs2] Float Load Single
    FLDD Fd1,[Rs2] Float Load Double

    and same for stores, which covers about 10-15% of the LD and ST.

    Then I look at the IG-2 formats and pack as many of the
    remaining highest frequency instructions into 32 bits.
    And so on.
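
A toy sketch of the kind of length parse described above; the bit
patterns here are hypothetical (EricP gives the constraint, not the
encodings), but they show why a small PLA suffices: only a few patterns
in the first granule select among the 1-6 granule lengths.

    #include <stdint.h>

    /* Hypothetical: the top 3 bits of the first IG-16 select a length
       class; a real design would match a handful of PLA patterns drawn
       from the first 12 bits. */
    static int ig_length(uint16_t first_granule) {
        switch (first_granule >> 13) {
        case 0: case 1: case 2: case 3:
                 return 1;   /* IG-1: oc16 / oc12-1R / oc8-2R   */
        case 4:  return 2;   /* IG-2                            */
        case 5:  return 3;   /* IG-3                            */
        case 6:  return 4;   /* e.g. opspec + 32-bit immediate  */
        default: return 6;   /* 4-byte opspec + 8-byte FP64 imm */
        }
    }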



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 28 22:51:04 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims an average of 35 bits per instruction; that means about
    90% utilization of decoders, so not bad.
    His minimum instruction size is 32 bits, but I was going for 16 bits.

    BTW, my understanding of Mitch's design is that this is related to instruction complexity: if you support 16bit instructions, it means you support instructions which presumably don't do very much work because
    it's hard to express a lot of "work to do" in 16bit.

    A bit of statistics on that.

Using a primitive Perl script to catch occurrences, on a recent
My 66000 compiler, of the shape

    [op] Ra,Ra,Rb
    [op] Ra,Rb,Ra
    [op] Ra,#n,Ra
    [op] Ra,Ra,#n
    [op] Ra,Rb

    where |n| < 32, which could be a reasonable approximation of a
    compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
    and 23.9% (GSL) of such instructions. Potential space savings
    would be a bit less than half that.
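
Where "a bit less than half" comes from, as I read it: each matched
instruction would shrink from 32 to 16 bits, against an average of ~35
bits per instruction, so a matched fraction f saves roughly f * 16/35
of total code size.

    #include <stdio.h>

    int main(void) {
        double f[] = {0.149, 0.166, 0.239};  /* Perl, gnuplot, GSL */
        for (int i = 0; i < 3; i++)
            printf("matched %4.1f%% -> size saving ~%4.1f%%\n",
                   100 * f[i], 100 * f[i] * 16.0 / 35.0);
        return 0;
    }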

    Better compression schemes are certainly possible, but I think the disadvantages of having more complex encodings outweigh any
    potential savings in instruction size.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Dec 28 15:19:41 2025
    From Newsgroup: comp.arch

    On 12/28/2025 1:32 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/28/2025 8:22 AM, John Savard wrote:
    On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
    John Savard <quadibloc@invalid.invalid> wrote:
    On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:

It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,

    Oh, yes, I had always realized that, but dismissed it as far too
    wasteful.

Well, Mitch claims an average of 35 bits per instruction; that means about
    90% utilization of decoders, so not bad.

    His minimum instruction size is 32 bits, but I was going for 16 bits.

Also, consider that the alternative to variable-length instructions is to
    use longer instructions or more of them.

What I did instead was use variable-length instructions, but add a prefix
at the beginning of any 256-bit block of instructions that contained
them, which directly showed where each instruction began.

My intent was to avoid the disadvantages you identify for fixed-length
instructions while also avoiding the disadvantages of variable-length
instructions.

    I understand your goal, however . . .

    How many bits do you "waste" on the prefix?

[...]

    Since I think any branch must target the beginning of a block, and in
    general, a routine will not end on a block boundary, there will be
    "wasted" bits at the end of the last block before a "label". Have you
    determined for a "typical" program, how many bits are wasted due to this?

    The point I am making is that you will "cancel" at least some of the
    savings of 16 bit instructions. You should take this into account
    before committing to your plan.

    I suspect the block-boundaries will consume more space than the 16-bit instructions can possibly save.

    I agree with your suspicions. I was trying to get John to reach that
    same conclusion. :-)
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 23:48:43 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/28/2025 8:22 AM, John Savard wrote:
    On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
    John Savard <quadibloc@invalid.invalid> wrote:
    On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:

    It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,

    Oh, yes, I had always realized that, but dismissed it as far too
    wasteful.

Well, Mitch claims an average of 35 bits per instruction; that means about
    90% utilization of decoders, so not bad.

    His minimum instruction size is 32 bits, but I was going for 16 bits.

Also, consider that the alternative to variable-length instructions is to
    use longer instructions or more of them.

What I did instead was use variable-length instructions, but add a prefix at the beginning of any 256-bit block of instructions that contained them, which directly showed where each instruction began.

My intent was to avoid the disadvantages you identify for fixed-length instructions while also avoiding the disadvantages of variable-length instructions.

    I understand your goal, however . . .

    How many bits do you "waste" on the prefix?

    Since I think any branch must target the beginning of a block, and in general, a routine will not end on a block boundary, there will be
    "wasted" bits at the end of the last block before a "label". Have you determined for a "typical" program, how many bits are wasted due to this?

    The point I am making is that you will "cancel" at least some of the
    savings of 16 bit instructions. You should take this into account
    before committing to your plan.

    Consider "Duff's Device"::

{
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}

It seems to me that each *to = *from++; will be in its own block ?!?!
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 23:53:09 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

[...]

    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)

If you do it LL/SC-style you HAVE to bring the data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under
contention. So you DON'T DO IT LIKE THAT.

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Dec 28 18:41:17 2025
    From Newsgroup: comp.arch

    On 12/28/2025 5:53 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/28/2025 2:04 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
    On 12/21/2025 1:21 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side)-a-a needs MOV {Rm,Rn,Rj},{R1..R3}

    For loop iterations-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the >>>>>>>>>> constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken
    here...

    when I read this, I thought that there was a standard technique for >>>>>>>>> doing
    stuff like that in a GBOoO machine.

    There is::: it is called "load 'em up, pass 'em through". That is no >>>>>>>> different than any other calculation, except that no mangling of the >>>>>>>> bits is going on.

    -a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a Just break down all the fancy
    instructions into RISC-style pseudo-ops. But apparently, since you >>>>>>>>> would
    know all about that, there must be a reason why it doesn't apply in >>>>>>>>> these
    cases.

    x86 has short/small MOV instructions, Not so with RISCs.

    Does your EMS use a so called LOCK MOV? For some damn reason I remember >>>>>>> something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect.. >>>>>>
    The 2-operand+displacement LD/STs have a lock bit in the instruction-- >>>>>> that
    is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant. >>>>>>
    Oh, and its ESM not EMS. Exotic Synchronization Method.

    In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
    {A.K.A. a prefix}.

    Thanks for the clarification.

    On x86/x64 LOCK XADD is a loopless wait free operation.

    I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless >>>> impl. If we on another system and that LOCK XADD is some sort of LL/SC >>>> "style" loop, well, that causes damage to my loopless claim... ;^o

    So, can your system get wait free semantics for RMW atomics?

    A::

    ATOMIC-to-Memory-size [address]
    ADD Rd,--,#1

    Will attempt a ATOMIC add to L1 cache. If line is writeable, ADD is
    performed and line updated. Otherwise, the Add-to-memory #1 is shipped
    out over the memory hierarchy. When the operation runs into a cache
    containing [address] in the writeable-state the add is performed and
    the previous value returned. If [address] is not writeable the cache
    line in invalidated and the search continues outward. {This protocol
    depends on writeable implying {exclusive or modified} which is typical.} >>>
    When [address] reached Memory-Controller it is scheduled in arrival
    order, other caches system wide will receive CI, and modified lines
    will be pushed back to DRAM-Controller. When CI is "performed" MC/
    DRC will perform add #1 to [address] and previous value is returned
    as its result.

    {{That is the ADD is performed where the data is found in the
    memory hierarchy, and the previous value is returned as result;
    with all cache-effects and coherence considered.}}

    A HW guy would not call this wait free--since the CPU is waiting
    until all the nuances get sorted out, but SW will consider this
    wait free since SW does not see the waiting time unless it uses
    a high precision timer to measure delay.

    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)

    If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under contention. So you DON'T DO IT LIKE THAT.

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

IMHO:
No-Cache + CAS is probably a better bet than LL/SC;
LL/SC: depends on the existence of explicit memory-coherency features.
No-Cache + CAS: can be made to work independent of the underlying memory model.

Granted, No-Cache is its own feature:
one needs some way to indicate to the L1 cache that special handling is
needed for this memory access and cache line (that it should not use a
previously cached value, and should be flushed immediately once the
operation completes).


But No-Cache behavior is much easier to fake on a TSO-capable memory
subsystem than it is to accurately fake LL/SC on top of weak-model
write-back caches.

    If the memory system implements TSO or similar, then one can simply
    ignore the No-Cache behavior and achieve the same effect.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Dec 29 09:54:30 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims an average of 35 bits per instruction; that means about
    90% utilization of decoders, so not bad.
    His minimum instruction size is 32 bits, but I was going for 16 bits.
    BTW, my understanding of Mitch's design is that this is related to
    instruction complexity: if you support 16bit instructions, it means you
    support instructions which presumably don't do very much work because
    it's hard to express a lot of "work to do" in 16bit.

    A bit of statistics on that.

Using a primitive Perl script to catch occurrences, on a recent
My 66000 compiler, of the shape

    [op] Ra,Ra,Rb
    [op] Ra,Rb,Ra
    [op] Ra,#n,Ra
    [op] Ra,Ra,#n
    [op] Ra,Rb

    where |n| < 32, which could be a reasonable approximation of a
    compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
    and 23.9% (GSL) of such instructions. Potential space savings
    would be a bit less than half that.

    Better compression schemes are certainly possible, but I think the disadvantages of having more complex encodings outweigh any
    potential savings in instruction size.

    The % associations you measured above might just be coincidence.

I have assumed that, for a compiler to choose between two instruction
formats, a 2-register Rsd1 = Rsd1 OP Rs2 and a 3-register Rd1 = Rs2 OP Rs3,
the register allocator would check if either source operand was live after
the OP, and if not, that source register can be reused as the dest.
For some ISAs that may allow a shorter instruction format to be used.

Your stats above assume the compiler is performing this optimization,
but since My 66000 does not have short-format instructions the compiler
would have no reason to do so. Or the compiler might be doing this
optimization anyway for other ISAs such as x86/x64 which do have
shorter formats.

So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2-register and long 3-register formats like RV,
where there is an incentive to do this optimization, might provide
statistical confirmation.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Dec 29 13:29:58 2025
    From Newsgroup: comp.arch

    My thoughts were something along the lines of having fat instructions
    like a 3-in 2-out 2-op instruction that does:

    Rd1 <= Rs1 OP1 Rs2;
    Rd2 <= Rs3 OP2 Rd1;

    so your datapath has two ALUs back to back in a single cycle.
    SuperSPARC tried this, it does not work "all that well".

    Do you have a reference to that? I can't see any trace of that in the
    SPARC ISA, so I assume it was done via instruction fusion instead?

One might notice that none of the SPARC generations were anywhere close to the frequency of the more typical RISCs.

    Hmm... I remember Sun being slower to move to OoO, but in terms of
    frequency I thought they were mostly on par with other RISCs of the time
    (and back then, SPARC was one of the top two "typical RISCs", AFAIK).


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Dec 29 13:41:11 2025
    From Newsgroup: comp.arch

    BTW, when discussing ISA compactness, I usually see it measured by
    comparing the size of the code segment in typical executables.

    I understand that it's as good a measure as any and it's one that's
    fairly easily available, but at the same time it's not necessarily one
    that actually matters since I expect that it affects a usually fairly
    small proportion of the total ROM/Flash/disk space.

    I wonder if there have been other studies to explore other impacts such
    as run time, or cache miss rate.


    Stefan


    EricP [2025-12-29 09:54:30] wrote:
    [...]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Dec 29 18:35:14 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Thomas Koenig wrote:
    Using a primitive Perl script to catch occurrences, on a recent
    My 66000 compiler, of the shape

    [op] Ra,Ra,Rb
    [op] Ra,Rb,Ra
    [op] Ra,#n,Ra
    [op] Ra,Ra,#n
    [op] Ra,Rb

    where |n| < 32, which could be a reasonable approximation of a
    compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
    and 23.9% (GSL) of such instructions. Potential space savings
    would be a bit less than half that.

    Better compression schemes are certainly possible, but I think the
    disadvantages of having more complex encodings outweigh any
    potential savings in instruction size.

    The RISC-V people brag about how little their compressed encoding
    costs to decode; IIRC it's in the hundreds of something (not sure if transistors or gates). Of course, with superscalar decoding the
    compressed instruction set costs additional decoders plus logic to
    select which decodings do not belong to actual instructions, but
    that's true for any 16+32-bit encoding, however simple.

    So the % numbers you measured might just be coincidence and could be low.
    An ISA with both short 2- and long 3- register formats like RV where there
    is an incentive to do this optimization might provide stats confirmation.

    I have done the following on a RV64GC system with Fedora 33:

    objdump -d /lib64/lp64d/libperl.so.5.32|grep '^ *[0-9a-f]*:'|awk '{print length($2)}'|sort|uniq -c
    215782 4
    179493 8

    16-bit instructions are reported as 4 (4 hex digits), 32-bit
    instructions are reported as 8.
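
    For reference, this classification falls directly out of RISC-V's
    length encoding: bits [1:0] of the first 16-bit parcel are 0b11 for
    32-bit (and longer) encodings, and anything else marks a compressed
    16-bit instruction. A minimal sketch, covering only the RV64GC case
    where just these two lengths occur:

    #include <cstdint>

    // Returns the instruction length in bytes from the first 16-bit parcel.
    int rvInstrLengthBytes(uint16_t firstParcel) {
        return ((firstParcel & 0x3) == 0x3) ? 4 : 2;
    }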

    If the actual binary /usr/bin/perl is meant, here's the stats for that:

    objdump -d /usr//bin/perl|grep '^ *[0-9a-f]*:'|awk '{print length($2)}'|sort|uniq -c
    105 4
    167 8

    gnuplot is not installed, and GSL is not installed, either, whatever
    it may be.

    Just to widen the basis, here are a few more:

    zstd:
    129569 4
    134985 8

    git:
    305090 4
    274053 8

    /usr/lib64/libc-2.32.so:
    142208 4
    113455 8

    So the percentage of 16-bit instructions (roughly 55% for libperl, 49%
    for zstd, 53% for git, and 56% for libc) is a lot higher than for the
    schemes that Thomas Koenig has looked at.

    Another way to approach this question is to look at the current
    champion of fixed instruction width, ARM A64, consider those
    instructions (and addressing modes) that ARM A64 has and RISC-V does
    not have, and look at how often they are used, and how many RISC-V
    instructions are needed to replace them.

    In any case, code density measurements show that both result in
    compact code, with RV64GC having more compact code, and actually
    having the most compact code among the architectures present in all
    rounds of my measurements where RV64GC was present.

    But code size is not everything. For ARM A64, you pay for it by the
    increased complexity of implementing these instructions (in particular
    the many register ports) and addressing modes. For bigger
    implementations, instruction combining means additional front-end
    effort for RISC-V, and then maybe similar implementation effort for
    the combined instructions as for ARM A64 (but more flexibility in
    selecting which instructions to combine). And, as mentioned above,
    the additional decoding effort.

    When we look at actual implementations, RISC-V has not reached the
    widths that ARM A64 has reached, but I guess that this is more due to
    the current potential markets for these two architectures than due to
    technical issues. RISC-V seems to be pushing into server space
    lately, so we may see wider implementations in the not-too-far future.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Mon Dec 29 19:45:00 2025
    From Newsgroup: comp.arch

    In article <jwvpl7xkunj.fsf-monnier+comp.arch@gnu.org>, monnier@iro.umontreal.ca (Stefan Monnier) wrote:

    I wonder if there have been other studies to explore other impacts
    such as run time, or cache miss rate.

    The difficulty there is standardising the input data, and normalising
    processor performance, memory bandwidth and latency, etc. Code segment
    size is much easier to measure.

    John
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 29 19:55:07 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 12/28/2025 5:53 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/28/2025 2:04 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
    On 12/21/2025 1:21 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) one needs MOV {Rm,Rn,Rj},{R1..R3}

    For loop iterations one needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken
    here...

    when I read this, I thought that there was a standard technique for
    doing stuff like that in a GBOoO machine.

    There is::: it is called "load 'em up, pass 'em through". That is no
    different than any other calculation, except that no mangling of the
    bits is going on.

    Just break down all the fancy
    instructions into RISC-style pseudo-ops. But apparently, since you
    would know all about that, there must be a reason why it doesn't apply
    in these cases.

    x86 has short/small MOV instructions; not so with RISCs.

    Does your EMS use a so-called LOCK MOV? For some damn reason I remember
    something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc.
    The 2-operand+displacement LD/STs have a lock bit in the instruction--
    that is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
    Oh, and it's ESM, not EMS: Exotic Synchronization Method.

    In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
    {A.K.A. a prefix}.

    Thanks for the clarification.

    On x86/x64 LOCK XADD is a loopless wait free operation.

    I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
    impl. If we're on another system and that LOCK XADD is some sort of LL/SC
    "style" loop, well, that causes damage to my loopless claim... ;^o

    So, can your system get wait free semantics for RMW atomics?

    A::

    ATOMIC-to-Memory-size [address]
    ADD Rd,--,#1

    Will attempt an atomic ADD in the L1 cache. If the line is writeable, the
    ADD is performed and the line updated. Otherwise, the Add-to-memory #1 is
    shipped out over the memory hierarchy. When the operation runs into a
    cache containing [address] in the writeable state, the add is performed
    and the previous value returned. If [address] is not writeable, the cache
    line is invalidated and the search continues outward. {This protocol
    depends on writeable implying {exclusive or modified}, which is typical.}
    When [address] reaches the Memory-Controller it is scheduled in arrival
    order, other caches system-wide will receive a CI, and modified lines
    will be pushed back to the DRAM-Controller. When the CI is "performed",
    the MC/DRC will perform the add of #1 to [address], and the previous
    value is returned as its result.

    {{That is, the ADD is performed where the data is found in the
    memory hierarchy, and the previous value is returned as the result,
    with all cache effects and coherence considered.}}

    A HW guy would not call this wait free--since the CPU is waiting
    until all the nuances get sorted out, but SW will consider this
    wait free since SW does not see the waiting time unless it uses
    a high precision timer to measure delay.

    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)

    If you do it LL/SC-style you HAVE to bring data to "this" particular
    CPU, and that (all by itself) causes n^2 to n^3 "bus" traffic under
    contention. So you DON'T DO IT LIKE THAT.

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    IMHO:
    No-Cache + CAS is probably a better bet than LL/SC;
    LL/SC: Depends on the existence of explicit memory-coherency features.
    No-Cache + CAS: Can be made to work independent of the underlying memory
    model.

    Granted, No-Cache is its own feature:
    Need some way to indicate to the L1 cache that special handling is
    needed for this memory access and cache line (that it should not use a previously cached value and should be flushed immediately once the
    operation completes).


    But, No-Cache behavior is much easier to fake on a TSO-capable memory
    subsystem than it is to accurately fake LL/SC on top of weak-model
    write-back caches.

    My 66000 does not have a TSO memory system, but when one of these
    things shows up, it switches to sequential consistency, and when it is
    done it flips back to causal consistency.

    TSO is cycle-wasteful.

    If the memory system implements TSO or similar, then one can simply
    ignore the No-Cache behavior and achieve the same effect.

    ..


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 29 20:10:24 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Thomas Koenig wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    Well, Mitch claims average 35 bits per instruction, that means about
    90% utilization of decoders, so not bad.
    His minimum instruction size is 32 bits, but I was going for 16 bits.
    BTW, my understanding of Mitch's design is that this is related to
    instruction complexity: if you support 16-bit instructions, it means you
    support instructions which presumably don't do very much work, because
    it's hard to express a lot of "work to do" in 16 bits.

    A bit of statistics on that.

    Using a primitive Perl script to catch occurrences, on a recent
    My 66000 compiler, of the shape

    [op] Ra,Ra,Rb
    [op] Ra,Rb,Ra
    [op] Ra,#n,Ra
    [op] Ra,Ra,#n
    [op] Ra,Rb

    where |n| < 32, which could be a reasonable approximation of a
    compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
    and 23.9% (GSL) of such instructions. Potential space savings
    would be a bit less than half that.

    Better compression schemes are certainly possible, but I think the disadvantages of having more complex encodings outweigh any
    potential savings in instruction size.

    The % associations you measured above might just be coincidence.

    I have assumed for a compiler to choose between two instruction formats,
    a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
    that the register allocator would check if either operand was alive after
    the OP, and if not then that source register can be reused as the dest.
    For some ISA that may allow a shorter instruction format to be used.

    In my <relative> youth:: I used the k-register notation regularly,
    but as I got older I found it less and less appealing. In time I
    started using k-Operand notation instead so your typical RISC
    calculation instruction would be known as 2-operand {with the
    single 1-result implied}. This notation is more satisfying to me.
    Sometimes I specify both {3-operand:1-result} to remove any chance
    of misinterpreting the intended message.

    It also works better with my CARRY instruction-Modifier as CARRY can
    supply one more Operand and consume one more result (both optional).

    Your stats above assume the compiler is performing this optimization
    but since My 66000 does not have short format instructions the compiler
    would have no reason to do so. Or the compiler might be doing this optimization anyways for other ISA such as x86/x64 which do have
    shorter formats.

    So the % numbers you measured might just be coincidence and could be low.

    What you are stating is:: "When the ISA is screwed up" there are "More opportunities to Optimize".

    An ISA with both short 2- and long 3- register formats like RV where there
    is an incentive to do this optimization might provide stats confirmation.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Dec 29 15:14:26 2025
    From Newsgroup: comp.arch

    I wonder if there have been other studies to explore other impacts
    such as run time, or cache miss rate.
    The difficulty there is standardising the input data, and normalising processor performance, memory bandwidth and latency, etc.

    I was thinking of those "compressed" variants of ISAs, such as Thumb,
    Thumb2, MIPS16e, microMIPS, or the "C" option of RISC-V, where you can
    compare with/without on the very same machine (e.g., building the same
    source with -march=rv64g vs. -march=rv64gc), since all the half-size
    instructions are also available in full-size.

    Code segment size is much easier to measure.

    Yes, but!


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 29 20:16:45 2025
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    BTW, when discussing ISA compactness, I usually see it measured by
    comparing the size of the code segment in typical executables.

    I understand that it's as good a measure as any and it's one that's
    fairly easily available, but at the same time it's not necessarily one
    that actually matters since I expect that it affects a usually fairly
    small proportion of the total ROM/Flash/disk space.

    I wonder if there have been other studies to explore other impacts such
    as run time, or cache miss rate.

    Cache miss rate is proportional to SQRT(code size).

    And in my opinion::
    Small changes are essentially irrelevant when there is a wide range
    of implementations on which the code runs.


    [...]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 29 21:08:09 2025
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    My thoughts were something along the lines of having fat instructions
    like a 3-in 2-out 2-op instruction that does:

    Rd1 <= Rs1 OP1 Rs2;
    Rd2 <= Rs3 OP2 Rd1;

    so your datapath has two ALUs back to back in a single cycle.
    SuperSPARC tried this, it does not work "all that well".

    Do you have a reference to that? I can't see any trace of that in the
    SPARC ISA, so I assume it was done via instruction fusion instead?

    It is not in the ISA, and it is not "like" instruction Fusion, either.

    When a first instruction had a property*, and a second instruction also
    had a certain property*, they would be issued together into the execution
    pipeline. The first instruction executes in the first cycle, the second
    instruction in the second cycle, with forwarding of the result of the
    first to the second.

    Where property ~= 1-cycle integer, no setting of CCs, and a few other
    conditions.

    One might notice that none of the SPARC generations were anywhere close
    to the frequency of the more typical RISCs.

    Hmm... I remember Sun being slower to move to OoO, but in terms of
    frequency I thought they were mostly on par with other RISCs of the time
    (and back then, SPARC was one of the top two "typical RISCs", AFAIK).

    S/frequency/performance/g



    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 29 21:10:56 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Thomas Koenig wrote:
    Using a primitive Perl script to catch occurrences, on a recent
    My 66000 compiler, of the shape

    [op] Ra,Ra,Rb
    [op] Ra,Rb,Ra
    [op] Ra,#n,Ra
    [op] Ra,Ra,#n
    [op] Ra,Rb

    where |n| < 32, which could be a reasonable approximation of a
    compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
    and 23.9% (GSL) of such instructions. Potential space savings
    would be a bit less than half that.

    Better compression schemes are certainly possible, but I think the
    disadvantages of having more complex encodings outweigh any
    potential savings in instruction size.

    The RISC-V people brag about how little their compressed encoding
    costs to decode; IIRC it's in the hundreds of something (not sure if transistors or gates). Of course, with superscalar decoding the
    compressed instruction set costs additional decoders plus logic to
    select which decodings do not belong to actual instructions, but
    that's true for any 16+32-bit encoding, however simple.

    It may not take "that many gates", but it screws up register-specifier
    routing to the RF. They also brag that instruction fusion has a 400-gate
    cost. On the other hand, I brag that My 66000's variable-length
    instructions only need 40 gates to "decode".
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Dec 29 21:36:42 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    Well, Mitch claims average 35 bits per instruction, that means about
    90% utilization of decoders, so not bad.
    His minimum instruction size is 32 bits, but I was going for 16 bits.
    BTW, my understanding of Mitch's design is that this is related to
    instruction complexity: if you support 16-bit instructions, it means you
    support instructions which presumably don't do very much work, because
    it's hard to express a lot of "work to do" in 16 bits.

    A bit of statistics on that.

    Using a primitive Perl script to catch occurrences, on a recent
    My 66000 compiler, of the shape

    [op] Ra,Ra,Rb
    [op] Ra,Rb,Ra
    [op] Ra,#n,Ra
    [op] Ra,Ra,#n
    [op] Ra,Rb

    where |n| < 32, which could be a reasonable approximation of a
    compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
    and 23.9% (GSL) of such instructions. Potential space savings
    would be a bit less than half that.

    Better compression schemes are certainly possible, but I think the
    disadvantages of having more complex encodings outweigh any
    potential savings in instruction size.

    The % associations you measured above might just be coincidence.

    I have assumed for a compiler to choose between two instruction formats,
    a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
    that the register allocator would check if either operand was alive after
    the OP, and if not then that source register can be reused as the dest.
    For some ISA that may allow a shorter instruction format to be used.

    Compilers will try to re-use registers as much as possible, in
    other words, to avoid dead registers. If the compiler determines
    that, for the pseudo registers V1, V2 and V3,

    V1 = V2 - V3;

    V2 is no longer live after that statement, it will assign
    the same hard register to V1 and V2 (unless there are other
    considerations such as function return values) which will then
    either be translated into

    add r1,r1,-r2

    for a three-register instruction, or, for example, into

    subq %rsi, %rax

    Hmm... thinking of the statistics above, maybe I should have
    included the minus signs.

    Your stats above assume the compiler is performing this optimization
    but since My 66000 does not have short format instructions the compiler
    would have no reason to do so. Or the compiler might be doing this optimization anyways for other ISA such as x86/x64 which do have
    shorter formats.

    So the % numbers you measured might just be coincidence and could be low.
    An ISA with both short 2- and long 3- register formats like RV where there
    is an incentive to do this optimization might provide stats confirmation.

    RISC-V compressed mode also uses three-bit register numbers for
    popular registers, all of which complicates decoding and causes
    other problems which Mitch has explained previously.

    So yes, a My 66000-like instruction set with compression might be
    possible, but would almost certainly not be realized.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Dec 29 17:48:29 2025
    From Newsgroup: comp.arch

    Do you have a reference to that? I can't see any trace of that in the
    SPARC ISA, so I assume it was done via instruction fusion instead?
    It is not in ISA, and it is not "like" instruction Fusion, either.
    When a first instruction had a property*, and a second instruction also
    had a certain property*, they would be issued together into the execution pipeline. The first instruction executes in the first cycle, the second instruction in the second cycle with forwarding of the result of the first
    to the second.

    Oh, I see, thanks. Apparently they called this "cascading".


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 29 16:44:35 2025
    From Newsgroup: comp.arch

    On 12/28/2025 4:41 PM, BGB wrote:
    [...]
    IMHO:
    No-Cache + CAS is probably a better bet than LL/SC;

    Fwiw, there is a "weak" CAS in C++ std. I think it's there to handle when
    an LL/SC can spuriously fail, aka it can fail even though it should have
    succeeded... A strong CAS means that if it fails, the value observed is
    different from the comparand. This is how LOCK
    CMPXCHG/CMPXCHG8B/CMPXCHG16B acts on x86/x64.

    No, having just CAS is not ideal... Akin to what the PellesC guys did to implement a LOCK XADD using a LOCK CMPXCHG loop! I noticed it and
    brought it up:

    https://forum.pellesc.de/index.php?topic=7167.msg27217#msg27217

    CAS always implies a loop, unless it CANNOT fail spuriously. In that
    case, aka LOCK CMPXCHG, it can be used in a state machine. We know a
    failure means what it means. Not "oh shit, it failed, but we don't
    exactly know why"... a la LL/SC...
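
    To put that in C++ std::atomic terms (a sketch of the contrast, not
    the PellesC code): fetch_add can map to a single LOCK XADD on x86,
    while emulating it with CAS forces a retry loop:

    #include <atomic>

    // Loopless: maps to a single RMW (LOCK XADD on x86/x64).
    int fetch_add_direct(std::atomic<int>& a) {
        return a.fetch_add(1);
    }

    // CAS emulation needs a loop; compare_exchange_weak may also fail
    // spuriously (on LL/SC targets), which is why it only makes sense
    // inside a retry loop like this one.
    int fetch_add_via_cas(std::atomic<int>& a) {
        int old = a.load(std::memory_order_relaxed);
        while (!a.compare_exchange_weak(old, old + 1)) {
            // on failure, 'old' is updated with the observed value; retry
        }
        return old;
    }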






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 29 16:59:48 2025
    From Newsgroup: comp.arch

    On 12/28/2025 4:41 PM, BGB wrote:
    [...]

    Also, if using something like LOCK CMPXCHG you MUST make sure to align
    and pad your relevant data structures to an L2 cache line.

    Using LL/SC, you need to align and pad them up to a reservation granule.
    False sharing tends to really nail LL/SC because it can fail
    spuriously. Oh, the granule was tweaked, let's fail...

    The LOCK CMPXCHG is focusing on value comparison. It's a strong CAS in
    C++ std.

    The LL/SC is focusing on something else: the granule. It's a weak CAS in
    C++ std.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 29 17:03:49 2025
    From Newsgroup: comp.arch

    On 12/28/2025 2:20 PM, Chris M. Thomasson wrote:
    [...]
    Fwiw, I noticed that a certain compiler was implementing LOCK XADD with
    a LOCK CMPXCHG loop and got a little pissed. Had to tell them about it:

    Read it all when you get some free time to burn:

    https://forum.pellesc.de/index.php?topic=7167.msg27217#msg27217

    Using LOCK CMPXCHG (strong) we can directly implement state machines.
    Using weak CAS, aka LL/SC, well, not so easy because of the damn
    spurious failures.
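
    A minimal sketch of that distinction (a hypothetical two-state
    example): with compare_exchange_strong, failure is guaranteed to mean
    the value differed, so the observed value can drive the state machine
    without a retry loop:

    #include <atomic>

    enum State : int { IDLE = 0, BUSY = 1 };

    // With a strong CAS, failure means exactly "the state was not IDLE";
    // 'expected' then holds the observed state and can be acted on
    // directly. A weak CAS could have failed spuriously even when the
    // state really was IDLE, forcing a retry loop around the same logic.
    bool try_acquire(std::atomic<int>& state) {
        int expected = IDLE;
        return state.compare_exchange_strong(expected, BUSY);
    }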
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Dec 30 01:23:28 2025
    From Newsgroup: comp.arch

    On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
    On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:

    Or in other words, if you can decode K-instructions per cycle, you'd
    better be able to execute K-instructions per cycle--or you have a
    serious blockage in your pipeline.

    Not a typo--the part of the pipeline which is <dynamically> narrowest is
    the part that limits performance. I suggest strongly that you should not make/allow the decoder to play that part.

    I agree - and strongly, too - that the decoder ought not to be the part
    that limits performance.

    But what I quoted says that the execution unit ought not to be the part
    that limits performance, with the implication that it's OK if the decoder
    does instead. That's why I said it must be a typo.

    So I think you need to look a second time at what you wrote; it's natural
    for people to see what they expect to see, and so I think you looked at
    it, and didn't see the typo that was there.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Dec 29 19:54:41 2025
    From Newsgroup: comp.arch

    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Thomas Koenig wrote:
    Using a primitive Perl script to catch occurrences, on a recent
    My 66000 compiler, of the shape

    [op] Ra,Ra,Rb
    [op] Ra,Rb,Ra
    [op] Ra,#n,Ra
    [op] Ra,Ra,#n
    [op] Ra,Rb

    where |n| < 32, which could be a reasonable approximation of a
    compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
    and 23.9% (GSL) of such instructions. Potential space savings
    would be a bit less than half that.

    Better compression schemes are certainly possible, but I think the
    disadvantages of having more complex encodings outweigh any
    potential savings in instruction size.

    The RISC-V people brag about how little their compressed encoding
    costs to decode; IIRC it's in the hundreds of something (not sure if transistors or gates). Of course, with superscalar decoding the
    compressed instruction set costs additional decoders plus logic to
    select which decodings do not belong to actual instructions, but
    that's true for any 16+32-bit encoding, however simple.


    It has its own hair:
    Multiple schemes for encoding immediate values and displacements;
    Multiple ways to encode register fields;
    Each type of Load/Store instruction effectively has its own displacement encoding;
    ...

    I am skeptical of it being the cheapest possible, or the best possible.

    But, not much viable reason to change it either:
    Main reason to have it is compatibility;
    Compatibility would be lost with any notable design change.



    Granted, had noted that the decoders for my own ISA are more expensive
    and have worse timing at present than the RISC-V decoders, but this
    includes XG1+XG2+XG3.

    Much simplification and cost reduction would be possible if XG1 and XG2
    were dropped. It is possible if I did a new core, I might consider
    making RISC-V the primary ISA with XG3 as the secondary ISA. Would
    likely keep much of the low-level architecture similar though, with some
    level of firmware-level wonk.

    Keeping XG3 around does still make some sense:
    Better performance than RISC-V;
    Better code density than RISC-V when under similar constraints;
    Has SIMD that isn't complicated and expensive.
    If I were to support RV-V,
    it is likely to be via traps or hot-patching.
    ...


    Pros/cons: standard Linux distros seem to assume EFI rather than
    direct hardware control in many cases.

    Where, EFI allows providing more abstraction, but couldn't really be fit
    into a 32K or 48K ROM space. I suspect I would more likely need upwards
    of 200K to pull this off if EFI were done in ROM, though could probably
    stick with the existing 32K ROM if its main purpose is to load an image
    from an SDcard or similar.

    On some boards (such as the Nexys 7) it is possible in theory to load
    both the CPU's bitstream and possibly the EFI firmware into an on-board
    QSPI Flash module and leave the SDcard mostly for the OS proper (vs
    generally having both the bitstream and possible BIOS on the SDcard).


    In some ways, firmware can hide that not even all of RV64G is
    implemented in hardware, because some parts either can't be implemented effectively, or don't make sense from a cost/benefit POV to support
    natively.


    So the % numbers you measured might just be coincidence and could be low.
    An ISA with both short 2- and long 3- register formats like RV where there
    is an incentive to do this optimization might provide stats confirmation.

    I have done the following on a RV64GC system with Fedora 33:

    objdump -d /lib64/lp64d/libperl.so.5.32|grep '^ *[0-9a-f]*:'|awk '{print length($2)}'|sort|uniq -c
    215782 4
    179493 8

    16-bit instructions are reported as 4 (4 hex digits), 32-bit
    instructions are reported as 8.

    If the actual binary /usr/bin/perl is meant, here's the stats for that:

    objdump -d /usr//bin/perl|grep '^ *[0-9a-f]*:'|awk '{print length($2)}'|sort|uniq -c
    105 4
    167 8

    gnuplot is not installed, and GSL is not installed, either, whatever
    it may be.

    Just to widen the basis, here are a few more:

    zstd:
    129569 4
    134985 8

    git:
    305090 4
    274053 8

    /usr/lib64/libc-2.32.so:
    142208 4
    113455 8

    So the percentage of 16-bit instructions is a lot higher than for the
    schemes that Thomas Koenig has looked at.


    In my own testing, was seeing usually around:
    60% 32-bit
    40% 16-bit
    Resulting in typically around a 20% reduction in code size vs RV64G
    (0.6*32 + 0.4*16 = 25.6 bits average, i.e. about 80% of a fixed 32 bits).

    At least, with a compiler that doesn't specifically tailor its code
    generation to favor RV-C (and/or code that fits RV-C's patterns).


    One usual downside is that to utilize a 16-bit ISA with a smaller
    register space, one needs to reuse registers more frequently, which then reduces ILP due to register conflicts. So, smaller code at the expense
    of worse performance.



    For XG1, it was possible to tune things for a higher percentage of
    16-bit ops. Though in this case it meant largely limiting things to the
    low 16 registers except in "higher register pressure" scenarios, but
    this negatively affects speed.

    So, say (for XG1):
    R2..R15 only: Used as the default scheme;
    Had 6 scratch registers, 7 callee save, 3 SPR.
    R16..R31: Enabled if register pressure exceeds a threshold;
    R32..R63: Only enabled under very high register pressure.

    The threshold for when to enable R16..R31 differed somewhat based on
    optimization level (raised with size optimization, lowered with speed
    optimization). The threshold for R32..R63 needed to be kept higher, as
    much of the ISA only natively supported the first 32 GPRs (for other
    parts of the ISA, using the high 32 registers would require 64-bit
    encodings).
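
    A toy sketch of that tiering policy in C++ (thresholds and names are
    hypothetical, not BGBCC's actual code):

    #include <vector>

    // Widen the allocatable register set only as estimated pressure
    // crosses thresholds, so low-pressure code stays 16-bit-encodable.
    std::vector<int> allocatableRegs(int pressure, bool optimizeForSize) {
        std::vector<int> regs;
        for (int r = 2; r <= 15; ++r) regs.push_back(r);  // default: R2..R15
        int t1 = optimizeForSize ? 12 : 8;         // hypothetical thresholds
        if (pressure > t1)
            for (int r = 16; r <= 31; ++r) regs.push_back(r);
        if (pressure > 24)                         // higher bar for R32..R63
            for (int r = 32; r <= 63; ++r) regs.push_back(r);
        return regs;
    }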


    As noted, size optimization favors size, performance optimization
    favors performance, and in some places they are at odds with each other.

    Ironically, even when the binary is dominated by 16-bit ops, the
    relative code-size reductions are modest; and a fixed-length 16-bit ISA
    can actually be worse here than a 16/32 ISA.


    Something similar happens with RISC-V, except that, ironically, the
    limitations of RV-C can negatively affect size optimization as well. It
    is almost like there is one place RV-C does well:
    Small leaf functions.
    Everywhere else, it is weaker.

    And RV-C is basically too weak/limited to be used by itself as a primary
    ISA (unlike either Thumb, or XG1's 16-bit ops).



    For XG2, it made sense to use a different scheme:
    R2..R31: Always enabled by default;
    R32..R63: Enabled for high register pressure.

    Despite XG2 having 64 GPRs, always enabling all 64 GPRs had a slight
    negative effect on both code density and performance (mostly due to
    making prologs/epilogs slightly larger on average, and increasing the
    size of the average stack frame).

    However, always using 32 GPRs was generally better for performance than
    using 32 GPRs sparingly.


    For XG3, it is basically a similar scheme to RV+JX, namely:
    Low/moderate pressure:
    Assume X and F are split (int/ptr on X side; FPU on F side);
    High pressure:
    Merge the spaces.

    Though, for XG3 vs RV+JX, it makes sense to use a lower threshold for
    XG3, and a higher threshold for RV+JX. This is because XG3 supports
    direct 6-bit register encodings, whereas RV+JX needs to use J21O prefixes.

    The split needs to be maintained for plain RV regardless of register
    pressure, since plain RV is incapable of handling non-default registers.


    Though, BGBCC can treat X5..X7 as "stomp registers" and can use them to
    fake instructions using registers across the divide (likewise for F0..F2
    on the FPR side). This kinda sucks, but was needed as a consequence of
    how BGBCC implemented its RV support.

    So, as far as BGBCC's register allocator is concerned, only X8..X31
    and F8..F31 actually exist (well, with the added wonk that they are
    internally remapped to their "equivalents" in the XG1/XG2 register space).

    Decided to leave off mapping tables, and maybe an eventual TODO is to
    redo the register allocator in a way that is "not stupid".


    Another way to approach this question is to look at the current
    champion of fixed instruction width, ARM A64, consider those
    instructions (and addressing modes) that ARM A64 has and RISC-V does
    not have, and look at how often they are used, and how many RISC-V instructions are needed to replace them.

    In any case, code density measurements show that both result in
    compact code, with RV64GC having more compact code, and actually
    having the most compact code among the architectures present in all
    rounds of my measurements where RV64GC was present.


    In my own testing, XG1 can beat RV-C, but they are "close enough".

    I am more in favor at this point of mostly avoiding 16-bit ops when
    possible, as they have the downside of negatively affecting performance
    in many cases (in ways that are inherently unavoidable).

    Except for cases where size optimization is important (like, say, in the
    Boot ROM), but then could just use RV-C as "good enough".


    In most other cases, slightly better code density at the expense of some
    performance isn't an ideal tradeoff. Like, for programs loaded into RAM,
    +/- some kB on the size of ".text" doesn't matter that much.



    And, if the goal is "shortest instruction count", 32/64/96 bit is a
    better bet. And, if performance is the goal, minimizing register use
    conflicts will also be a goal, which means prioritizing 64 or so GPRs,
    which can't really be used from a 16-bit encoding scheme.


    But code size is not everything. For ARM A64, you pay for it by the increased complexity of implementing these instructions (in particular
    the many register ports) and addressing modes. For bigger
    implementations, instruction combining means additional front-end
    effort for RISC-V, and then maybe similar implementation effort for
    the combined instructions as for ARM A64 (but more flexibility in
    selecting which instructions to combine). And, as mentioned above,
    the additional decoding effort.


    Yeah.

    A64 maybe has some issues in the other direction, as some of the
    addressing modes are more complicated than ideal.

    Things like ALU status flags aren't free either.

    ...



    If I were to try to rank addressing modes in terms of use/frequency
    (assuming all exist):
    1. [Rb+Disp]
    2. [Rb+Ri*ElemSizedScale]
    3. [Rb+Ri*1]
    4. (Rb)+ //"*ptr++"
    5. [Abs] //"*((T*)FIXEDADDR)"
    6. [Rb+Ri*Sc+Disp] //"obj->arr[idx]"
    7. -(Rb) //"*--ptr"
    8. +(Rb) and/or (Rb)- //"*++ptr" and "*ptr--"
    9. [Abs+Rb*ElemSizedScale]
    10. [Abs+Rb*1]

    If the "[Rb+Disp]" were subdivided, one would have, say:
    1. [SP+Disp] //Prolog/Epilog/Spill
    2. [Rb] //"*ptr"
    3. [GP+Disp] //Global Variable
    4. [Rb+Disp] //eg: "obj->field", etc
    5. [TP+Disp] //Context / TLS (much rarer)

    Where, in this case, [GP+Disp] has the main special property that Disp
    tends to be larger (at least in my compiler) than with most other
    registers (mostly because GP is used to access global variables).

    If GP+Disp were not used to access globals, this could shift to:
    [PC+Disp] //if using PC-rel for globals
    [Abs] //if using absolute addressing.

    So, a lot does depend on the compiler.
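
    As a concrete illustration, here is roughly where the common modes
    come from in C (a hypothetical function; the comments name the modes
    from the ranking above):

    struct Obj { int field; int arr[16]; };

    int examples(Obj *obj, int *ptr, int idx) {
        int a = obj->field;     // [Rb+Disp]
        int b = ptr[idx];       // [Rb+Ri*ElemSizeScale]
        int c = obj->arr[idx];  // [Rb+Ri*Sc+Disp]
        int d = *ptr++;         // (Rb)+, if the ISA has auto-increment
        return a + b + c + d;
    }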


    Even if supported, and usable by the compiler, auto-increment seems
    uncommon. The dominant way it was used (in both SH and BJX1) was to
    implement PUSH/POP. If using SP-rel instead, this use-pattern mostly evaporates.

    While still potentially used for things like "*ptr++" and similar, these
    tend to themselves be relatively infrequent if compared with all the
    other places load/store may appear.


    Active usage frequency:
    Boot
    Load , [Rb+Disp] : 50%
    Store, [Rb+Disp] : 40%
    Load , [Rb+Ri*ElemScale]: 7%
    Store, [Rb+Ri*ElemScale]: 1%
    Everything Else : 2%
    Doom
    Load , [Rb+Disp] : 12%
    Store, [Rb+Disp] : 11%
    Load , [Rb+Ri*ElemScale]: 67%
    Store, [Rb+Ri*ElemScale]: 9%
    Everything Else : 1%



    When we look at actual implementations, RISC-V has not reached the
    widths that ARM A64 has reached, but I guess that this is more due to
    the current potential markets for these two architectures than due to technical issues. RISC-V seems to be pushing into server space
    lately, so we may see wider implementations in the not-too-far future.


    Possibly.

    Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.

    Major limitations here being more:
    Things like register forwarding cost have non-linear scaling;
    For an in-order machine, usable ILP drops off very rapidly;
    ...

    There seems to be a local optimum between 2 and 3.


    Say, for example, if one had an in-order machine with 5 ALUs, one would
    be hard pressed to find much code that could actually make use of the 5
    ALUs. One can sorta make use of 3 ALUs, but even then, the 3rd lane is
    more often useful for spare register ports and similar (with 3-wide ALU
    being a minority case).

    Apart from the occasional highly unrolled and parallel integer code.

    ...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Dec 30 01:55:21 2025
    From Newsgroup: comp.arch

    On 12/29/2025 2:14 PM, Stefan Monnier wrote:
    I wonder if there have been other studies to explore other impacts
    such as run time, or cache miss rate.
    The difficulty there is standardising the input data, and normalising
    processor performance, memory bandwidth and latency, etc.

    I was thinking of those "compressed" variants of ISAs, such as Thumb,
    Thumb2, MIPS16e, microMIPS, or the "C" option of RISC-V, where you can compare with/without on the very same machine since all the half-size instructions are also available in full-size.


    Yep.


    Code segment size is much easier to measure.

    Yes, but!


    Code-size conflates several desirable properties:
    Space saving, reducing instruction counts, etc.

    But, in so doing, loses distinctiveness:
    Is the binary smaller due to a smaller number of bigger instructions, or
    a larger number of smaller instructions?...

    Smaller binary can be good, but a larger number of smaller instructions
    is less so.


    Like, say:
    Doom compiled to XG3 vs Doom compiled to SH-4...
    The size of ".text" isn't that much different.
    But, the SH-4 version has around 230% as many instructions.
    So, would perform significantly worse.


    Actually, it's kind of funny, the path I took (nearly a decade thus far):
    Started out with 16-bit instructions and 16 registers (and SH-4);
    Then went to 16/32 (BJX1-32);
    Then went 64-bit (at which, first attempt was a horrible mess);
    Then "simplified" it (BJX1-64C)
    (clean-ups and dropping stuff to free encoding space);
    Then creating a minimalist version of the 64-bit ISA (B64V)
    back to fixed-length 16-bit instructions;
    Then made a 32-bit version (B32V),
    and reworked the encoding (BTSR1);
    Then made it 64-bit again (BJX2);
    Re-added 32-bit instructions, but different this time;
    16-bit encodings, R0..R15, 32-bit R0..R31
    Then added 48 bit encodings;
    Added explicit parallelism (WEX) and Predication;
    Was encoded by 2 bits in each instruction;
    Then ended up making 32-bit encodings primary, rather than 16-bit;
    Then added jumbo prefixes, dropping the 48 bit encodings;
    Then added SIMD;
    Then started expanding to 64 GPRs (XGPR);
    Then added RISC-V decoder support;
    Created an ISA variant that goes fully 64 GPR (XG2),
    at expense of 16-bit ops;
    In basic cases, 32-bit encodings are common with its predecessor.
    But, some bit-twiddly dog-chew.
    Starts to note RISC-V and GCC are not a "silver bullet"
    Seemingly RV+GCC doing well on Dhrystone;
    Ported a few RV features to my ISA, to solidly regain perf lead.
    But, RV still has some merits, even if not the best perf.
    Makes my compiler target RISC-V as well;
    Experiments with some extensions, improving perf.
    Tried gluing a lot of features from my ISA onto RISC-V;
    Excluding predication, no real way to make this work as-is.
    Makes a new ISA variant that glues both ISAs together (XG3),
    in the same encoding space, sacrificing WEX.
    Predication can still be encoded, but demoted to optional.
    Depends on arch state that doesn't formally exist in RV.


    Then, say, XG3 is pretty much unrecognizable if compared with SH-4.

    Say:
    SH-4:
    16-bit instructions, 16 registers, 32-bit word size;
    XG3:
    32/64/96 bit instructions;
    64 registers;
    64-bit word size.

    Register Space:
    SH-4:
    R0..R3: Scratch
    R4..R7: Arg1..Arg4 / Scratch
    R8..R14: Callee Save
    R15: SP
    Prototypical instruction: zzzz-nnnn-mmmm-zzzz
    XG2:
    R0 / R1: Dedicated Stomp Regs
    R2 / R3: Scratch
    R4..R7: Arg1..Arg4 / Scratch
    R8..R14: Callee Save
    R15: SP
    R16..R19: Scratch
    R20..R23: Arg5..Arg8 / Scratch
    R24..R31: Callee Save
    R32..R35: Scratch
    R36..R39: Arg9..Arg12 / Scratch
    R40..R47: Callee Save
    R48..R51: Scratch
    R52..R55: Arg13..Arg16 / Scratch
    R56..R63: Callee Save
    Prototypical instruction: NMOP-xwxx-nnnn-mmmm,yyyy-qnmo-oooo-zzzz
    XG3 (and RV):
    R0: ZR / Zero
    R1: LR / RA
    R2: SP
    R3: GBR / GP
    R4: TP
    R5..R7: De-Facto Stomp
    R8/R9: Callee Save
    R10..R17: Arg1..Arg8 / Scratch
    R18..R27: Callee Save
    R28..R31: Scratch
    R32..R63 == F0..F31
    F0..F3: Stomp or Scratch (Stomp for RV)
    F4..F7: Scratch or Callee Save (ABI)
    F8/F9: Callee Save
    F10..F17: Scratch or Arg9..Arg16 (ABI)
    F18..F27: Callee Save
    F28..F31: Scratch
    Prototypical instruction:
    XG3: zzzzoooooommmmmmyyyynnnnnnqxxx10
    RV : zzzzzzzooooommmmmyyynnnnnxxxxx11


    The stomp regs are functionally scratch registers, but may not be used
    by the main part of the compiler to hold live values, as they are
    reserved for the assembler stage to be able to stomp them without
    warning when synthesizing pseudo-instructions.

    In the transition from SH-4 to what became BJX2, the functionality of MACL/MACH and PTEL/PTEH and similar was all folded into R0 and R1, which
    were given the names DLR and DHR.

    Also renamed PR to LR (basically the same as RV's RA).
    Functionally, PR/RA/LR are all treated as aliases for the same register.
    I used LR as pretty much everything calls it a "Link Register" so no
    obvious reason IMO to not call it LR.


    Despite looking very different, XG3 instructions are mostly the same as
    XG2 instructions but with a lot of the bits moved around and some other
    fairly modest tweaks to the decoding rules (mostly to make encoding
    rules appear consistent across instruction types). The new layout was
    also made to look more cohesive with RV's layout, even if the fields are
    in different places and different sizes.

    Can note that XG2 and XG3 have fewer opcode bits than RISC-V, but
    seemingly I didn't burn through encoding space at quite the same rate.

    Though, unlike RISC-V, I also have a big pile of 2R instructions (which
    use comparably less encoding space).

    ...



    Stefan

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Dec 30 07:36:44 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    [...]
    One usual downside is that to utilize a 16-bit ISA with a smaller
    register space, one needs to reuse registers more frequently, which then
    reduces ILP due to register conflicts. So, smaller code at the expense
    of worse performance.

    For designs like RISC-V C and Thumb2, there is always the option to
    use the uncompressed instruction. So you may tune your RISC-V
    compiler to prefer registers r8-r15 for those pseudo-registers that
    occur in instructions where such a register allocation may lead to a
    compressed encoding.

    Write-after-read and write-after-write do not reduce the IPC of OoO
    implementations. On the contrary, write-after-read may be beneficial
    by releasing the old physical register for the register name. And
    designing a compressed CPU instruction set for in-order processing is
    not a good idea for general-purpose computing.
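
    A minimal C-level sketch of the name dependences in question (the
    function and variable names are made up for illustration):

        /* reusing one variable creates WAR/WAW name dependences */
        int f(int a, int b, int c, int d) {
            int t, x, y;
            t = a + b;   /* first write of t */
            x = t * 2;   /* read of t */
            t = c + d;   /* WAR with the read above, WAW with the first
                            write; an OoO renamer gives this t a fresh
                            physical register, so the reuse does not
                            serialize execution */
            y = t * 3;
            return x + y;
        }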

    Things like ALU status flags aren't free either.

    Yes, they cost their own renaming resources.

    Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.

    Major limitations here being more:
    Things like register forwarding cost have non-linear scaling;
    For an in-order machine, usable ILP drops off very rapidly;
    ...

    ILP is a property of a program. I assume that what you mean is that
    the IPC benefits of more width have quickly diminishing returns on
    in-order machines.

    There seems to be a local optimum between 2 and 3.


    Say, for example, if one had an in-order machine with 5 ALUs, one would
    be hard pressed to find much code that could actually make use of the 5
    ALUs. One can sorta make use of 3 ALUs, but even then, the 3rd lane is
    more often useful for spare register ports and similar (with 3-wide ALU
    being a minority case)

    We have some interesting case studies: The Alpha 21164(a) and the ARM
    Cortex-A53 and A55. They are all in-order designs, their numbers of
    functional units are pretty similar, and, in particular, they all have
    2 integer ALUs. But the 21164 can decode and execute 4 instructions
    per cycle, while the Cortex-A53 and A55 are only two-wide. My guess
    is that this is due to the decoding cost of ARM A32/T32 and A64
    (decoders for two instruction sets, one of which has 16-bit and 32-bit instructions).

    The Cortex-A55 was succeeded by the A510, which is three-wide, and
    that was succeeded by the A520, which is three-wide with two ALUs and
    supports only ARM A64.

    Widening the A510, which still supports both instruction sets, is
    (weak) counterevidence for my theory about why the A53/A55 are only
    two-wide at decoding. The fact that the A520 returns to two integer
    ALUs indicates that the third integer ALU provides little IPC benefit
    in an in-order design.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Dec 30 08:30:08 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    My 66000 does not have a TSO memory system, but when one of these
    things shows up, it goes sequential consistency, and when it is done
    it flips back to causal consistency.

    TSO is cycle-wasteful.

    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Dec 30 10:01:11 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    BTW, when discussing ISA compactness, I usually see it measured by
    comparing the size of the code segment in typical executables.

    I understand that it's as good a measure as any and it's one that's
    fairly easily available, but at the same time it's not necessarily one
    that actually matters since I expect that it affects a usually fairly
    small proportion of the total ROM/Flash/disk space.

    I wonder if there have been other studies to explore other impacts such
    as run time, or cache miss rate.

    Optimizing compilers also play a large role. Very many
    optimizations increase code size. A few examples include loop
    unrolling, aligning, inlining and function cloning. (For those
    not familiar with the term: Assume you have a function that is
    called with constant arguments; the compiler might want to make a
    copy of that function with certain constant parameters fixed, if this
    can lead to simplification and thus further optimization of code
    on a hot path).
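
    A minimal sketch of function cloning in C (hand-written here the way a
    compiler would transform it; the names are made up):

        /* original: also called with k == 2 on a hot path */
        int scale(int x, int k) { return x * k; }

        /* clone specialized for k == 2; the multiply now simplifies,
           and further optimization (e.g. inlining) becomes easier */
        int scale_k2(int x) { return x * 2; }

        int hot_loop(const int *a, int n) {
            int s = 0;
            for (int i = 0; i < n; i++)
                s += scale_k2(a[i]);   /* was: scale(a[i], 2) */
            return s;
        }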

    The resulting code size increases might then cause problems with
    icache pressure, which is why some people report faster execution
    with lower optimization levels with some code.

    I'm not aware of any compiler which has an icache model, but this
    would also be very hard (presumably).

    The only fair comparison, I think, is with -Os.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Dec 30 09:32:05 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    TSO is cycle-wasteful.

    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf

    "All swans are white" has been "experimentally verified" by finding one
    white swan. The existence of black swans shows that such
    "experimental verifications" are fallacies.

    Actually the existence of a weak-memory-model mode on specific
    hardware makes it very likely that TSO is slower than the weak model
    on that hardware. If TSO was implemented to provide the same speed,
    there would be no need for also providing a weaker memory ordering
    mode on that hardware.

    Similarly, the introduction of the TRAPB instruction on the Alpha, and
    the fact that using it through -mieee-fp slows down execution on the 21064-21164A could be construed as "experimental verification" of the
    claims "IEEE FP (in particular denormal numbers) is cycle-wasteful"
    and "precise FP exceptions are cycle-wasteful". Then the black swan
    (21264) appeared, where TRAPB is a noop, and it outperformed all
    earlier Alphas.

    Note that "some Nvidia and Fujitsu [ARM architecture] implementations
    run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
    that the Fujitsu implementations of the ARM architecture are used in supercomputers, it is unlikely that their TSO implementation is
    cycle-wasteful.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Dec 30 12:09:26 2025
    From Newsgroup: comp.arch

    On Tue, 30 Dec 2025 07:36:44 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    BGB <cr88192@gmail.com> writes:
    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    [...]
    One usual downside is that to utilize a 16-bit ISA with a smaller
    register space, one needs to reuse registers more frequently, which
    then reduces ILP due to register conflicts. So, smaller code at the
    expense of worse performance.

    For designs like RISC-V C and Thumb2, there is always the option to
    use the uncompressed instruction. So you may tune your RISC-V
    compiler to prefer registers r8-r15 for those pseudo-registers that
    occur in instructions where such a register allocation may lead to a compressed encoding.

    Write-after-read and write-after-write do not reduce the IPC of OoO
    implementations. On the contrary, write-after-read may be beneficial
    by releasing the old physical register for the register name. And
    designing a compressed CPU instruction set for in-order processing is
    not a good idea for general-purpose computing.

    Things like ALU status flags aren't free either.

    Yes, they cost their own renaming resources.

    Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.

    Major limitations here being more:
    Things like register forwarding cost have non-linear scaling;
    For an in-order machine, usable ILP drops off very rapidly;
    ...

    ILP is a property of a program. I assume that what you mean is that
    the IPC benefits of more width have quickly diminishing returns on
    in-order machines.

    There seems to be a local optimum between 2 and 3.


    Say, for example, if one had an in-order machine with 5 ALUs, one
    would be hard pressed to find much code that could actually make use
    of the 5 ALUs. One can sorta make use of 3 ALUs, but even then, the
    3rd lane is more often useful for spare register ports and similar
    (with 3-wide ALU being a minority case)

    We have some interesting case studies: The Alpha 21164(a) and the ARM
    Cortex-A53 and A55. They are all in-order designs, their numbers of
    functional units are pretty similar, and, in particular, they all have
    2 integer ALUs. But the 21164 can decode and execute 4 instructions
    per cycle, while the Cortex-A53 and A55 are only two-wide. My guess
    is that this is due to the decoding cost of ARM A32/T32 and A64
    (decoders for two instruction sets, one of which has 16-bit and 32-bit instructions).

    The Cortex-A55 was succeeded by the A510, which is three-wide, and
    that was succeeded by the A520, which is three-wide with two ALUs and supports only ARM A64.

    Widening the A510, which still supports both instruction sets, is
    (weak) counterevidence for my theory about why A53/A55 are only
    two-wide at decoding. The fact that the A520 returns to two integer
    ALUs indicates that the third integer ALU provides little IPC benefit
    in an in-order design.

    - anton


    Do you happen to have benchmarks that compare performance of Alpha EV5
    vs in-order Cortex-A ?




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Dec 30 13:05:22 2025
    From Newsgroup: comp.arch

    On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    My 66000 does not have a TSO memory system, but when one of these
    things shows up, it goes sequential consistency, and when it is done
    it flips back to causal consistency.

    TSO is cycle-wasteful.

    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf


    WOW, they wrote a 7-page article without once mentioning
    avoidance of RFO (read for ownership), which is the elephant in the room
    of any discussion of the advantages of the Arm MOM/MCM over TSO.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Dec 30 13:16:55 2025
    From Newsgroup: comp.arch

    On Tue, 30 Dec 2025 09:32:05 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    TSO is cycle-wasteful.

    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf


    "All swans are white" has been "experimentally verified" by finding
    one white swan. The existence of black swans shows that such
    "experimental verifications" are fallacies.

    Actually the existence of a weak-memory-model mode on specific
    hardware makes it very likely that TSO is slower than the weak model
    on that hardware. If TSO was implemented to provide the same speed,
    there would be no need for also providing a weaker memory ordering
    mode on that hardware.

    Similarly, the introduction of the TRAPB instruction on the Alpha, and
    the fact that using it through -mieee-fp slows down execution on the 21064-21164A could be construed as "experimental verification" of the
    claims "IEEE FP (in particular denormal numbers) is cycle-wasteful"
    and "precise FP exceptions are cycle-wasteful". Then the black swan
    (21264) appeared, where TRAPB is a noop, and it outperformed all
    earlier Alphas.

    Note that "some Nvidia and Fujitsu [ARM architecture] implementations
    run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
    that the Fujitsu implementations of the ARM architecture are used in supercomputers, it is unlikely that their TSO implementation is cycle-wasteful.

    - anton

    Even if I agree with parts of what you are saying, the NVIDIA and Fujitsu
    ARM implementations are not good counter-examples.
    Their per-core performance, esp. scalar per-core performance, is not in
    the same league as the top players, like Apple and Qualcomm. More so, it's
    not even in the same league as the second tier, like the big cores of ARM
    Inc., and most likely not even in the same league as ARM Inc.'s middle
    cores. In particular, in the case of Fujitsu, all levels of the memory
    hierarchy have high latency, not just when measured in nsec but, for the
    inner levels, even when measured in clocks.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Dec 30 11:13:37 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    Do you happen to have benchmarks that compare performance of Alpha EV5
    vs in-order Cortex-A ?

    LaTeX benchmark results (lower is faster)

    Alpha:
    - 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5) 8.1
    ARM A32/T32:
    - Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8 5.46
    ARM A64:
    - Rockpro64 (1416MHz Cortex A53) Debian 9 (Stretch) 3.24
    - Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
    - Odroid C2 (1536MHz Cortex A53) Ubuntu 16.04 2.32
    - Rock 5B (1805MHz A55) Debian 11 (texlive-latex-recommended) 2.105

    A problem with the LaTeX benchmark is that its performance is
    significantly influenced by the LaTeX installation (newer versions
    need more instructions, and having more packages needs more
    instructions). But these are the only benchmark results I have.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Dec 30 12:59:38 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    TSO is cycle-wasteful.

    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf

    "All swans are white" has been "experimentally verified" by finding one
    white swan. The existence of black swans shows that such
    "experimental verifications" are fallacies.

    If you have any counter-examples, please feel free to cite them.

    Actually the existence of a weak-memory-model mode on specific
    hardware makes it very likely that TSO is slower than the weak model
    on that hardware. If TSO was implemented to provide the same speed,
    there would be no need for also providing a weaker memory ordering
    mode on that hardware.

    That makes little sense. TSO fulfills all the requirements of the
    ARM memory model, and adds some on top. It is possible to create
    an ARM CPU which uses TSO, as you wrote below. If that had
    the same performance as a CPU running on the pure ARM memory model,
    there would be no reason to implement the ARM memory model at all.

    So, two alternatives:

    a) Apple engineers did not have the first clue what they were doing
    b) Apple engineers knew what they were doing

    Given that Apple silicon seems to be competently done, I personally
    think that option b) is the better one. You obviously prefer
    option a).

    Similarly, the introduction of the TRAPB instruction on the Alpha, and
    the fact that using it through -mieee-fp slows down execution on the 21064-21164A could be construed as "experimental verification" of the
    claims "IEEE FP (in particular denormal numbers) is cycle-wasteful"
    and "precise FP exceptions are cycle-wasteful". Then the black swan
    (21264) appeared, where TRAPB is a noop, and it outperformed all
    earlier Alphas.

    The particular case in question was a comparison of two modes of
    operation on the same hardware. So, if we were to compare hardware
    float and soft float on the 21264, that would be equivalent to the
    point above.

    Note that "some Nvidia and Fujitsu [ARM architecture] implementations
    run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
    that the Fujitsu implementations of the ARM architecture are used in supercomputers, it is unlikely that their TSO implementation is cycle-wasteful.

    So, you're saying that Apple is clueless, and that Nvidia and Fujitsu
    make that decision solely on the basis of speed? If you have any
    source for that, apart from your own guesses, I would be interested
    to know what it is. (Preferring TSO can also have other reasons,
    such as having software which depends on it).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Dec 30 17:41:27 2025
    From Newsgroup: comp.arch

    On Tue, 30 Dec 2025 11:13:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    Do you happen to have benchmarks that compare performance of Alpha
    EV5 vs in-order Cortex-A ?

    LaTeX benchmark results (lower is faster)

    Alpha:
    - 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5) 8.1
    ARM A32/T32:
    - Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8 5.46
    ARM A64:
    - Rockpro64 (1416MHz Cortex A53) Debian 9 (Stretch) 3.24
    - Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
    - Odroid C2 (1536MHz Cortex A53) Ubuntu 16.04 2.32
    - Rock 5B (1805MHz A55) Debian 11 (texlive-latex-recommended) 2.105

    A problem with the LaTeX benchmark is that its performance is
    significantly influenced by the LaTeX installation (newer versions
    need more instructions, and having more packages needs more
    instructions). But these are the only benchmark results I have.

    - anton

    Thank you.
    Two 64-bit A53 results are about the same as the EV5 clock for clock, and
    one result is significantly better.
    So, either wide in-order is indeed not a bright idea, or the 21164 suffers
    because of the inferiority of the Alpha ISA relative to ARM64.

    BTW, the Odroid C2 score appears suspiciously good. Could it be that the
    turbo clock frequency was much higher than reported?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Dec 30 10:44:10 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    My 66000 does not have a TSO memory system, but when one of these
    things shows up, it goes sequential consistency, and when it is done
    it flips back to causal consistency.

    TSO is cycle-wasteful.
    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf


    WOW, they wrote a 7-page article without once mentioning
    avoidance of RFO (read for ownership), which is the elephant in the room
    of any discussion of the advantages of the Arm MOM/MCM over TSO.

    What is this "avoidance of RFO"?
    I can find no mention of it anywhere.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Dec 30 18:05:40 2025
    From Newsgroup: comp.arch

    On Tue, 30 Dec 2025 10:44:10 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    My 66000 does not have a TSO memory system, but when one of these
    things shows up, it goes sequential consistency, and when it is
    done it flips back to causal consistency.

    TSO is cycle-wasteful.
    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf


    WOW, they wrote a 7-page article without once mentioning
    avoidance of RFO (read for ownership), which is the elephant in the
    room of any discussion of the advantages of the Arm MOM/MCM over TSO.

    What is this "avoidance of RFO"?
    I can find no mention of it anywhere.

    Imagine code that overwrites a whole cache line that the core initially
    did not own. Under TSO rules, like x86, the only [non heroic] ways to
    overwrite the line without reading its previous content (which could
    easily mean reading from DRAM) are
    - aligned AVX512 store
    - rep movs/rep stos

    With a weaker MOM the core has the option of delayed merging of multiple
    narrow stores. I think that even relatively old ARM cores, like Neoverse
    N1, are able to do it.

    I can imagine a heroic microarchitecture that achieves the same effect
    with TSO, but it seems that so far nobody has done it.
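
    A minimal C sketch of the pattern in question (assuming 64-byte lines;
    the struct and function are made up for illustration):

        #include <stdint.h>

        struct line { uint64_t w[8]; };   /* one 64-byte cache line */

        /* the whole line is overwritten, yet under TSO the first store
           triggers an RFO that fetches the old data only to discard it;
           a weaker model may merge the eight stores in a write buffer
           and claim the line with an invalidate, never reading it */
        void init_line(struct line *p) {
            for (int i = 0; i < 8; i++)
                p->w[i] = 0;
        }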





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Dec 30 17:15:59 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    Imagine code that overwrites the whole cache line that core initially
    did not own. Under TSO rules, like x86, the only [non heroic] ways to
    overwrite the line without reading its previous content (which could
    easily mean reading from DRAM) are
    - aligned AVX512 store
    - rep movs/rep stos

    Any sequence of stores without intervening loads can be turned into
    one store under sequential consistency, and therefore also under the
    weaker TSO. Doing that for a sequence that stores into one cache line
    does not appear particularly heroic to me. The question is how much
    benefit one gets from this optimization.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Dec 30 12:59:28 2025
    From Newsgroup: comp.arch

    Anton Ertl [2025-12-30 17:15:59] wrote:
    Michael S <already5chosen@yahoo.com> writes:
    Imagine code that overwrites the whole cache line that core initially
    did not own. Under TSO rules, like x86, the only [non heroic] ways to
    overwrite the line without reading its previous content (which could
    easily mean reading from DRAM) are
    - aligned AVX512 store
    - rep movs/rep stos
    Any sequence of stores without intervening loads can be turned into
    one store under sequential consistency, and therefore also under the
    weaker TSO. Doing that for a sequence that stores into one cache line
    does not appear particularly heroic to me. The question is how much
    benefit one gets from this optimization.

    But the stores may be interleaved with loads from other locations!
    It's quite common to have a situation where a sequence of stores
    initializes a new object and thus overwrites a complete cache line, but
    that initialization sequence needs to read from memory (e.g. from the
    stack).

    Maybe compilers can be taught to group such writes to try and avoid
    the problem?


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Dec 30 18:00:28 2025
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 12/28/2025 4:41 PM, BGB wrote:
    [...]

    Also, if using something like LOCK CMPXCHG you MUST make sure to align
    and pad your relevant data structures to an L2 cache line.

    That may not be necessary if there is otherwise no false sharing in
    the same cache line. Yes, the operand should be naturally aligned
    (which ensures it is entirely contained within a single cache line),
    but there's no reason that other data cannot be stored in the same
    cache line, so long as it is unlikely to be accessed by a competing
    thread.
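
    A minimal C11 sketch of both arrangements (assuming 64-byte lines; the
    struct names are made up):

        #include <stdalign.h>
        #include <stdatomic.h>

        /* heavily contended operand: give it a line of its own */
        struct hot { alignas(64) atomic_long n; };

        /* naturally aligned operand sharing its line with data that no
           competing thread touches -- legitimate, as noted above */
        struct mixed {
            alignas(64) atomic_long n;   /* CAS/XADD target */
            long owner_private[7];       /* same line, one thread only */
        };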

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Dec 30 17:27:22 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    TSO is cycle-wasteful.

    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf

    "All swans are white" has been "experimentally verified" by finding one
    white swan. The existence of black swans shows that such
    "experimental verifications" are fallacies.

    If you have any counter-examples, please feel free to cite them.

    The fact that nobody had counterexamples for the theory "All swans are
    white" for a long while did not make that theory true.

    But I have actually seen black swans in Australia and elsewhere.
    Here's a picture <https://en.wikipedia.org/wiki/File:Black_Swan_in_Flight_Crop.jpg>

    Actually the existence of a weak-memory-model mode on specific
    hardware makes it very likely that TSO is slower than the weak model
    on that hardware. If TSO was implemented to provide the same speed,
    there would be no need for also providing a weaker memory ordering
    mode on that hardware.

    That makes little sense. TSO fulfills all the requirements of the
    ARM memory model, and adds some on top. It is possible to create
    an ARM CPU which uses TSO, as you wrote below. If that had
    the same performance as a CPU running on the pure ARM memory model,
    there would be no reason to implement the ARM memory model at all.

    Exactly. So you will only have a TSO mode and a weak mode if, for
    your particular implementation, the weak mode provides a performance
    advantage over the TSO mode. So it is unlikely that you will ever see an
    implementation with such a mode bit where the TSO mode has the same
    performance, unless the mode bit is there only for backwards
    compatibility and actually has no effect (which is what TRAPB turned
    into on the 21264).

    We have ARM implementations with TSO and without a mode bit. Do you
    really need to add a mode bit to them that has no effect and therefore
    produces 0% difference in performance to accept them as a
    counterexample?

    So, two alternatives:

    a) Apple engineers did not have the first clue what they were doing
    b) Apple engineers knew what they were doing

    My theory is that Apple engineers first implemented the weak model,
    because that's what the specification says, and they were not tasked
    to implement something better; there is enough propaganda for weak
    memory models around that a hardware engineer might think that that's
    the way to go. Later (for the M1) the Apple hardware designers were
    asked to implement TSO, and they did not redo the whole memory model
    from the ground up, but by doing relatively small changes. And of
    course the result is that their TSO mode is slower than their weak
    mode, just as -mieee-float on a 21164 is slower than code compiled
    without that mode.

    Given that Apple silicon seems to be competently done

    The 21164 was the fastest CPU during some of its days. And yet the
    21264 was faster without needing TRAPB.

    I personally
    think that option b) is the better one. You obviously prefer
    option a).

    An absurdity typical of you.

    Note that "some Nvidia and Fujitsu [ARM architecture] implementations
    run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
    that the Fujitsu implementations of the ARM architecture are used in
    supercomputers, it is unlikely that their TSO implementation is
    cycle-wasteful.

    So, you're saying that Apple is clueless, and that Nvidia and Fujitsu
    make that decision solely on the basis of speed?

    I did not write anything about the clue of Apple. I don't know much
    about the CPUs by Nvidia and Fujitsu. But if there was significant
    performance to be had by adding a weakly-ordered mode, wouldn't
    especially Fujitsu with its supercomputer target have done it?

    If hardware designers get tasked with implementing TSO (or sequential consistency) with modern transistor budgets, they hopefully come up
    with different solutions to various problems than if they are tasked
    to do a weak model with a slow TSO option.

    And the example of the 21264 shows that a solution without a kludgy
    interface (TRAPB in case of precise exceptions; for weak memory models
    the special memory access instructions and memory barriers/fences are
    the kludgy interface) can perform better. The 21264 does have a
    higher hardware cost than the 21164, however.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Dec 30 13:28:16 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 30 Dec 2025 10:44:10 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    My 66000 does not have a TSO memory system, but when one of these
    things shows up, it goes sequential consistency, and when it is
    done it flips back to causal consistency.

    TSO is cycle-wasteful.
    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf

    WOW, they wrote a 7-page article without once mentioning
    avoidance of RFO (read for ownership), which is the elephant in the
    room of any discussion of the advantages of the Arm MOM/MCM over TSO.
    What is this "avoidance of RFO"?
    I can find no mention of it anywhere.

    Imagine code that overwrites the whole cache line that core initially
    did not own. Under TSO rules, like x86, the only [non heroic] ways to
    overwrite the line without reading its previous content (which could
    easily mean reading from DRAM) are
    - aligned AVX512 store
    - rep movs/rep stos

    With weaker MOM the core has option of delayed merging of multiple
    narrow stores. I think that even relatively old ARM cores, like Neoverse
    N1, are able to do it.

    I can imagine heroic microarchitecture that achieves the same effect
    with TSO, but it seems that so far nobody did it.

    I don't see how a ReadForOwnership message can be avoided as it
    transfers two things: the ownership state, and the current line data.
    Even if the core knows the whole cache line is being overwritten and
    doesn't need the line data, it still needs the Owned state transfer.
    There would still be a request message, say TakeOwner (TKO), which
    has a smaller GiveOwner (GVO) reply message that just moves the state.
    So the reply is a few flits shorter.

    As I understand it...
    Independent of the ReadForOwnership message, the ARM weak coherence
    model should allow stores to other cache lines to proceed, whereas TSO
    would require younger stores to (appear to) wait until the older store
    completes.
    The weak coherence model allows the cache to use hit-under-miss for
    stores because it doesn't require stores to different locations to
    become visible in program order. This allows it to overlap younger store
    cache hits with the older ReadForOwnership message, not eliminate it.
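
    A minimal C11 sketch of the visibility difference (a made-up
    producer/consumer example, reasoning at the hardware level and ignoring
    C11's formal data-race rules):

        #include <stdatomic.h>

        int data;
        atomic_int ready;

        void producer(void) {
            data = 42;
            /* under TSO the two stores become visible in program order
               even with a relaxed store; a weak model would need
               memory_order_release here */
            atomic_store_explicit(&ready, 1, memory_order_relaxed);
        }

        int consumer(void) {
            /* a weak model would need memory_order_acquire here */
            while (!atomic_load_explicit(&ready, memory_order_relaxed))
                ;
            return data;   /* 42 under TSO; possibly stale under weak order */
        }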


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Dec 30 18:27:14 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 30 Dec 2025 11:13:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    Do you happen to have benchmarks that compare performance of Alpha
    EV5 vs in-order Cortex-A ?

    LaTeX benchmark results (lower is faster)

    Alpha:
    - 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5) 8.1
    ARM A32/T32:
    - Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8 5.46
    ARM A64:
    - Rockpro64 (1416MHz Cortex A53) Debian 9 (Stretch) 3.24
    - Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
    - Odroid C2 (1536MHz Cortex A53) Ubuntu 16.04 2.32
    - Rock 5B (1805MHz A55) Debian 11 (texlive-latex-recommended) 2.105

    A problem with the LaTeX benchmark is that its performance is
    significantly influenced by the LaTeX installation (newer versions
    need more instructions, and having more packages needs more
    instructions). But these are the only benchmark results I have.

    - anton

    Thank you.
    Two 64-bit A53 results are about the same as the EV5 clock for clock, and
    one result is significantly better.
    So, either wide in-order is indeed not a bright idea, or the 21164 suffers
    because of the inferiority of the Alpha ISA relative to ARM64.

    Hard to tell from these results. In addition to the problems
    mentioned above there are also differences in cache configuration to
    consider. And the A55 does quite a bit better in IPC than the A53,
    although it superficially has the same resources.

    The benchmark may be worse for the Alpha than many other benchmarks
    because LaTeX probably uses many byte accesses, but the used binary
    does not use BWX instructions (the 21164A has them, but there has been
    no Redhat for Alpha with BWX). This can be seen as an inferiority of
    the Alpha ISA relative to ARM A64.

    I think all these problems make these results too fuzzy to draw any
    conclusions about the question of interest. You would have to run a
    larger set of more well-defined benchmarks to answer this question.

    But I think that one can say that the performance-per-clock is roughly
    similar between the 21164 and the Cortex-A53.

    BTW, the Odroid C2 score appears suspiciously good. Could it be that the
    turbo clock frequency was much higher than reported?

    No. The cycles and GHz numbers come out of perf. And the Odroid C2
    does not reach higher clocks without overclocking, and does not reach
    the promised 2GHz even with overclocking, as the manufacturer had to
    admit.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Dec 30 13:10:11 2025
    From Newsgroup: comp.arch

    On 12/29/2025 1:55 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 12/28/2025 5:53 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/28/2025 2:04 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
    On 12/21/2025 1:21 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side) one needs MOV {Rm,Rn,Rj},{R1..R3}

    For loop iterations one needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken
    here...

    when I read this, I thought that there was a standard technique for
    doing stuff like that in a GBOoO machine.

    There is::: it is called "load 'em up, pass 'em through". That is no
    different than any other calculation, except that no mangling of the
    bits is going on.

    Just break down all the fancy instructions into RISC-style pseudo-ops.
    But apparently, since you would know all about that, there must be a
    reason why it doesn't apply in these cases.

    x86 has short/small MOV instructions. Not so with RISCs.

    Does your EMS use a so-called LOCK MOV? For some damn reason I remember
    something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc.

    The 2-operand+displacement LD/STs have a lock bit in the instruction--
    that is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.

    Oh, and it's ESM not EMS: Exotic Synchronization Method.

    In order to get ATOMIC-ADD-to-Memory, I will need an Instruction-Modifier
    {A.K.A. a prefix}.

    Thanks for the clarification.

    On x86/x64 LOCK XADD is a loopless wait free operation.

    I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
    impl. If we are on another system and that LOCK XADD is some sort of
    LL/SC-"style" loop, well, that causes damage to my loopless claim... ;^o
    So, can your system get wait free semantics for RMW atomics?

    A::

    ATOMIC-to-Memory-size [address]
    ADD Rd,--,#1

    Will attempt an ATOMIC add to the L1 cache. If the line is writeable,
    the ADD is performed and the line updated. Otherwise, the Add-to-memory
    #1 is shipped out over the memory hierarchy. When the operation runs
    into a cache containing [address] in the writeable state, the add is
    performed and the previous value returned. If [address] is not
    writeable, the cache line is invalidated and the search continues
    outward. {This protocol depends on writeable implying {exclusive or
    modified}, which is typical.}

    When [address] reaches the Memory-Controller it is scheduled in arrival
    order, other caches system-wide will receive CI, and modified lines
    will be pushed back to the DRAM-Controller. When the CI is "performed",
    the MC/DRC will perform the add of #1 to [address] and the previous
    value is returned as its result.

    {{That is the ADD is performed where the data is found in the
    memory hierarchy, and the previous value is returned as result;
    with all cache-effects and coherence considered.}}

    A HW guy would not call this wait free--since the CPU is waiting
    until all the nuances get sorted out, but SW will consider this
    wait free since SW does not see the waiting time unless it uses
    a high precision timer to measure delay.

    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)

    If you do it LL/SC-style you HAVE to bring data to "this" particular
    CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under
    contention. So you DON'T DO IT LIKE THAT.

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not
    Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    IMHO:
    No-Cache + CAS is probably a better bet than LL/SC (see the sketch
    after this list);
    LL/SC: Depends on the existence of explicit memory-coherency features.
    No-Cache + CAS: Can be made to work independent of the underlying memory
    model.
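
    A minimal sketch of the CAS route as portable C11 (whether this maps to
    a LOCK XADD, an LL/SC loop, or a No-Cache CAS is up to the target):

        #include <stdatomic.h>

        /* fetch-and-add built from CAS; on failure the CAS refreshes
           'old', so the loop retries with the current value */
        long fetch_add(atomic_long *p, long v) {
            long old = atomic_load_explicit(p, memory_order_relaxed);
            while (!atomic_compare_exchange_weak(p, &old, old + v))
                ;
            return old;
        }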

    Granted, No-Cache is its own feature:
    Need some way to indicate to the L1 cache that special handling is
    needed for this memory access and cache line (that it should not use a
    previously cached value and should be flushed immediately once the
    operation completes).


    But, No-Cache behavior is much easier to fake on a TSO capable memory
    subsystem, than it is to accurately fake LL/SC on top of weak-model
    write-back caches.

    My 66000 does not have a TSO memory system, but when one of these
    things shows up, it goes sequential consistency, and when it is done
    it flips back to causal consistency.

    TSO is cycle-wasteful.

    But, yeah, was not arguing for using TSO here, rather noting that if one
    has it, then No-Cache can be ignored for CAS.


    But, then again, weak model is cheaper to implement and generally
    faster, although explicit synchronization is annoying and such a model
    is incompatible with "lock free data structures" (which tend to
    implicitly assume that memory accesses occur in the same order as
    written and that any memory stores are immediately visible across threads).


    But, then again, one is left with one of several options:
    Ask that people use a mutex whenever accessing any resource that may be
    modified between threads and where such modifications are functionally
    important;
    Or, alternatively, use a message passing scheme, where message passing
    can potentially be done more cheaply than a mutex (in principle it can
    be done using No-Cache memory rather than needing an L1 flush).

    Well, or use write-through caching, which would mostly have the merit of allowing for cheaper cache flushes (at the expense of being slower in
    general than write-back caching).


    In the case of RISC-V, there are the FENCE and FENCE.I instructions.
    In my implementation, they are user-land only, and need to be
    implemented as, say:
    FENCE traps, and then performs an L1 flush.
    FENCE.I traps, and then performs an L1 flush,
    then also flushes the I$.

    There is CBO, which allows:
    CBO.FLUSH Reg
    Where, Reg gives the address of a cache line, which is then flushed from
    the L1 D$. This at least mapped over.

    There were CBO.INV and CBO.CLEAN; these did not map over exactly, and in
    my case have been implemented as equivalent to FLUSH. INV could be
    theoretically distinct in that it would discard unwritten cache lines,
    but this wasn't really a thing IME, and making it equivalent to flush
    sort of works in a "how do you prove that a preceding store wouldn't
    have been written to RAM before reaching the INV?" sense. CLEAN is sort
    of a "reload the address from RAM, ignoring whether the cache line was
    dirty", which also was not a thing in my case.

    For whatever reason, FENCE and FENCE.I were given full Imm12 encodings,
    which is horribly wasteful. They were then defined as Rd=Rs1=Imm=0.
    Decided that, for FENCE.I, Rd=0, Rs1!=0, Imm12 behaves
    like the CBO ops (so, effectively allowing for a "CBO.FLUSH.I Reg" to
    flush a line in the I$).


    Ended up adding an INVTLB instruction, but had put this in as a
    non-standard instruction in the same general area used for ECALL/EBREAK/RTE/etc.

    Potentially, could make sense to add an "INVTLB Reg" instruction for "Invalidate TLBE associated with this page rather than the whole TLB".
    Had mostly dealt with this case by forging a dummy TLBE for this address
    and then doing multiple LDTLBs in a row, which is kind of a crap way to
    do it; but this is infrequent (mostly needed when a page has been
    modified in the page table, and the handler code suspects the page is
    still in the TLB).


    Currently, does involve the questionable thing of maintaining a mock-up
    of the TLB in RAM that code can use to check this stuff, but this is not ideal. Unlike on the SH though, there is no MMIO mechanism to access the
    TLB, nor any way for the CPU to inspect its contents directly. So, errm, software mock-up it is. This does constrain the types of behaviors
    allowed by the MMU though.

    Adding the LLPTLB idea would complicate things, as any entry where the page-table in question is present in the LLPTLB may need to be assumed
    as potentially in the TLB (or, in effect, needing to evict the page
    regardless of whether it is in the software view of the TLB). Though,
    maybe a naive "always assume page is potentially in the TLB" is the
    generally safer option here.

    ...


    If the memory system implements TSO or similar, then one can simply
    ignore the No-Cache behavior and achieve the same effect.

    ..



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Dec 30 14:12:32 2025
    From Newsgroup: comp.arch

    John Savard wrote:
    On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
    On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:

    Or in other words, if you can decode K-instructions per cycle, you'd
    better be able to execute K-instructions per cycle--or you have a
    serious blockage in your pipeline.

    Not a typo--the part of the pipeline which is <dynamically> narrowest is
    the part that limits performance. I suggest strongly that you should not
    make/allow the decoder to play that part.

    I agree - and strongly, too - that the decoder ought not to be the part
    that limits performance.

    But what I quoted says that the execution unit ought not to be the part
    that limits performance, with the implication that it's OK if the decoder does instead. That's why I said it must be a typo.

    So I think you need to look a second time at what you wrote; it's natural for people to see what they expect to see, and so I think you looked at
    it, and didn't see the typo that was there.

    John Savard

    There are two kinds of stalls:
    stalls in the serial front end I-cache, Fetch or Decode stages because
    of *too little work* (starvation due to input latency),
    and stalls in the back end Execute or Writeback stages because
    of *too much work* (resource exhaustion).

    The front end stalls inject bubbles into the pipeline,
    whereas back end stalls can allow younger bubbles to be compressed out.
    If I have to stall, I want it in the back end.

    It has to do with catching up after a stall.
    If a core stalls for 3 clocks, then in order to average 1 IPC
    it must retire 2 instructions per clock for the next 3 clocks.
    And it can only do that if it has a backlog of work ready to execute.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 30 20:03:46 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 30 Dec 2025 10:44:10 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    My 66000 does not have a TSO memory system, but when one of these
    things shows up, it goes sequential consistency, and when it is
    done it flips back to causal consistency.

    TSO is cycle-wasteful.
    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf


    WOW, they wrote a 7-page article without once mentioning
    avoidance of RFO (read for ownership), which is the elephant in the
    room of any discussion of the advantages of the Arm MOM/MCM over TSO.

    What is this "avoidance of RFO"?
    I can find no mention of it anywhere.

    Imagine code that overwrites the whole cache line that core initially
    did not own. Under TSO rules, like x86, the only [non heroic] ways to
    overwrite the line without reading its previous content (which could
    easily mean reading from DRAM) are
    - aligned AVX512 store
    - rep movs/rep stos

    In My 66000, the MM and MS instructions are allowed to CI and immediately
    start overwriting a cache line when the entire cache line is going to be
    overwritten. The core does not have to wait for the line to become
    resident and writeable.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 30 20:08:41 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Michael S wrote:
    On Tue, 30 Dec 2025 10:44:10 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    My 66000 does not have a TSO memory system, but when one of these
    things shows up, it goes sequential consistency, and when it is
    done it flips back to causal consistency.

    TSO is cycle-wasteful.
    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
    WOW, they wrote article of 7 pages without even one time mentioning
    avoidance of RFO (read for ownership) which is an elephant in the
    room of discussion of advantages of Arm MOM/MCM over TSO.
    What is this "avoidance of RFO"?
    I can find no mention of it anywhere.

    Imagine code that overwrites the whole cache line that core initially
    did not own. Under TSO rules, like x86, the only [non heroic] ways to
    overwrite the line without reading its previous content (which could
    easily mean reading from DRAM) are
    - aligned AVX512 store
    - rep movs/rep stos

    With weaker MOM the core has option of delayed merging of multiple
    narrow stores. I think that even relatively old ARM cores, like Neoverse N1, are able to do it.

    I can imagine heroic microarchitecture that achieves the same effect
    with TSO, but it seems that so far nobody did it.

    I don't see how a ReadForOwnership message can be avoided as it
    transfers two things: the ownership state, and the current line data.

    InvalidateForOwnership {i.e., CI followed by Allocate Cache line and
    start writing}

    Even if the core knows the whole cache line is being overwritten and
    doesn't need the line data, it still needs the Owned state transfer.

    Which it can get by telling everyone else to lose that cache line.

    There would still be a request message, say TakeOwner TKO which
    has a smaller reply GiveOwner GVO message and just moves the state.
    So the reply is a few less flits.

    As I understand it...
    Independent of the ReadForOwnership message, the ARM weak coherence model should allow stores to other cache lines to proceed, whereas TSO would require younger stores to (appear to) wait until the older store completes.

    Which is why TSO is cycle wasteful.

    The weak coherence model allows the cache to use hit-under-miss for
    stores because it doesn't require the store order to different locations
    be seen in program order. This allows it to overlap younger store cache
    hits with the older ReadForOwnership message, not eliminate it.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 30 20:15:00 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 12/29/2025 1:55 PM, MitchAlsup wrote:
    ----merciful snip--------------
    My 66000 does not have a TSO memory system, but when one of these
    things shows up, it goes sequential consistency, and when it is done
    it flips back to causal consistency.

    TSO is cycle-wasteful.

    But, yeah, was not arguing for using TSO here, rather noting that if one
    has it, then No-Cache can be ignored for CAS.


    But, then again, weak model is cheaper to implement and generally
    faster, although explicit synchronization is annoying and such a model

    There is also implicit synchronization in My 66000. That is, whenever the
    core senses that synchronization is occurring, the weak model drops back
    to sequential consistency--without any instruction being executed to make
    the model switch. Then, after synchronization is over, the core switches
    back, also without any instruction being executed.

    is incompatible with "lock free data structures" (which tend to
    implicitly assume that memory accesses occur in the same order as
    written and that any memory stores are immediately visible across threads).

    The important property of SC is that all memory accesses become visible
    outside of the core in program order--which allows everyone to understand
    your program order.

    But, then again, one is left with one of several options:
    Ask that people use a mutex whenever accessing any resource that may be modified between threads and where such modifications are functionally important;
    Or, alternatively, use a message passing scheme, where message passing
    can potentially be done more cheaply than a mutex (can be done in
    premise using No-Cache memory rather than needing an L1 flush).

    Well, or use write-through caching, which would mostly have the merit of allowing for cheaper cache flushes (at the expense of being slower in general than write-back caching).


    In the case of RISC-V, there are the FENCE and FENCE.I instructions.
    In my implementation, they are user-land only, and need to be
    implemented as, say:
    FENCE traps, and then performs an L1 flush.
    FENCE.I traps, and then performs an L1 flush,
    then also flushes the I$.

    There is CBO, which allows:
    CBO.FLUSH Reg
    Where, Reg gives the address of a cache line, which is then flushed from
    the L1 D$. This at least mapped over.

    My 66000 does all of that without any FENCE instructions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 30 20:23:56 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    John Savard wrote:
    On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
    On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:

    Or in other words, if you can decode K-instructions per cycle, you'd
    better be able to execute K-instructions per cycle--or you have a
    serious blockage in your pipeline.

    Not a typo--the part of the pipeline which is <dynamically> narrowest is
    the part that limits performance. I suggest strongly that you should not
    make/allow the decoder to play that part.

    I agree - and strongly, too - that the decoder ought not to be the part that limits performance.

    But what I quoted says that the execution unit ought not to be the part that limits performance, with the implication that it's OK if the decoder does instead. That's why I said it must be a typo.

    So I think you need to look a second time at what you wrote; it's natural for people to see what they expect to see, and so I think you looked at it, and didn't see the typo that was there.

    John Savard

    There are two kinds of stalls:
    stalls in the serial front end I-cache, Fetch or Decode stages because
    of *too little work* (starvation due to input latency),
    and stalls in the back end Execute or Writeback stages because
    of *too much work* (resource exhaustion).

    DECODE latency increases when:
    a) there is no instruction(s) to decode
    b) there is no address from which to fetch
    c) when there is no translation of the fetch address

    a) is a cache miss
    b) is an indirect control transfer
    c) is a TLB miss

    And there may be additional cases of instruction buffer hiccups.

    The front end stalls inject bubbles into the pipeline,
    whereas back end stalls can allow younger bubbles to be compressed out.

    How In-Order your thinking is. GBOoO machines do not inject bubbles.

    If I have to stall, I want it in the back end.

    If I have to stall I want it based on "realized" latency.

    It has to do with catching up after a stall.

    Which is why you do not inject bubbles...

    If a core stalls for 3 clocks, then in order to average 1 IPC
    it must retire 2 instructions per clock for the next 3 clocks.
    And it can only do that if it has a backlog of work ready to execute.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Dec 30 14:50:04 2025
    From Newsgroup: comp.arch

    On 12/30/2025 1:36 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    [...]
    One usual downside is that to utilize a 16-bit ISA with a smaller
    register space, one needs to reuse registers more frequently, which then
    reduces ILP due to register conflicts. So, smaller code at the expense
    of worse performance.

    For designs like RISC-V C and Thumb2, there is always the option to
    use the uncompressed instruction. So you may tune your RISC-V
    compiler to prefer registers r8-r15 for those pseudo-registers that
    occur in instructions where such a register allocation may lead to a compressed encoding.

Write-after-read and write-after-write do not reduce the IPC of OoO implementations. On the contrary, write-after-read may be beneficial
    by releasing the old physical register for the register name. And
    designing a compressed CPU instruction set for in-order processing is
    not a good idea for general-purpose computing.


Though, the main place where compressed instructions are likely to
bring meaningful benefit is on small in-order machines.

    Any OoO machine is also likely to have a lot of RAM and a decent sized
    I$, so much of any benefit is likely to go away in this case.


    Except maybe if the compiler produces unreasonable levels of bloat
    (excessive inlining or unrolling), but this is a separate issue.

    Similar logic to predication can be used for loop-unrolling though:
If the code in question is not suitable for predication (too much code
complexity, etc.), it is likely also not suitable for automatic unrolling.


    Things like ALU status flags aren't free either.

    Yes, they cost their own renaming resources.

    Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.

    Major limitations here being more:
    Things like register forwarding cost have non-linear scaling;
    For an in-order machine, usable ILP drops off very rapidly;
    ...

    ILP is a property of a program. I assume that what you mean is that
    the IPC benefits of more width have quickly diminishing returns on
    in-order machines.


    The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is affected by the processor implementation.

If one has a naive in-order machine that does not rename registers, and
    can perform a maximum of 1 memory access per clock-cycle, there is a
    pretty rapid drop off (mostly because in effect, one runs into
    situations that could not be addressed without either renaming registers
    or performing multiple memory accesses in a single cycle).

    One may also find though that, say, an L1 D$ capable of 2 accesses per
    cycle quickly runs into "there be dragons here" territory.



    There seems to be a local optimum between 2 and 3.


    Say, for example, if one had an in-order machine with 5 ALUs, one would
    be hard pressed to find much code that could actually make use of the 5
    ALUs. One can sorta make use of 3 ALUs, but even then, the 3rd lane is
    more often useful for spare register ports and similar (with 3-wide ALU
being a minority case).

We have some interesting case studies: The Alpha 21164(a) and the ARM Cortex-A53 and A55. They are all in-order designs, their numbers of functional units are pretty similar, and, in particular, they all have
    2 integer ALUs. But the 21164 can decode and execute 4 instructions
    per cycle, while the Cortex-A53 and A55 are only two-wide. My guess
    is that this is due to the decoding cost of ARM A32/T32 and A64
    (decoders for two instruction sets, one of which has 16-bit and 32-bit instructions).

    The Cortex-A55 was succeeded by the A510, which is three-wide, and
    that was succeeded by the A520, which is three-wide with two ALUs and supports only ARM A64.

Widening the A510, which still supports both instruction sets, is
    (weak) counterevidence for my theory about why A53/A55 are only
    two-wide at decoding. The fact that the A520 returns to two integer
    ALUs indicates that the third integer ALU provides little IPC benefit
    in an in-order design.


    Yeah, seems to be a general pattern.


    Mine is 3-wide with 3 ALUs, but this is partly because apart from having
3 ALUs, the 3rd lane couldn't really do anything at all (and the ALUs
    can't migrate between lanes in my case).

    But, as for whether 3-wide ALU scenarios are common? Not particularly in
    most code.

    But, as noted, more often the 3rd lane serves as spare memory ports.
    This allows co-executing an ALU op with an indexed-store, which is a
    more common scenario.

Otherwise, dropping the 3rd ALU would leave a 6R2W machine; but
then one can debate whether the two extra read ports are really worth the
cost, and drop back to 4R2W, but then there is the limitation that
this can no longer co-execute an ALU op with an indexed store or
similar, which will negatively affect performance.

    Though, 3-wide with 6R3W offered enough advantages over 2-wide 4R2W to
    make it generally worthwhile. May make sense in the ISA design to assume
    at least 4R2W though, as limiting everything to assuming a maximum of
    2R1W (as RISC-V has usually done for integer instructions) is overly
    limiting.


    In my case, the lanes are asymmetric:
    Lane 1: Can do any operation;
    Memory access, Branches, Multiply/Divide, etc, are Lane 1 exclusive.
    Branches are also Scalar-Only.
    Lane 2: More limited,
    whether some operations are allowed depends on conflict with Lane 1.
    For example, Lane 1 or 2 can use FPU, but not both at the same time;
    Unless the operation pairs into a valid 128-bit SIMD operation.
    Lane 3: Only does basic ALU instructions and similar.
    So, MOV, LI, ADD/SUB/AND/OR/XOR, etc.
    Excludes the ability to do Shifts or other more complex ALU ops.
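
A C rendering of that lane table (a sketch; the enum and names are made
up, and the "FPU in lane 1 or 2 but not both" pairing rule needs a
separate cross-lane check that a per-op mask cannot express):

enum opclass { OP_MEM, OP_BRANCH, OP_MULDIV, OP_FPU, OP_SHIFT, OP_BASIC_ALU };

static unsigned lanes_for(enum opclass c)
{
    switch (c) {
    case OP_MEM: case OP_BRANCH: case OP_MULDIV:
        return 0x1;  /* lane 1 only */
    case OP_FPU: case OP_SHIFT:
        return 0x3;  /* lanes 1-2; lane 3 excluded */
    case OP_BASIC_ALU:
        return 0x7;  /* any lane */
    }
    return 0;
}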

For superscalar, the CPU will co-execute instructions whenever it encounters
a pattern it can allow. This is controlled via flag checking, then behaving
as if bundles had been encoded using WEX flags.

    Implicitly, this can be stored as a bit pattern for each spot in the cache-line:
    (0): Odd-Size (16 or 48-bit)
    (1): WEX|Jumbo
    (2): WEX
    (3): Jumbo

    So, say:
    0: 32-bit, scalar
    1: 16-bit, scalar
    2: 32-bit, scalar (alternate)
    3: 48-bit, scalar
    4: 64-bit, 2x 32-bit (unused)
    5: 80-bit scalar (possible)
    6: 64-bit, 2x 32-bit
    7: 96-bit, 3x 32-bit
    8: 64-bit jumbo (unused)
    9: -
    A: 64-bit jumbo (unused)
    B: -
    C: 64-bit jumbo (unused)
    D: -
    E: Jumbo Prefix (typical)
    F: 96-bit Jumbo (J52I)

    Potentially, this could be reduced to 3 bits, say:
    000: 16 bit
    001: 32 bit
    010: 48 bit
    011: 64 bit
    100: 80 bit
    101: 96 bit
    110: 128 bit
    111: 192 bit

    Though, the actual output from the I$ is expressed as a length in a
    linear multiple of 16 bits.
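
As a sketch, the 3-bit form maps to that fetch length (in 16-bit units)
with a small table; note the last two entries break the otherwise linear
progression:

static const unsigned char len16[8] = { 1, 2, 3, 4, 5, 6, 8, 12 };
/* fetch length in bits = 16 * len16[code] */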


But, the existing logic was based on flags rather than explicitly
storing the size; and the distinction between 2- or 3-wide ops and
jumbo-prefixed forms possibly does not matter past the fetch-length
determination logic.

    Further down the path, the decoder sees that it was handed 2 or 3 instructions, and deals with this. Originally, it did its own redundant
    length determination. I have now switched to driving this logic entirely
    off the fetch-length given from the I$ (and captured pipeline state for
    which ISA mode is in effect and similar).

    Well, and with the further simplification that now the logic paths for
    jumbo 96 handling and 3-wide fetch have "actually" been merged.


    There is still the wonk that the decoder needs to re-route signals to
    lanes differently depending on fetch width.

    If I were doing a new core (or a possible later rework) it would make
more sense to have the IF stage right-align the fetch.

    Or, Say:
    Op16 => - - Op16A (Repack?)
    Op32 => - - Op32A
    Op48 => - Op48B Op48A (Repack)
    Op32B Op32A => - Op32B Op32A
    Op32C Op32B Op32A => Op32C Op32B Op32A


    Or, maybe one could make a case that I have done this stuff backwards
    and Lane1 should come before 2 and 3 rather than after, but alas. For
    dealing with prefix decoding though, etc, it makes sense that Lane 1
    should always be the last instruction in the fetch, rather than the first.

    One would argue that maybe prefixes are themselves wonky, but otherwise
    one needs:
    Instructions that can directly encode the presence of large immediate
    values, etc;
    Or, the use of suffix-encodings (which is IMHO worse than prefix
    encodings; at least prefix encodings make intuitive sense if one views
    the instruction stream as linear, whereas suffixes add weirdness and are effectively retro-causal, and for any fetch to be safe at the end of a
    cache line one would need to prove the non-existence of a suffix; so
    better to not go there).


    For the most part, superscalar works the same either way, with similar efficiency. There is a slight efficiency boost if it would be possible
    to dynamically reshuffle ops during fetch. But, this is not currently a
    thing in my case.

This latter case would apply if, say, a MEM op is followed by
non-dependent ALU ops, which under the current superscalar handling
will not co-execute, but which could in theory be swapped so as to
allow them to co-execute.


    ...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Dec 30 14:59:22 2025
    From Newsgroup: comp.arch

    On 12/30/2025 12:00 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 12/28/2025 4:41 PM, BGB wrote:
    [...]

    Also, if using something like LOCK CMPXCHG you MUST make sure to align
    and pad your relevant data structures to a l2 cache line.

    That may not be necessary if there is otherwise no false sharing in
    the same cache line. Yes, the operand should be naturally aligned,
    (which ensures it is entirely contained within a single cache line),
    but there's no reason that other data cannot be stored in the same
    cache line, so long as it is unlikely to be accessed by a competing
    thread.


    Yes, or the "small brain" option of just making the mutex larger than
    the size of the cache line and putting the relevant part in the middle...

struct PaddedMutex_s {
    u64 pad1, pad2, pad3;
    u64 real_part;
    u64 pad4, pad5, pad6;
};

    Then say (assuming a 32 byte cache line), no non-pad values can be in
    the same cache line as real_part.

A little bigger for a 64-byte cache line, but same general idea.
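
With C11 the padding can also be had implicitly; a sketch, assuming a
64-byte line (alignas on the member raises the struct's alignment, and
sizeof then rounds up to a multiple of it):

#include <stdalign.h>
#include <stdint.h>

struct AlignedMutex_s {
    alignas(64) uint64_t real_part;
    /* struct is 64-byte aligned and sizeof == 64, so real_part
       gets its cache line to itself */
};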

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 30 21:36:29 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 12/30/2025 1:36 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    [...]
    One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then reduces ILP due to register conflicts. So, smaller code at the expense
    of worse performance.

    For designs like RISC-V C and Thumb2, there is always the option to
    use the uncompressed instruction. So you may tune your RISC-V
    compiler to prefer registers r8-r15 for those pseudo-registers that
    occur in instructions where such a register allocation may lead to a compressed encoding.

Write-after-read and write-after-write do not reduce the IPC of OoO implementations. On the contrary, write-after-read may be beneficial
    by releasing the old physical register for the register name. And designing a compressed CPU instruction set for in-order processing is
    not a good idea for general-purpose computing.


Though, the main place where compressed instructions are likely to
bring meaningful benefit is on small in-order machines.

    Coincidentally; this is exactly where a fatter-ISA wins big::
    compare::

    LDD R7,[IP,R3<<3,DISP32]

    1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against

AUIPC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,lo(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)

    5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.

    Any OoO machine is also likely to have a lot of RAM and a decent sized
    I$, so much of any benefit is likely to go away in this case.

    s/go away/greatly ameliorated/

    ------------------------
    ILP is a property of a program. I assume that what you mean is that
    the IPC benefits of more width have quickly diminishing returns on
    in-order machines.


    The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is affected by the processor implementation.

    I agree that ILP is more aligned with code than with program.
    {see above example where 1 instruction does the work of 5}
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Dec 30 23:36:49 2025
    From Newsgroup: comp.arch

    On Tue, 30 Dec 2025 17:27:22 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I did not write anything about the clue of Apple. I don't know much
    about the CPUs by Nvidia and Fujitsu. But if there was significant performance to be had by adding a weakly-ordered mode, wouldn't
    especially Fujitsu with its supercomputer target have done it?


Fujitsu had a very strong reason to implement TSO on the A64FX: source-level compatibility with SPARC64 VIIIfx and XIfx.
    I wouldn't be surprised if apart from that they have SPARC->ARM Rosetta
    for some customers, but that's relatively minor factor. Supercomputer
    users are considered willing to recompile their code. But much less
    willing to re-write it.

They also seemed to have ambitions to tackle the much bigger and more
profitable (than supercomputers) general-purpose server market with
A64FX derivatives, likely first and foremost targeting their existing
SPARC64 customers. So far the plan hasn't materialized, but it could
have influenced early A64FX design decisions.

Besides, as I mentioned in my other post, the A64FX memory subsystem is slow (latency-wise; throughput-wise it is very good). I don't know what
    influence that fact has, but I can hand-wave that it shifts the balance
    of cost toward TSO. Also, cache lines are unusually wide (256B), so it
    is possible that RFO shortcuts allowed by weaker MOM are less feasible.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Dec 30 23:49:38 2025
    From Newsgroup: comp.arch

    On Tue, 30 Dec 2025 21:36:29 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    BGB <cr88192@gmail.com> posted:


Though, the main place where compressed instructions are likely to
bring meaningful benefit is on small in-order machines.

    Coincidentally; this is exactly where a fatter-ISA wins big::
    compare::

    LDD R7,[IP,R3<<3,DISP32]

    1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against

AUIPC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,lo(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)

    5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit
    minimum.
    Any OoO machine is also likely to have a lot of RAM and a decent
    sized I$, so much of any benefit is likely to go away in this case.



    And where do we have 95% of those small in-order machines? We have them
    in flash-based micro-controllers, more often than not without I$, more
often than not running at a higher clock than is sustainable without wait
states by their program flash with a 32-bit data bus. In other words,
bottlenecked by instruction fetch before anything else, including
    decode.

BGB trained his intuition on soft cores in FPGA, where trade-offs are
completely different. I am a heavy user of soft cores too. But I realize
that 32-bit MCU cores outsell soft cores by more than an order of
magnitude, and quite likely by more than 2 orders of magnitude.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Dec 30 14:23:39 2025
    From Newsgroup: comp.arch

    On 12/30/2025 12:59 PM, BGB wrote:
    On 12/30/2025 12:00 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 12/28/2025 4:41 PM, BGB wrote:
    [...]

    Also, if using something like LOCK CMPXCHG you MUST make sure to align
    and pad your relevant data structures to a l2 cache line.

    That may not be necessary if there is otherwise no false sharing in
the same cache line. Yes, the operand should be naturally aligned,
    (which ensures it is entirely contained within a single cache line),
    but there's no reason that other data cannot be stored in the same
    cache line, so long as it is unlikely to be accessed by a competing
    thread.


    Yes, or the "small brain" option of just making the mutex larger than
    the size of the cache line and putting the relevant part in the middle...

struct PaddedMutex_s {
    u64 pad1, pad2, pad3;
    u64 real_part;
    u64 pad4, pad5, pad6;
};

    Then say (assuming a 32 byte cache line), no non-pad values can be in
    the same cache line as real_part.

    Little bigger for a 64 byte cache line, but same general idea.

    :^) Yeah. That can help. I was referring to the anchor of, say a
    lock-free stack. That anchor better be aligned and padded. An anchor:

struct ct_anchor
{
    struct node* next;
    uintptr_t ver;
};

ct_anchor is (better be ;^) a double word ripe for a DWCAS, say LOCK
CMPXCHG8B on a 32-bit system.

That ct_anchor needs to be properly aligned and padded up to an L2 cache
line. LL/SC is a different story. The version is not needed because a
proper LL/SC gets around ABA. But! That single word should be padded and
aligned on a reservation granule.
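
As a sketch of the DWCAS itself via the GCC/Clang __atomic builtins
(anchor_cas is a made-up name; on x86-64 the 16-byte case may need
-mcx16, or it falls back to libatomic):

#include <stdbool.h>

static bool
anchor_cas(struct ct_anchor *a, struct ct_anchor *expected,
           struct ct_anchor desired)
{
    /* compares *a against *expected; on success stores 'desired',
       on failure copies the current value back into *expected */
    return __atomic_compare_exchange(a, expected, &desired, false,
                                     __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}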

Now, the struct nodes. Heck, they can be L2-cache-line aligned and
padded regions of memory. Say, an L2 cache-block lock-free allocator.

Fwiw, here is some of my old test code for a region allocator that can
help align things. This was from before standard alignment support
(say, _Alignof) was widely available:


    #if ! defined (RALLOC_H)
    # define RALLOC_H
    # if defined (__cplusplus)
    extern "C" {
    # endif
    /**************************************************************/




    #include <stddef.h>
    #include <assert.h>




    #if defined (_MSC_VER)
    /* warning C4116: unnamed type definition in parentheses */
    # pragma warning (disable : 4116)
    #endif




    #if ! defined (NDEBUG)
    # include <stdio.h>
    # define RALLOC_DBG_PRINTF(mp_exp) printf mp_exp
    #else
    # define RALLOC_DBG_PRINTF(mp_exp) ((void)0)
    #endif




    #if ! defined (RALLOC_UINTPTR_TYPE)
    # define RALLOC_UINTPTR_TYPE size_t
    #endif




    typedef RALLOC_UINTPTR_TYPE ralloc_uintptr_type;


typedef char ralloc_static_assert[
    sizeof(ralloc_uintptr_type) == sizeof(void*) ? 1 : -1
];




enum ralloc_align_enum {
    ALIGN_ENUM
};


struct ralloc_align_struct {
    char pad;
    double type;
};


union ralloc_align_max {
    char char_;
    short int short_;
    int int_;
    long int long_;
    float float_;
    double double_;
    long double long_double_;
    void* ptr_;
    void* (*fptr_) (void*);
    enum ralloc_align_enum enum_;
    struct ralloc_align_struct struct_;
    size_t size_t_;
    ptrdiff_t ptrdiff_t_;
};


/* Alignment of a type, via the classic offsetof-of-a-padded-struct
   trick (pre-C11; today _Alignof would do). */
#define RALLOC_ALIGN_OF(mp_type) \
    offsetof( \
        struct { \
            char pad_RALLOC_ALIGN_OF; \
            mp_type type_RALLOC_ALIGN_OF; \
        }, \
        type_RALLOC_ALIGN_OF \
    )


    #define RALLOC_ALIGN_MAX RALLOC_ALIGN_OF(union ralloc_align_max)


/* Round mp_ptr up to the next multiple of mp_align (a power of two). */
#define RALLOC_ALIGN_UP(mp_ptr, mp_align) \
    ((void*)( \
        (((ralloc_uintptr_type)(mp_ptr)) + ((mp_align) - 1)) \
        & ~(((mp_align) - 1)) \
    ))


#define RALLOC_ALIGN_ASSERT(mp_ptr, mp_align) \
    (((void*)(mp_ptr)) == RALLOC_ALIGN_UP(mp_ptr, mp_align))




struct region {
    unsigned char* buffer;
    size_t size;
    size_t offset;
};


static void
rinit(
    struct region* const self,
    void* buffer,
    size_t size
) {
    self->buffer = buffer;
    self->size = size;
    self->offset = 0;

    RALLOC_DBG_PRINTF((
        "rinit(%p) {\n"
        " buffer = %p\n"
        " size = %lu\n"
        "}\n\n\n",
        (void*)self,
        buffer,
        (unsigned long int)size
    ));
}


static void*
rallocex(
    struct region* const self,
    size_t size,
    size_t align
) {
    unsigned char* align_buffer;
    size_t offset = self->offset;
    unsigned char* raw_buffer = self->buffer + offset;

    if (! size) {
        size = 1;
    }

    if (! align) {
        align = RALLOC_ALIGN_MAX;
    }

    assert(align == 1 || RALLOC_ALIGN_ASSERT(align, 2));

    align_buffer = RALLOC_ALIGN_UP(raw_buffer, align);

    assert(RALLOC_ALIGN_ASSERT(align_buffer, align));

    size += align_buffer - raw_buffer;

    if (offset + size > self->size) {
        return NULL;
    }

    self->offset = offset + size;

    RALLOC_DBG_PRINTF((
        "rallocex(%p) {\n"
        " size = %lu\n"
        " alignment = %lu\n"
        " origin offset = %lu\n"
        " final offset = %lu\n"
        " raw_buffer = %p\n"
        " align_buffer = %p\n"
        " size adjustment = %lu\n"
        " final size = %lu\n"
        "}\n\n\n",
        (void*)self,
        (unsigned long int)size - (align_buffer - raw_buffer),
        (unsigned long int)align,
        (unsigned long int)offset,
        (unsigned long int)self->offset,
        (void*)raw_buffer,
        (void*)align_buffer,
        (unsigned long int)(align_buffer - raw_buffer),
        (unsigned long int)size
    ));

    return align_buffer;
}


#define ralloc(mp_self, mp_size) \
    rallocex((mp_self), (mp_size), RALLOC_ALIGN_MAX)

#define ralloct(mp_self, mp_count, mp_type) \
    rallocex( \
        (mp_self), \
        sizeof(mp_type) * (mp_count), \
        RALLOC_ALIGN_OF(mp_type) \
    )


static void
rflush(
    struct region* const self
) {
    self->offset = 0;

    RALLOC_DBG_PRINTF((
        "rflush(%p) {}\n\n\n",
        (void*)self
    ));
}




    #undef RALLOC_DBG_PRINTF
    #undef RALLOC_UINTPTR_TYPE




    /**************************************************************/
    # if defined (__cplusplus)
    }
    # endif
    #endif
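
A usage sketch (not part of the original header; sizes arbitrary):

static unsigned char storage[1024];

int main(void)
{
    struct region r;
    double* xs;

    rinit(&r, storage, sizeof(storage));
    xs = ralloct(&r, 16, double);  /* 16 doubles, double-aligned */
    if (! xs) return 1;
    xs[0] = 1.0;
    rflush(&r);  /* reset; the storage can be reused */
    return 0;
}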

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Dec 30 16:36:49 2025
    From Newsgroup: comp.arch

    On 12/30/2025 3:49 PM, Michael S wrote:
    On Tue, 30 Dec 2025 21:36:29 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    BGB <cr88192@gmail.com> posted:


Though, the main place where compressed instructions are likely to
bring meaningful benefit is on small in-order machines.

    Coincidentally; this is exactly where a fatter-ISA wins big::
    compare::

    LDD R7,[IP,R3<<3,DISP32]

    1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against

AUIPC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,lo(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)

    5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit
    minimum.
    Any OoO machine is also likely to have a lot of RAM and a decent
    sized I$, so much of any benefit is likely to go away in this case.



    And where do we have 95% of those small in-order machines? We have them
    in flash-based micro-controllers, more often than not without I$, more
often than not running at a higher clock than is sustainable without wait
states by their program flash with a 32-bit data bus. In other words,
bottlenecked by instruction fetch before anything else, including
    decode.


    Yes, I will not disagree with this.

    This was actually partly the point I was trying to make, just expressed
    from the other direction.


BGB trained his intuition on soft cores in FPGA, where trade-offs are
completely different. I am a heavy user of soft cores too. But I realize
that 32-bit MCU cores outsell soft cores by more than an order of
magnitude, and quite likely by more than 2 orders of magnitude.


    I don't see where the disagreement is here, exactly.


I was not saying compressed instructions don't make sense for
microcontrollers; rather, this is the main place they *do* make sense.

Rather, it is on desktop PC and server class systems (or, IOW, the ones
likely to have OoO processors) where compressed instructions would
likely stop bringing all that much benefit, as the I$ is bigger and RAM
is basically unlimited (at least provided your code density isn't IA-64
levels of bad).

    So, for your Application Class processors, 32-bit only or 32/64/96 or
    similar, would likely make more sense.


    But, on the other side, say, RV32IMC or similar is a good choice for a microcontroller. But, for a PC or similar, not so much.


    So, the point is that it does make sense to design a compressed ISA to optimize around the constraints of small in-order CPUs, as these are the
    place where compressed instructions are most likely to be "actually useful".


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Dec 30 22:57:29 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 30 Dec 2025 11:13:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    Do you happen to have benchmarks that compare performance of Alpha
    EV5 vs in-order Cortex-A ?

LaTeX benchmark results (lower is faster):

Alpha:
- 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5)            8.1
ARM A32/T32:
- Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8                 5.46
ARM A64:
- Rockpro64 (1416MHz Cortex A53) Debian 9 (Stretch)            3.24
- Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04                  2.488
- Odroid C2 (1536MHz Cortex A53) Ubuntu 16.04                  2.32
- Rock 5B (1805MHz A55) Debian 11 (texlive-latex-recommended)  2.105

A problem with the LaTeX benchmark is that its performance is
    significantly influenced by the LaTeX installation (newer versions
    need more instructions, and having more packages needs more
    instructions). But it's the only benchmark results I have.

    - anton

    Thank you.
Two 64-bit A53 results are about the same as EV5 clock for clock, and one
result is significantly better.
So, either wide in-order is indeed not a bright idea, or the 21164 suffers
because of the inferiority of the Alpha ISA relative to ARM64.

Hard to tell from these results. In addition to the problems
mentioned above there are also differences in cache configuration to
consider. And the A55 does quite a bit better in IPC than the A53,
although it superficially has the same resources.

    The A5x cores are pretty old now. Do you have any results for
    Neoverse-V2, at say 2Ghz?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Dec 30 14:58:24 2025
    From Newsgroup: comp.arch

    On 12/30/2025 11:10 AM, BGB wrote:
    [...]

    But, then again, weak model is cheaper to implement and generally
    faster, although explicit synchronization is annoying and such a model
    is incompatible with "lock free data structures" (which tend to
    implicitly assume that memory accesses occur in the same order as
    written and that any memory stores are immediately visible across threads).

    Fwiw, a weak memory model is totally compatible with lock-free data structures. A weak model tends to have the necessary memory barriers to
    make them work. Have you ever used a SPARC in RMO mode? Acquire membar
    ala std::memory_order_acquire is basically a MEMBAR #LoadStore |
    #LoadLoad. A release is MEMBAR #LoadStore | #StoreStore. Those can be
    used for the implementation of a mutex. Notice how acquire and release
    never need #StoreLoad ordering?

    The point is that once we have this flexibility, a lock/wait free algo
    can use the right membars for the job. Ideally, the weakest membars they
    can use to ensure they are correct in their logic.
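
For instance, a minimal C11 spinlock needs nothing stronger than those
two orderings (a sketch, not from the post):

#include <stdatomic.h>

typedef struct { atomic_flag f; } spinlock;  /* init with ATOMIC_FLAG_INIT */

static void sl_lock(spinlock* l)
{
    /* acquire: later memory ops cannot float above taking the lock */
    while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
        ;  /* spin */
}

static void sl_unlock(spinlock* l)
{
    /* release: earlier memory ops cannot sink below the unlock */
    atomic_flag_clear_explicit(&l->f, memory_order_release);
}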

    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Dec 30 15:07:22 2025
    From Newsgroup: comp.arch

    On 12/30/2025 2:58 PM, Chris M. Thomasson wrote:
    On 12/30/2025 11:10 AM, BGB wrote:
    [...]

    But, then again, weak model is cheaper to implement and generally
    faster, although explicit synchronization is annoying and such a model
    is incompatible with "lock free data structures" (which tend to
    implicitly assume that memory accesses occur in the same order as
    written and that any memory stores are immediately visible across
    threads).

    Fwiw, a weak memory model is totally compatible with lock-free data structures. A weak model tends to have the necessary memory barriers to
    make them work. Have you ever used a SPARC in RMO mode? Acquire membar
    ala std::memory_order_acquire is basically a MEMBAR #LoadStore |
    #LoadLoad. A release is MEMBAR #LoadStore | #StoreStore. Those can be
    used for the implementation of a mutex. Notice how acquire and release
    never need #StoreLoad ordering?

    The point is that once we have this flexibility, a lock/wait free algo
    can use the right membars for the job. Ideally, the weakest membars they
    can use to ensure they are correct in their logic.

    [...]


Wrt TSO, aka x86/x64..., even that is NOT strong enough to give
#StoreLoad, aka ordering of a store followed by a load to another
location. You need a LOCK'ed RMW or the MFENCE instruction.
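
The classic store-buffer litmus shows the difference (a sketch; with
the seq_cst defaults below, compilers emit MFENCE or a LOCK'ed op and
the 0/0 outcome is forbidden; weaken the orders to relaxed and the
store buffer makes it observable on x86):

#include <stdatomic.h>

atomic_int X, Y;  /* both start at 0 */
int r0, r1;

void thread1(void) { atomic_store(&X, 1); r0 = atomic_load(&Y); }
void thread2(void) { atomic_store(&Y, 1); r1 = atomic_load(&X); }

/* r0 == 0 && r1 == 0 would mean each load passed the other thread's
   store, i.e. exactly the #StoreLoad reordering discussed above */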
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Dec 30 23:20:02 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    TSO is cycle-wasteful.

    This has been experimentally verified for Apple Silicon:

    https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf

    "All swans are white" has been "experimentally verified" by finding one
    white swan. The existence of black swans shows that such
    "experimental verifications" are fallacies.

    If you have any counter-examples, please feel free to cite them.

    The fact that nobody had counterexamples for the theory "All swans are
    white" for a long while did not make that theory true.


    But I have actually seen black swans in Australia and elsewhere.
    Here's a picture
    <https://en.wikipedia.org/wiki/File:Black_Swan_in_Flight_Crop.jpg>

    Your black swan analogy is a red herring. There is no theory of
    evolution that says swans have to be white, and this is indeed a
    very rare color for water birds.

    On the other hand, the theory that additional constraints on
    memory ordering incur additional overhead is very plausible.


    Actually the existence of a weak-memory-model mode on specific
    hardware makes it very likely that TSO is slower than the weak model
    on that hardware. If TSO was implemented to provide the same speed,
    there would be no need for also providing a weaker memory ordering
    mode on that hardware.

That makes little sense. TSO fulfills all the requirements of the
    ARM memory model, and adds some on top. It is possible to create
    an ARM CPU which uses TSO, as you wrote below. If that had
    the same performance as a CPU running on the pure ARM memory model,
    there would be no reason to implement the ARM memory model at all.

    Exactly. So you will only have a TSO mode and a weak mode, if, for
    your particular implementation, the weak mode provides a performance advantage over the TSO mode. So it is unlikely that you ever see an implementation with such a mode bit where TSO mode has the same
    performance, unless the mode bit is there only for backwards
    compatibility and actually has no effect (what TRAPB turned into on
    the 21264).

    We have ARM implementations with TSO without mode bit. Do you really
    need to add a mode bit to them that has no effect and therefore
    produces 0% difference in performance to accept them as
    counterexample?

I do not accept them as a counterexample, for two reasons:
First, these do not seem to be high performance (from what
Michael S wrote), and second, there are no benchmarks checking
both possibilities (as is the case for Apple Silicon).

    Fujitsu might have accepted a performance loss for other reasons, for
    example for ease of porting from x86.


    So, two alternatives:

    a) Apple engineers did not have the first clue what they were doing
    b) Apple engineers knew what they were doing

    My theory is that Apple engineers first implemented the weak model,
    because that's what the specification says, and they were not tasked
    to implement something better; there is enough propaganda for weak
    memory models around that a hardware engineer might think that that's
    the way to go. Later (for the M1) the Apple hardware designers were
    asked to implement TSO, and they did not redo the whole memory model
    from the ground up, but by doing relatively small changes. And of
    course the result is that their TSO mode is slower than their weak
    mode, just as -mieee-float on a 21164 is slower than code compiled
    without that mode.

    So, you are assuming incompetent project management on the side
    of Apple. That falls into category b); thanks for confirming that.

It is still a wild theory which happens to fit your personal
prejudice, but is unsupported by any known facts (unless you have
a source; if so, please cite).

    But these chips were M1, and there have been quite a lot of changes
    up to M4; three releases for Apple to get things right.

You seem to be doing a lot of benchmarks; I hope you could find it
in your institute budget to buy an M4 box and run such benchmarks
yourself. You can then reproduce the results of the publication
above, or not reproduce them. At least, that would give another
datapoint.

    However, there is another possibility for an experimental
    verification. IIRC, RISC-V offers two memory models, one is TSO,
    the other is weakly ordered. (If I understand your point, you
    think there should no performance difference between the two,
    and everybody should just use the Ztso extension. But RISC-V
    made a lot of strange decision, so I am not counting that as an
    argument here). Tt might be possible to find or write a RISC-V core
    which implements both the standard memory ordering, depending on a configuration switch, and then compare the performance of the two.
    Might make a nice master's thesis.


    Given that Apple silicon seems to be competently done

    The 21164 was the fastest CPU during some of its days. And yet the
    21264 was faster without needing TRAPB.

    And this is relevant to today's chips... how? The Alpha profited
    from extremely high frequency made possible by good floor
engineering.


    I personally
    think that option a) is the better one. You obviously prefer
    option b).

    An absurdity typical of you.

    Which you confirmed, above.

    But I am getting the impression that your reading comprehension
    suffers when you read my posts...


    Note that "some Nvidia and Fujitsu [ARM architecture] implementations
    run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
    that the Fujitsu implementations of the ARM architecture are used in
    supercomputers, it is unlikely that their TSO implementation is
    cycle-wasteful.

    So, you're saying that Apple is clueless, and that Nvidia and Fujitsu
    make that decision solely on the basis of speed?

    I did not write anything about the clue of Apple. I don't know much
    about the CPUs by Nvidia and Fujitsu. But if there was significant performance to be had by adding a weakly-ordered mode, wouldn't
    especially Fujitsu with its supercomputer target have done it?

    Not necessarily.

    If hardware designers get tasked with implementing TSO (or sequential consistency) with modern transistor budgets, they hopefully come up
    with different solutions to various problems than if they are tasked
    to do a weak model with a slow TSO option.

    Why should anybody task them with making TSO slower than it
    needs to be? That makes no sense.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Dec 31 01:26:06 2025
    From Newsgroup: comp.arch

    On Tue, 30 Dec 2025 22:57:29 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 30 Dec 2025 11:13:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    Do you happen to have benchmarks that compare performance of
    Alpha EV5 vs in-order Cortex-A ?

LaTeX benchmark results (lower is faster):

Alpha:
- 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5)            8.1
ARM A32/T32:
- Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8                 5.46
ARM A64:
- Rockpro64 (1416MHz Cortex A53) Debian 9 (Stretch)            3.24
- Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04                  2.488
- Odroid C2 (1536MHz Cortex A53) Ubuntu 16.04                  2.32
- Rock 5B (1805MHz A55) Debian 11 (texlive-latex-recommended)  2.105

A problem with the LaTeX benchmark is that its performance is
    significantly influenced by the LaTeX installation (newer versions
    need more instructions, and having more packages needs more
    instructions). But it's the only benchmark results I have.

    - anton

    Thank you.
Two 64-bit A53 results are about the same as EV5 clock for clock, and one
result is significantly better.
So, either wide in-order is indeed not a bright idea, or the 21164 suffers
because of the inferiority of the Alpha ISA relative to ARM64.

Hard to tell from these results. In addition to the problems
mentioned above there are also differences in cache configuration to
consider. And the A55 does quite a bit better in IPC than the A53,
although it superficially has the same resources.

    The A5x cores are pretty old now. Do you have any results for
    Neoverse-V2, at say 2Ghz?


We wanted to compare a 2-wide in-order core with 2 integer ALU pipes vs
a 4-wide in-order core, also with 2 integer ALU pipes.

    Of course, modern 4-wide OoO core like one in Neoverse N1 or V2 is
    massively faster than those, but this fact is not especially
    illuminating.









    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Dec 30 15:51:38 2025
    From Newsgroup: comp.arch

    On 12/30/2025 12:08 PM, MitchAlsup wrote:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Michael S wrote:
    On Tue, 30 Dec 2025 10:44:10 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
    done it flips back to causal consistency.

    TSO is cycle-wasteful.
    This has been experimentally verified for Apple Silicon:

https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote an article of 7 pages without even once mentioning
avoidance of RFO (read for ownership), which is the elephant in the
room in any discussion of the advantages of the Arm MOM/MCM over TSO.
    What is this "avoidance of RFO"?
    I can find no mention of it anywhere.

Imagine code that overwrites a whole cache line that the core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to
overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos

With a weaker MOM the core has the option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like Neoverse
N1, are able to do it.

    I can imagine heroic microarchitecture that achieves the same effect
    with TSO, but it seems that so far nobody did it.

    I don't see how a ReadForOwnership message can be avoided as it
    transfers two things: the ownership state, and the current line data.

    InvalidateForOwnership {i.e., CI followed by Allocate Cache line and
    start writing}

    Even if the core knows the whole cache line is being overwritten and
    doesn't need the line data, it still needs the Owned state transfer.

Which it can get by telling everyone else to lose that cache line.

    There would still be a request message, say TakeOwner TKO which
    has a smaller reply GiveOwner GVO message and just moves the state.
    So the reply is a few less flits.

    As I understand it...
    Independent of the ReadForOwnership message, the ARM weak coherence model
    should allow stores to other cache lines to proceed, whereas TSO would
    require younger stores to (appear to) wait until the older store completes.

    Which is why TSO is cycle wasteful.

Wrt TSO, aka x86/x64..., even that is NOT strong enough to give
#StoreLoad, aka ordering of a store followed by a load to another
location. You need a LOCK'ed RMW or the MFENCE instruction.



The weak coherence model allows the cache to use hit-under-miss for
stores because it doesn't require that stores to different locations
be seen in program order. This allows it to overlap younger store cache
hits with the older ReadForOwnership message, not eliminate it.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Dec 31 01:32:34 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    TSO is cycle-wasteful.

    This has been experimentally verified for Apple Silicon:

https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf

"All swans are white" has been "experimentally verified" by finding one
white swan. The existence of black swans shows that such
"experimental verifications" are fallacies.

    If you have any counter-examples, please feel free to cite them.

    The fact that nobody had counterexamples for the theory "All swans are
    white" for a long while did not make that theory true.


    But I have actually seen black swans in Australia and elsewhere.
Here's a picture:
https://en.wikipedia.org/wiki/File:Black_Swan_in_Flight_Crop.jpg

    Your black swan analogy is a red herring. There is no theory of
    evolution that says swans have to be white, and this is indeed a
    very rare color for water birds.

    Rare, perhaps, but there's a Snowy Egret in my front yard
    as I type this.... I wouldn't call them very rare. Then
    there are also the white pelican and the Great Egret; one
    often considers the bog standard seagull to be white (albeit
    with grey highlights).


    https://www.youtube.com/watch?v=GoeZsTLvSd0
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Dec 30 23:21:20 2025
    From Newsgroup: comp.arch

    On 12/30/2025 4:58 PM, Chris M. Thomasson wrote:
    On 12/30/2025 11:10 AM, BGB wrote:
    [...]

    But, then again, weak model is cheaper to implement and generally
    faster, although explicit synchronization is annoying and such a model
    is incompatible with "lock free data structures" (which tend to
    implicitly assume that memory accesses occur in the same order as
    written and that any memory stores are immediately visible across
    threads).

    Fwiw, a weak memory model is totally compatible with lock-free data structures. A weak model tends to have the necessary memory barriers to
    make them work. Have you ever used a SPARC in RMO mode? Acquire membar
    ala std::memory_order_acquire is basically a MEMBAR #LoadStore |
    #LoadLoad. A release is MEMBAR #LoadStore | #StoreStore. Those can be
    used for the implementation of a mutex. Notice how acquire and release
    never need #StoreLoad ordering?

    The point is that once we have this flexibility, a lock/wait free algo
    can use the right membars for the job. Ideally, the weakest membars they
    can use to ensure they are correct in their logic.


    Usually IME the people writing lock-free code don't use memory barriers
    or similar though. A lot of times IME, it is people just using volatile
or similar and trying to write things in a way that it (hopefully) won't
    go terribly wrong if two threads hit the same data at the same time.

    Like, the sort of code that works on a PC running Windows or similar,
    but try to port it to Linux on an ARM machine, and it explodes.


    Where, say, using volatile isn't sufficient for multiple cores with a
    weak model. One would need either to use barriers (though, in my case, barriers will also be slow), non-cached memory accesses, or explicit cache-line flushing.


    In this case, this leaves it often preferable to use bulk mostly
    read-only data sharing. Or, passing along data via buffers or messages
    (with some level of basic flow control).

So, not so much "let's have two threads share a doubly-linked list and
hope it doesn't all turn into a train wreck", and more "we will copy
messages onto the end of a circular buffer and advance the roving
pointers, manually flushing the lines corresponding to the parts of the
buffer that have been updated in the process".

    Say, for example:
void _flushbuffer(void *data, size_t sz)
{
    char *ct, *cte;
    ct = data; cte = ct + sz;
    while (ct < cte)
        { __mem_flushaddr(ct); ct += LINESIZE; }
}
void _memcpyout_flush(void *dst, void *src, size_t sz)
{
    memcpy(dst, src, sz);
    _flushbuffer(dst, sz);
}
void _memcpyin_flush(void *dst, void *src, size_t sz)
{
    _flushbuffer(src, sz);
    memcpy(dst, src, sz);
}
void _memcpy_flush(void *dst, void *src, size_t sz)
{
    _flushbuffer(src, sz);
    memcpy(dst, src, sz);
    _flushbuffer(dst, sz);
}

    Where, in this case, normal memcpy + flushing is likely to be faster in
    many cases than using non-cached memory.

    ...


    Granted, this works, but isn't usually what people refer to as "lock free".

Albeit copying+flushing, and then using non-cached operations to
access the buffer address pointers, is simpler to make work.
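
A sketch of that sort of scheme (made-up names; single producer and
single consumer, wrap-around splitting and flow control omitted;
assumes the flush helpers above, a 64-byte line, and that the consumer
flushes 'head' before loading it):

#define RING_SIZE 4096  /* power of two, multiple of the line size */

struct msgring {
    volatile unsigned head;  /* producer-advanced byte index */
    unsigned char pad[60];   /* keep 'head' in a line by itself */
    unsigned char data[RING_SIZE];
};

void ring_put(struct msgring *r, void *msg, size_t sz)
{
    /* copy the payload and flush the touched lines first... */
    _memcpyout_flush(r->data + (r->head & (RING_SIZE - 1)), msg, sz);
    /* ...then publish the new index, and flush it as well */
    r->head += sz;
    _flushbuffer((void *)&r->head, sizeof(unsigned));
}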


    well, and I often have the seemingly atypical approach to
    multi-threading of dividing things up by "tasks" or "areas of
    responsibility" rather than having each thread compete for use of shared
    data structures.

    But, admittedly, I tend not to use multi-threading that often (well, and
    if I do, a lot of times it is for something like a real-time video
    encoder or genetic-algorithm or similar; if there isn't some way to get acceptable speed out of a single-threaded implementation).

    But, alas, for most things I often don't need to use multi-threading (at least, not enough to justify the hassle of doing so).


Well, and even then, there is an annoyance that, for whatever reason, on
my main PC, for normal multi-threaded programs it seems to only schedule
the threads within a given process onto two logical processors at a time.
Never quite figured out what is going on here (as-is, it means needing
to spawn multiple processes to use more than 12% of the CPU, which then
means needing to deal with sockets or similar).

    Well, and then one may as well just do threads via message passing,
    since message passing still works if one spawns processes and has the processes communicating via sockets.

    But, a lot depends on what one is doing...



    [...]


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Dec 31 03:23:32 2025
    From Newsgroup: comp.arch

    <snip>

    One would argue that maybe prefixes are themselves wonky, but otherwise
    one needs:
    Instructions that can directly encode the presence of large immediate values, etc;
    Or, the use of suffix-encodings (which is IMHO worse than prefix
    encodings; at least prefix encodings make intuitive sense if one views
    the instruction stream as linear, whereas suffixes add weirdness and are effectively retro-causal, and for any fetch to be safe at the end of a
    cache line one would need to prove the non-existence of a suffix; so
    better to not go there).

I agree with this. Prefixes seem more natural, large numbers expanding
to the left; suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLIs, constant data
follows the instruction. I find constant data easier to work with that
way, and it can be processed in the same clock cycle as the decode, so
it does not add to the dynamic instruction count. Just pass the current
instruction slot plus a following area of the cache-line to the decoder.

    Handling suffixes at the end of a cache-line is not too bad if the cache already handles instructions spanning a cache line. Assume the maximum
    number of suffixes is present and ensure the cache-line is wide enough.
    Or limit the number of suffixes so they fit into the half cache-line
    used for spanning.

It is easier to handle interrupts with suffixes. The suffix can just be
treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to account
for a prefix / suffix.



    For the most part, superscalar works the same either way, with similar efficiency. There is a slight efficiency boost if it would be possible
    to dynamically reshuffle ops during fetch. But, this is not currently a thing in my case.

    This latter case would apply if, say, a MEM op is followed by non-
    dependent ALU ops, which under current superscalar handling they will
    not co-execute, but it could be possible in theory to swap the ops and
    allow them to co-execute.


    ...


    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Dec 31 15:19:00 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    Do you have any results for
    Neoverse-V2, at say 2Ghz?

    No. You can find other results at

    https://www.complang.tuwien.ac.at/franz/latex-bench

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Dec 31 10:46:18 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims an average of 35 bits per instruction, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16-bit instructions, it means you
support instructions which presumably don't do very much work, because
it's hard to express a lot of "work to do" in 16 bits.
    A bit of statistics on that.

Using a primitive Perl script to catch occurrences, on a recent
My 66000 compiler, of the shape

    [op] Ra,Ra,Rb
    [op] Ra,Rb,Ra
    [op] Ra,#n,Ra
    [op] Ra,Ra,#n
    [op] Ra,Rb

    where |n| < 32, which could be a reasonable approximation of a
    compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
    and 23.9% (GSL) of such instructions. Potential space savings
    would be a bit less than half that.

    Better compression schemes are certainly possible, but I think the
    disadvantages of having more complex encodings outweigh any
    potential savings in instruction size.
    The % associations you measured above might just be coincidence.

    I have assumed for a compiler to choose between two instruction formats,
    a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
    that the register allocator would check if either operand was alive after
    the OP, and if not then that source register can be reused as the dest.
For some ISAs that may allow a shorter instruction format to be used.

    Compilers will try to re-use registers as much as possible, in
    other words, to avoid dead registers. If the compiler determines
    that, for the pseudo registers V1, V2 and V3,

    V1 = V2 - V3;

    V2 is no longer live after that statement, it will assign
    the same hard register to V1 and V2 (unless there are other
    considerations such as function return values) which will then
    either be translated into

    add r1,r1,-r2

    for a three-register instruction, or, for example, into

    subq %rsi, %rax

    Hmm... thinking of the statistics above, maybe I should have
    included the minus signs.

    Your stats above assume the compiler is performing this optimization
    but since My 66000 does not have short format instructions the compiler
    would have no reason to do so. Or the compiler might be doing this
    optimization anyways for other ISA such as x86/x64 which do have
    shorter formats.

    So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3-register formats like RV, where there
is an incentive to do this optimization, might provide stats confirmation.

    RISC-V compressed mode also uses three-bit register numbers for
    popular registers, all of which complicates decoding and causes
    other problems which Mitch has explained previously.

    So yes, a My 66000-like instruction set with compression might be
    possible, but would almost certainly not be realized.

    I wasn't suggesting My 66000 should be using compressed instructions.
Just that the stats you got from its obj code might not apply
    because extra optimization work is required to coerce values to be
    in the right registers so it can take advantage of size compression.

    The packing algorithm I described above should work for my ISA because I
    have the same number of registers, 16, in all instruction size variations.

The packing algorithm for RV or similar is more complicated because
it uses different register set sizes: RV has 31 registers (plus a zero
reg) in the full formats, or the 8 registers x8-x15 in the 3-bit
compressed fields.
After identifying which variables die in each operation,
the register allocator has to "arrange" that the dying variable lands
in one of those 8 registers so that the dying op can reuse the register.
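
For example (RV64C; c.and needs both registers in the x8-x15 window,
so the second form assembles to 2 bytes while the first stays at 4):

    and t0, t0, t1   # t0=x5, t1=x6: outside x8-x15, no C form
    and s0, s0, s1   # s0=x8, s1=x9: becomes c.and s0, s1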

    This also interacts with the ABI in passing register arguments.
    E.g. maybe it gets better compression if the stack pointer is
    one of those 7 registers. Also remember the RV ABI needs I think
    1 or 2 registers for building a 32-bit RIP-relative branch address.

    Compact instructions may also have smaller immediate fields.
    An optimizer might try to arrange stack variable offset locations
    so that the more frequent ones use the small offsets.

A bit of poking about finds a number of research papers on compact ISAs.
This one explores their improved algorithm on LLVM.
Fig. 2 shows the % share of compact instructions before and after
their compression-aware algorithms. Just eye-balling the graph I'd
say 50% of RV instructions used compressed format.

Register Allocation for Compressed ISAs in LLVM (2023)
https://dl.acm.org/doi/pdf/10.1145/3578360.3580261


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Dec 31 15:25:18 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 30 Dec 2025 17:27:22 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I did not write anything about the clue of Apple. I don't know much
    about the CPUs by Nvidia and Fujitsu. But if there was significant
    performance to be had by adding a weakly-ordered mode, wouldn't
    especially Fujitsu with its supercomputer target have done it?


Fujitsu had very strong reason to implement TSO on A64FX - source-level
compatibility with SPARC64 VIIIfx and XIfx.

    Apple also had good reason to implement TSO on M1: AMD64->ARM A64
    binary translation (Rosetta). They chose to add a slower TSO mode to
    their weak memory system, which is not surprising given that they had
    a working weak memory system, and it is relatively easy to implement
    TSO on that (with a performance penalty).

    I wouldn't be surprised if apart from that they have SPARC->ARM Rosetta
for some customers, but that's a relatively minor factor. Supercomputer
    users are considered willing to recompile their code. But much less
    willing to re-write it.

    While supercomputer users may not be particularly willing to rewrite
    their code, they are much more willing than anyone else, because in supercomputing, hardware cost still is higher than software cost.

    If there was an easy way to offer "5-10% more performance" to those
    users willing to write or use software written for weak memory models
    by adding a weak memory mode to A64FX, I would be very surprised if
    they would have passed. So I conclude that it's not easy to turn
    their memory model into a weak one and gain performance.

    Concerning their SPARC implementations: The SPARC architecture
    specifies both TSO and a weak memory model. Does your comment about
    SPARC64 VIIIfx and XIfx mean that Fujitsu only implemented TSO on
    those CPUs and that when you asked for the weak mode on those CPUs,
    you still got TSO? That would be the counterexample that Thomas
    Koenig asked for.

Besides, as I mentioned in my other post, A64fx memory subsystem is
slow (latency-wise, throughput wise it is very good).

    Sounds to me like it is designed for a supercomputer.

    I don't know what
    influence that fact has, but I can hand-wave that it shifts the balance
    of cost toward TSO.

    Can you elaborate on that?

    Also, cache lines are unusually wide (256B), so it
    is possible that RFO shortcuts allowed by weaker MOM are less feasible.

    Why should that be?

    A particular aspect here is that RFO is rare in applications with good
    temporal locality. Supercomputer applications tend to have relatively
    bad temporal locality and will see RFO more often.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Dec 31 15:50:13 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Anton Ertl [2025-12-30 17:15:59] wrote:
    Any sequence of stores without intervening loads can be turned into
    one store under sequential consistency, and therefore also under the
    weaker TSO. Doing that for a sequence that stores into one cache line
    does not appear particularly heroic to me. The question is how much
    benefit one gets from this optimization.

    But the stores may be interleaved with loads from other locations!
    It's quite common to have a situation where a sequence of stores
    initializes a new object and thus overwrites a complete cache line, but
    that initialization sequence needs to read from memory (e.g. from the
    stack).

    Maybe compilers can be taught to group such writes to try and avoid
    the problem?

    What compilers can do depends on the programming language. But this
    "read from stack" idea is curious. We have had good register
    allocators for several decades, so local variables tend to reside in
    registers, not on some stack. Parameters also tend to reside in
    registers. So if you have a C initializing function

void init_foo(foo_t *foo, long a, long b, /* ... */)
{
  size_t i;
  foo->a = a;
  foo->b = b;
  for (i=0; i<FOO_C_ELEMS; i++)
    foo->c[i] = 0;
}

    it is unlikely that there will be loads between the stores.

In other cases you can reorder the loads and stores by hand. Instead of

  x->a = y->a;
  x->b = y->b;
  ...

you can do

  long a = y->a;
  long b = y->b;
  ...
  x->a = a;
  x->b = b;
  ...

    This kind of reordering only needs to be performed where it eliminates
    many RFOs, and is much easier to get correct than not-too-slow code
    for weak memory models that has to be done everywhere where shared
    written-to memory is accessed and has to be correct everywhere (and not-too-slow everywhere that is executed frequently).
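
To make that concrete, here is a hedged sketch (field names and the
64-byte line size are assumptions, not from the post): with all loads
hoisted, the stores to the destination line are back to back, which is
exactly the shape that can be merged into a single full-line write.

  typedef struct { long a, b, c, d, e, f, g, h; } line_t;  /* 8*8 = 64B */

  void copy_line(line_t *x, const line_t *y)
  {
      /* all loads first ... */
      long a = y->a, b = y->b, c = y->c, d = y->d;
      long e = y->e, f = y->f, g = y->g, h = y->h;
      /* ... then all stores, back to back on one cache line */
      x->a = a; x->b = b; x->c = c; x->d = d;
      x->e = e; x->f = f; x->g = g; x->h = h;
  }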

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Dec 31 18:19:56 2025
    From Newsgroup: comp.arch

    On Wed, 31 Dec 2025 15:19:00 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    Do you have any results for
    Neoverse-V2, at say 2Ghz?

    No. You can find other results at

    https://www.complang.tuwien.ac.at/franz/latex-bench

    - anton

    I'd expect Neoverse N1 to be about equal to Sandy/Ivy Bridge clock for
    clock.
But looking at your database, SB/IB results are not terribly consistent.
    - Core i3-3227, 1900MHz, Lenovo Thinkpad e130, Ubuntu 13.10 64b 0.602
    - Core i7-3520M (3.6GHz Turbo), 4MB L2, Ubuntu 21.04 texlive 2021 0.50
    - Core i5-2520M, 2500MHz, 3MB L2, Ubuntu 18.04.4 0.422
    - Core i5-2380P, 3100MHz, Suse 12.1 64 bit pdfTeX (TeX Live 2012) 0.392
    - Core i7-2720QU, 2200MHz, 6MB L3, Linux Mint 11 Katya (64-bit) 0.34s
    - Core i7-2600K, 4200MHz, 8MB L3, Ubuntu 10.10 (64-bit) 0.270
    - Core i7-3930K, 4200MHz, 12MB L3, Ubuntu 12.04 (64-bit) 0.256

There is a 2.4x difference in IPC between the worst and the best.

Which one do we have to believe? That is, except for the i7-2720QU, which
is not even an official Intel SKU and where the test very obviously
mis-reported the clock frequency; it was almost certainly running at
3.3 GHz.
Even without this problematic score there is still a 1.7x difference.
Then, looking closer, the i7-3520M is also a case of mis-reported clock
frequency, but in the opposite direction. In reality, the test was running
at a much lower frequency, likely even lower than the base frequency of
this core (2900 MHz).
The rest of them are not too far from each other. So, taking the median
(i7-2600K score*freq = 1134) sounds about right.

Back to Neoverse.
V2 is an HPC-optimized core. I'd expect that the difference in latex-bench
IPC between N1 and V2 will be much smaller than reported in other
benchmarks. Probably a factor of 1.25 to 1.35 in favor of V2.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Dec 31 11:29:58 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    John Savard wrote:
    On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
    On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.

Not a typo--the part of the pipeline which is <dynamically> narrowest is
the part that limits performance. I suggest strongly that you should not
make/allow the decoder to play that part.

I agree - and strongly, too - that the decoder ought not to be the part
that limits performance.

But what I quoted says that the execution unit ought not to be the part
that limits performance, with the implication that it's OK if the decoder
does instead. That's why I said it must be a typo.

So I think you need to look a second time at what you wrote; it's natural
for people to see what they expect to see, and so I think you looked at
it, and didn't see the typo that was there.

John Savard
    There are two kinds of stalls:
    stalls in the serial front end I-cache, Fetch or Decode stages because
    of *too little work* (starvation due to input latency),
    and stalls in the back end Execute or Writeback stages because
    of *too much work* (resource exhaustion).

    DECODE latency increases when:
    a) there is no instruction(s) to decode
    b) there is no address from which to fetch
    c) when there is no translation of the fetch address

    a) is a cache miss
    b) is an indirect control transfer
    c) is a TLB miss

    And there may be additional cases of instruction buffer hiccups.

    Yes. Also Decode generated stalls - pipeline drain.
    Rename stall for new dest register pool exhaustion.

    The front end stalls inject bubbles into the pipeline,
    whereas back end stalls can allow younger bubbles to be compressed out.

How In-Order your thinking is. GBOoO machines do not inject bubbles.

    You get bubbles if you overload their resources no matter how GB it is.

    For example, if all the reservation stations for a FU are in use then
    Dispatch has to stall, which stalls the whole front end.
    A compacting pipeline in the front end can compress out those bubbles
    but it eventually stalls too.

    Dependency stalls - all the uOps in reservation stations are waiting
    on other results. Serialization stalls.

    If a design is doing dynamic register file read port assignment and
    runs out of read ports. Resource exhaustion stalls.

    Multiple uOps are ready but only one can launch. Scheduling stalls.

    If I have to stall, I want it in the back end.

    If I have to stall I want it based on "realized" latency.

    It has to do with catching up after a stall.

    Which is why you do not inject bubbles...

    It's not me doing it. I blame the speed of light.

    If a core stalls for 3 clocks, then in order to average 1 IPC
    it must retire 2 instructions per clock for the next 3 clocks.
    And it can only do that if it has a backlog of work ready to execute.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Dec 31 16:20:34 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    The packing algorithm for RV or similar is more complicated because
    it uses different register set sizes, RV is 31 or 7 with a zero reg.

    Actually, in RISC-V those compressed instructions with the 3-bit
    register field mean the registers r8-r15 or f8-f15, none of which is
    the zero register, so it is free to allocate 8 registers.

    Looking at the "18 RISC-V Compressed ISA V1.9" specification, there is
    C.LI, which expands into a 32-bit instruction (ADDI) with x0 as one
    source register; there is no field for x0 (zero) in C.LI (there are no alternative registers). There is also C.MV (with only rd and rs2
    fields), which expands into add rd, x0, rs2.
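
The expansion is mechanical; a sketch of it in C (field layout per the
RVC spec: rd in inst[11:7], rs2 in inst[6:2]; an illustration, not code
from any actual decoder):

  #include <stdint.h>

  /* Expand C.MV rd, rs2 (16-bit) into ADD rd, x0, rs2 (32-bit).
     ADD is R-type: funct7=0, rs2, rs1=x0, funct3=0, rd, opcode=0x33.
     A valid C.MV requires rd != 0 and rs2 != 0. */
  uint32_t expand_c_mv(uint16_t inst)
  {
      uint32_t rd  = (inst >> 7) & 0x1f;
      uint32_t rs2 = (inst >> 2) & 0x1f;
      return (rs2 << 20) | (rd << 7) | 0x33;
  }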

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Dec 31 16:52:22 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Looking at the "18 RISC-V Compressed ISA V1.9" specification

    Someone asked for more and dynamic numbers. This work contains them
    in Section 1.9:

    |Table 1.7 lists the standard RVC instructions with the most frequent
    |first, showing the individual contributions of those instructions to
    |static code size and then the running total for three experiments: the
    |SPEC benchmarks for both RV32C and RV64C for the Linux kernel. For
|RV32, RVC reduces static code size by 24.5% on Dhrystone and 30.9% on
|CoreMark. For RV64, it reduces static code size by 26.3% on SPECint,
    |25.8% on SPECfp, and 31.1% on the Linux kernel.
    |
    |Table 1.8 ranks the RVC instructions by order of typical dynamic
    |frequency. For RV32, RVC reduces dynamic bytes fetched by 29.2% on
    |Dhrystone and 29.3% on CoreMark. For RV64, it reduces dynamic bytes
    |fetched by 26.9% on SPECint, 22.4% on SPECfp, and 26.11% booting the
    |Linux kernel.

    If you want the tables, look at source: <https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 31 17:12:10 2025
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    BGB <cr88192@gmail.com> posted:

    On 12/30/2025 1:36 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    [...]
    One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.

    For designs like RISC-V C and Thumb2, there is always the option to
    use the uncompressed instruction. So you may tune your RISC-V
    compiler to prefer registers r8-r15 for those pseudo-registers that
    occur in instructions where such a register allocation may lead to a compressed encoding.

Write-after-read and write-after-write do not reduce the IPC of OoO
implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And
designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.


    Though, the main places where compressed instructions are likely to
    bring meaningful benefit, is on small in-order machines.

    Coincidentally; this is exactly where a fatter-ISA wins big::
    compare::

    LDD R7,[IP,R3<<3,DISP32]

    1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against

    It is only 2 words

AUIPC Rt,lo(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,hi(DISP32)
    ADD Rt,Rt,Ri
    LDD R7,0(Rt)

    5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.

    This should be::

AUIPC Rt,hi(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,Ri
    LDD R7,lo(DISP32)(Rt)

    4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum

    Any OoO machine is also likely to have a lot of RAM and a decent sized
    I$, so much of any benefit is likely to go away in this case.

    s/go away/greatly ameliorated/

    ------------------------
    ILP is a property of a program. I assume that what you mean is that
    the IPC benefits of more width have quickly diminishing returns on in-order machines.


    The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is affected by the processor implementation.

    I agree that ILP is more aligned with code than with program.
    {see above example where 1 instruction does the work of 5}
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Dec 31 19:12:25 2025
    From Newsgroup: comp.arch

    On Wed, 31 Dec 2025 15:25:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 30 Dec 2025 17:27:22 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I did not write anything about the clue of Apple. I don't know
    much about the CPUs by Nvidia and Fujitsu. But if there was
    significant performance to be had by adding a weakly-ordered mode,
    wouldn't especially Fujitsu with its supercomputer target have
    done it?

    Fujitsu had very strong reason to implement TSO on A64FX -
    source-level compatibility with SPARC64 VIIIfx and XIfx.

    Apple also had good reason to implement TSO on M1: AMD64->ARM A64
    binary translation (Rosetta). They chose to add a slower TSO mode to
    their weak memory system, which is not surprising given that they had
    a working weak memory system, and it is relatively easy to implement
    TSO on that (with a performance penalty).


M1 = almost unmodified A14. And A14, as the one that is sold in far
higher quantities, is the one which pays the development bills. Which
means that A14 was the one calling the shots.
The same goes for all generations of Apple M and A lines.

    I wouldn't be surprised if apart from that they have SPARC->ARM
Rosetta for some customers, but that's a relatively minor factor.
Supercomputer users are considered willing to recompile their code.
    But much less willing to re-write it.

    While supercomputer users may not be particularly willing to rewrite
    their code, they are much more willing than anyone else, because in supercomputing, hardware cost still is higher than software cost.


    I am not sure that it applies to Riken.
Researchers there probably treat Fugaku as a free resource.

    If there was an easy way to offer "5-10% more performance" to those
    users willing to write or use software written for weak memory models
    by adding a weak memory mode to A64FX, I would be very surprised if
    they would have passed. So I conclude that it's not easy to turn
    their memory model into a weak one and gain performance.

    Concerning their SPARC implementations: The SPARC architecture
    specifies both TSO and a weak memory model. Does your comment about
    SPARC64 VIIIfx and XIfx mean that Fujitsu only implemented TSO on
    those CPUs and that when you asked for the weak mode on those CPUs,
    you still got TSO? That would be the counterexample that Thomas
    Koenig asked for.


    As far as I know, SPARC RMO is almost exclusively a paper tiger.
It was supported in hardware during a rather short time span (from
memory, only Sun UltraSPARC 2 and 3, none of the Fujitsu chips) and in
    software only supported by Linux, never by Solaris.

    Besides, as I mentioned in my other post, A64fx memory subsystem is
    slow (latency-wise, throughput wise it is very good).

    Sounds to me like it is designed for a supercomputer.


    Yes, surely.
Maximum memory capacity per node is also way too small for
    non-supercomputer use.

    I don't know what
    influence that fact has, but I can hand-wave that it shifts the
    balance of cost toward TSO.

    Can you elaborate on that?


    I can, but I will not.
    In order to elaborate I'd have to think. And even after that it still
    would be hand-waving, because I am not a specialist.
Does not sound like a good use of mental resources.


    Also, cache lines are unusually wide (256B), so it
    is possible that RFO shortcuts allowed by weaker MOM are less
    feasible.

    Why should that be?


My understanding is that on general-purpose "big" ARM64 chips they have
several 64B merge buffers and some heuristics to decide whether a store
that misses L1D initiates an RFO straight away or first tries to occupy
a merge buffer.
With 256B lines each buffer is more expensive and the probability of
success of the merge process is much lower.
Also, A64fx has very many cores and spends so much of its available real
estate on VPUs and interconnects. Which means that its developers were
much less willing to spend silicon on uncertain features than the
developers of not only Apple and Qualcomm cores but even of the
Cortex-X series.

    A particular aspect here is that RFO is rare in applications with good temporal locality. Supercomputer applications tend to have relatively
    bad temporal locality and will see RFO more often.

    - anton

    I don't disagree.
    But in supercomputer applications a common pattern is 'good temporal
    locality, excellent spatial locality, high likelihood of absence of L1D
cache misses between merge-able stores'. That sort of spatial locality
can be relatively easily exploited even under TSO.
Not that I believe that the A64fx designers did it. As mentioned above,
they had too little silicon area to spare.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 31 17:18:24 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Anton Ertl [2025-12-30 17:15:59] wrote:
    Any sequence of stores without intervening loads can be turned into
    one store under sequential consistency, and therefore also under the
    weaker TSO. Doing that for a sequence that stores into one cache line
    does not appear particularly heroic to me. The question is how much
    benefit one gets from this optimization.

    But the stores may be interleaved with loads from other locations!
It's quite common to have a situation where a sequence of stores
initializes a new object and thus overwrites a complete cache line, but
that initialization sequence needs to read from memory (e.g. from the
stack).

    Maybe compilers can be taught to group such writes to try and avoid
    the problem?

    What compilers can do depends on the programming language. But this
    "read from stack" idea is curious. We have had good register
    allocators for several decades, so local variables tend to reside in registers, not on some stack. Parameters also tend to reside in
    registers. So if you have a C initializing function

void init_foo(foo_t *foo, long a, long b, /* ... */)
{
  size_t i;
  foo->a = a;
  foo->b = b;
  for (i=0; i<FOO_C_ELEMS; i++)
    foo->c[i] = 0;
}

    it is unlikely that there will be loads between the stores.

In other cases you can reorder the loads and stores by hand. Instead of

  x->a = y->a;
  x->b = y->b;
  ...

you can do

  long a = y->a;
  long b = y->b;
  ...
  x->a = a;
  x->b = b;
  ...

I have advocated for this style of programming for 40 years.
It specifically gives the compiler the memory order in ways that
the above does not. In compilers from the 1980s it often resulted
in SMALLER code from larger ASCII !!

    This kind of reordering only needs to be performed where it eliminates
    many RFOs, and is much easier to get correct than not-too-slow code
    for weak memory models that has to be done everywhere where shared
    written-to memory is accessed and has to be correct everywhere (and not-too-slow everywhere that is executed frequently).

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 31 17:23:37 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    John Savard wrote:
    On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
    On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.

Not a typo--the part of the pipeline which is <dynamically> narrowest is
the part that limits performance. I suggest strongly that you should not
make/allow the decoder to play that part.

I agree - and strongly, too - that the decoder ought not to be the part
that limits performance.

But what I quoted says that the execution unit ought not to be the part
that limits performance, with the implication that it's OK if the decoder
does instead. That's why I said it must be a typo.

So I think you need to look a second time at what you wrote; it's natural
for people to see what they expect to see, and so I think you looked at
it, and didn't see the typo that was there.

John Savard
    There are two kinds of stalls:
    stalls in the serial front end I-cache, Fetch or Decode stages because
    of *too little work* (starvation due to input latency),
    and stalls in the back end Execute or Writeback stages because
    of *too much work* (resource exhaustion).

    DECODE latency increases when:
    a) there is no instruction(s) to decode
    b) there is no address from which to fetch
    c) when there is no translation of the fetch address

    a) is a cache miss
    b) is an indirect control transfer
    c) is a TLB miss

    And there may be additional cases of instruction buffer hiccups.

    Yes. Also Decode generated stalls - pipeline drain.
    Rename stall for new dest register pool exhaustion.

    The front end stalls inject bubbles into the pipeline,
    whereas back end stalls can allow younger bubbles to be compressed out.

How In-Order your thinking is. GBOoO machines do not inject bubbles.

    You get bubbles if you overload their resources no matter how GB it is.

    For example, if all the reservation stations for a FU are in use then Dispatch has to stall, which stalls the whole front end.

    These are not "bubbles"
    These are "window Full" stalls

    A compacting pipeline in the front end can compress out those bubbles
    but it eventually stalls too.

    DECODE still has no place to put the instructions.

    Dependency stalls - all the uOps in reservation stations are waiting
    on other results. Serialization stalls.

Latency stalls, or you can call them RAW stalls.

    If a design is doing dynamic register file read port assignment and
    runs out of read ports. Resource exhaustion stalls.

    Yes, that is why I don't like reorder buffers so much.
    Renamers are an issue as you generally have more rename port
    requirements than Read requirements--BECAUSE DECODE has wider
    BW than the execution machinery.

    Multiple uOps are ready but only one can launch. Scheduling stalls.

    If I have to stall, I want it in the back end.

    If I have to stall I want it based on "realized" latency.

    It has to do with catching up after a stall.

    Which is why you do not inject bubbles...

    It's not me doing it. I blame the speed of light.

    It seems our verbology is not aligned.

    If a core stalls for 3 clocks, then in order to average 1 IPC
    it must retire 2 instructions per clock for the next 3 clocks.
    And it can only do that if it has a backlog of work ready to execute.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Dec 31 19:55:28 2025
    From Newsgroup: comp.arch

    On Wed, 31 Dec 2025 15:25:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 30 Dec 2025 17:27:22 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I did not write anything about the clue of Apple. I don't know
    much about the CPUs by Nvidia and Fujitsu. But if there was
    significant performance to be had by adding a weakly-ordered mode,
    wouldn't especially Fujitsu with its supercomputer target have
    done it?

    Fujitsu had very strong reason to implement TSO on A64FX -
    source-level compatibility with SPARC64 VIIIfx and XIfx.


    BTW, can you find a proof link for A64FX being TSO.
    My understanding is that you learned it from Jonathan Corbet who in turn learned it from Hector Martin.
    But what is the source of Hector Martin?
    I certainly don't see it in Fujitsu's "A64FX Microarchitecture Manual"
    or in the Datasheet.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Dec 31 18:14:20 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    BTW, can you find a proof link for A64FX being TSO.
My understanding is that you learned it from Jonathan Corbet who in turn
learned it from Hector Martin.

    Correct.

    But what is the source of Hector Martin?
    I certainly don't see it in Fujitsu's "A64FX Microarchitecture Manual"
    or in the Datasheet.

    In that case, I would ask Hector Martin, if I wanted proof.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Dec 31 20:56:49 2025
    From Newsgroup: comp.arch

    On Wed, 31 Dec 2025 18:14:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    BTW, can you find a proof link for A64FX being TSO.
    My understanding is that you learned it from Jonathan Corbet who in
    turn learned it from Hector Martin.

    Correct.

    But what is the source of Hector Martin?
    I certainly don't see it in Fujitsu's "A64FX Microarchitecture
    Manual" or in the Datasheet.

    In that case, I would ask Hector Martin, if I wanted proof.

    - anton

    I don't know Hector Martin. Is he on Usenet?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Dec 31 19:09:35 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    I don't know Hector Martin. Is he on Usenet?

    Unlikely. <https://marcan.st/about/> lists email, IRC, and Mastodon
    as ways of contact.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Dec 31 13:03:11 2025
    From Newsgroup: comp.arch

    On 12/30/2025 9:21 PM, BGB wrote:
    On 12/30/2025 4:58 PM, Chris M. Thomasson wrote:
    On 12/30/2025 11:10 AM, BGB wrote:
    [...]

    But, then again, weak model is cheaper to implement and generally
    faster, although explicit synchronization is annoying and such a
    model is incompatible with "lock free data structures" (which tend to
    implicitly assume that memory accesses occur in the same order as
    written and that any memory stores are immediately visible across
    threads).

    Fwiw, a weak memory model is totally compatible with lock-free data
    structures. A weak model tends to have the necessary memory barriers
    to make them work. Have you ever used a SPARC in RMO mode? Acquire
    membar ala std::memory_order_acquire is basically a MEMBAR #LoadStore
    | #LoadLoad. A release is MEMBAR #LoadStore | #StoreStore. Those can
    be used for the implementation of a mutex. Notice how acquire and
    release never need #StoreLoad ordering?

    The point is that once we have this flexibility, a lock/wait free algo
    can use the right membars for the job. Ideally, the weakest membars
    they can use to ensure they are correct in their logic.


    Usually IME the people writing lock-free code don't use memory barriers
    or similar though. A lot of times IME, it is people just using volatile
or similar and trying to write things in a way that it (hopefully) won't
    go terribly wrong if two threads hit the same data at the same time.

    Like, the sort of code that works on a PC running Windows or similar,
    but try to port it to Linux on an ARM machine, and it explodes.


    Where, say, using volatile isn't sufficient for multiple cores with a
    weak model. One would need either to use barriers (though, in my case, barriers will also be slow), non-cached memory accesses, or explicit cache-line flushing.


    In this case, this leaves it often preferable to use bulk mostly read-
    only data sharing. Or, passing along data via buffers or messages (with
    some level of basic flow control).

    So, not so much "lets have two threads share a doubly-linked list and
    hope it doesn't all turn into a train wreck", and more "will copy
    messages onto the end of a circular buffer and advance the roving
    pointers; manually flushing the lines corresponding to the parts of the buffer than have been updated in the process".

    Say, for example:
  void _flushbuffer(void *data, size_t sz)
  {
     char *ct, *cte;
     ct=data; cte=ct+sz;
     while(ct<cte)
       { __mem_flushaddr(ct); ct+=LINESIZE; }
  }
  void _memcpyout_flush(void *dst, void *src, size_t sz)
  {
     memcpy(dst, src, sz);
     _flushbuffer(dst, sz);
  }
  void _memcpyin_flush(void *dst, void *src, size_t sz)
  {
     _flushbuffer(src, sz);
     memcpy(dst, src, sz);
  }
  void _memcpy_flush(void *dst, void *src, size_t sz)
  {
     _flushbuffer(src, sz);
     memcpy(dst, src, sz);
     _flushbuffer(dst, sz);
  }

    Where, in this case, normal memcpy + flushing is likely to be faster in
    many cases than using non-cached memory.

    Huh? Humm... This is way out of bounds. Yikes. I am talking about
    knowing when to use the right membars in the right places. Are you at
    least familiar with std::memory_order_* in C++?

    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Dec 31 13:11:19 2025
    From Newsgroup: comp.arch

    On 12/30/2025 3:51 PM, Chris M. Thomasson wrote:
    On 12/30/2025 12:08 PM, MitchAlsup wrote:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Michael S wrote:
    On Tue, 30 Dec 2025 10:44:10 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.

    TSO is cycle-wasteful.
    This has been experimentally verified for Apple Silicon:

https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf

WOW, they wrote an article of 7 pages without even once mentioning
avoidance of RFO (read for ownership), which is the elephant in the
room of any discussion of the advantages of Arm MOM/MCM over TSO.
    What is this "avoidance of RFO"?
    I can find no mention of it anywhere.

    Imagine code that overwrites the whole cache line that core initially
    did not own. Under TSO rules, like x86, the only [non heroic] ways to
    overwrite the line without reading its previous content (which could
    easily mean reading from DRAM) are
    - aligned AVX512 store
    - rep movs/rep stos

With a weaker MOM the core has the option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like Neoverse
N1, are able to do it.

    I can imagine heroic microarchitecture that achieves the same effect
    with TSO, but it seems that so far nobody did it.

    I don't see how a ReadForOwnership message can be avoided as it
    transfers two things: the ownership state, and the current line data.

    InvalidateForOwnership {i.e., CI followed by Allocate Cache line and
    start writing}

    Even if the core knows the whole cache line is being overwritten and
    doesn't need the line data, it still needs the Owned state transfer.

Which it can get by telling everyone else to lose that cache line.

    There would still be a request message, say TakeOwner TKO which
    has a smaller reply GiveOwner GVO message and just moves the state.
    So the reply is a few less flits.

As I understand it...
Independent of the ReadForOwnership message, the ARM weak coherence model
should allow stores to other cache lines to proceed, whereas TSO would
require younger stores to (appear to) wait until the older store
completes.

    Which is why TSO is cycle wasteful.

    Wrt to TSO aka x86/x64..., even that is NOT strong enough to get
    #StoreLoad, aka ordering a store followed by a load to another location.
    You need a LOCK'ed RMW or the MFENCE instruction.



The weak coherence model allows the cache to use hit-under-miss for
stores because it doesn't require the store order to different locations
be seen in program order. This allows it to overlap younger store cache
hits with the older ReadForOwnership message, not eliminate it.




    Think of something simple. You want to publish a pointer to other threads.

foo* global = nullptr;

producer:
    foo* p = create();
    membar_release();
    atomic_store(&global, p);

consumers:
    foo* p = atomic_load(&global);
    if (p)
    {
        membar_acquire();
        p->bar();
    }


    A simple pattern...
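
(In portable C11 atomics the same pattern reads as below; create and
foo_bar stand in for the construction and use of the object, so this
is a sketch rather than a drop-in.)

  #include <stdatomic.h>
  #include <stddef.h>

  typedef struct foo foo;
  extern foo *create(void);
  extern void foo_bar(foo *p);

  static _Atomic(foo *) global = NULL;

  void producer(void)
  {
      foo *p = create();
      /* release: everything written while building *p is visible
         before the pointer itself becomes visible */
      atomic_store_explicit(&global, p, memory_order_release);
  }

  void consumer(void)
  {
      /* acquire: pairs with the release store above */
      foo *p = atomic_load_explicit(&global, memory_order_acquire);
      if (p)
          foo_bar(p);
  }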

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Dec 31 13:13:06 2025
    From Newsgroup: comp.arch

    On 12/31/2025 1:11 PM, Chris M. Thomasson wrote:
    On 12/30/2025 3:51 PM, Chris M. Thomasson wrote:
    On 12/30/2025 12:08 PM, MitchAlsup wrote:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Michael S wrote:
    On Tue, 30 Dec 2025 10:44:10 -0500
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Michael S wrote:
    On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.

    TSO is cycle-wasteful.
    This has been experimentally verified for Apple Silicon:

https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf

WOW, they wrote an article of 7 pages without even once mentioning
avoidance of RFO (read for ownership), which is the elephant in the
room of any discussion of the advantages of Arm MOM/MCM over TSO.
    What is this "avoidance of RFO"?
    I can find no mention of it anywhere.

Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to
overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
    - aligned AVX512 store
    - rep movs/rep stos

With a weaker MOM the core has the option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like Neoverse
N1, are able to do it.

I can imagine heroic microarchitecture that achieves the same effect
with TSO, but it seems that so far nobody did it.

    I don't see how a ReadForOwnership message can be avoided as it
    transfers two things: the ownership state, and the current line data.

    InvalidateForOwnership {i.e., CI followed by Allocate Cache line and
    start writing}

    Even if the core knows the whole cache line is being overwritten and
    doesn't need the line data, it still needs the Owned state transfer.

Which it can get by telling everyone else to lose that cache line.

    There would still be a request message, say TakeOwner TKO which
    has a smaller reply GiveOwner GVO message and just moves the state.
    So the reply is a few less flits.

As I understand it...
Independent of the ReadForOwnership message, the ARM weak coherence model
should allow stores to other cache lines to proceed, whereas TSO would
require younger stores to (appear to) wait until the older store
completes.

    Which is why TSO is cycle wasteful.

    Wrt to TSO aka x86/x64..., even that is NOT strong enough to get
    #StoreLoad, aka ordering a store followed by a load to another
    location. You need a LOCK'ed RMW or the MFENCE instruction.



The weak coherence model allows the cache to use hit-under-miss for
stores because it doesn't require the store order to different locations
be seen in program order. This allows it to overlap younger store cache
hits with the older ReadForOwnership message, not eliminate it.




    Think of something simple. You want to publish a pointer to other threads.

foo* global = nullptr;

producer:
    foo* p = create();
    membar_release();
    atomic_store(&global, p);

consumers:
    foo* p = atomic_load(&global);
    if (p)
    {
        membar_acquire();
        p->bar();
    }


    A simple pattern...


    Now in this case, depending on what your needs are, a membar_consume()
    can be used instead of acquire. I have a LOT of experience with these.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Dec 31 23:01:35 2025
    From Newsgroup: comp.arch

    On 2025-12-31 12:12 p.m., MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    BGB <cr88192@gmail.com> posted:

    On 12/30/2025 1:36 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    [...]
    One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.

    For designs like RISC-V C and Thumb2, there is always the option to
    use the uncompressed instruction. So you may tune your RISC-V
    compiler to prefer registers r8-r15 for those pseudo-registers that
    occur in instructions where such a register allocation may lead to a
    compressed encoding.

Write-after-read and write-after-write do not reduce the IPC of OoO
    implementations. On the contrary, write-after-read may be beneficial
    by releasing the old physical register for the register name. And
    designing a compressed CPU instruction set for in-order processing is
    not a good idea for general-purpose computing.


    Though, the main places where compressed instructions are likely to
    bring meaningful benefit, is on small in-order machines.

    Coincidentally; this is exactly where a fatter-ISA wins big::
    compare::

    LDD R7,[IP,R3<<3,DISP32]

    1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against

    It is only 2 words

AUIPC Rt,lo(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,hi(DISP32)
    ADD Rt,Rt,Ri
    LDD R7,0(Rt)

    5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.

    This should be::

AUIPC Rt,hi(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,Ri
    LDD R7,lo(DISP32)(Rt)

    4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum

    An even fatter ISA (Qupls4) in theory:

    LOAD r7, disp56(ip+r3*8)

    1 instruction + 1 postfix = 2 words (96 bits) 1 cycle + cache hit minimum

The ISA is becoming a bit more stable now; the latest change was for
constant postfix instructions. Qupls used to have a somewhat convoluted
means of addressing constants on the cache-line. Now it's just
postfixes. The constant routing information is in the postfix now, which
uses four bits: two to select a register override, two to select the
constant quadrant. So, postfixes extend constants in the instruction (or
previous postfix) by 36 bits.

    Qupls can do
    ADD r7, r8, $64_bit_constant

    Using only two words (96 bits) and just a single cycle.

I prefer to use multiply '*' rather than shift in scaled indexed
addressing, as a couple of CPUs had multiply by five and ten in addition
to 1,2,4,8. What if one wants to scale by 3?

    It is also possible to encode 128-bit constants, but the current implementation does not support them.

    Managed to get to some early synthesis trials and found the instruction dispatch to be on the critical timing path. I am a bit stumped as to how
    to improve it as it is very simple already. It just copies from one set
    of pipeline registers to another headed towards the reservation
    stations. Tools report timing good to 37 MHz, I was shooting for at
    least 40.

    Found a couple of spots where the code was simple but too slow. One in
    dynamic register selection. The code was packing the register selections
    to a minimum. But that was way too many logic levels.

    It is quite an art to get something working in minimum clock cycles and
    fast clock frequency.


    Any OoO machine is also likely to have a lot of RAM and a decent sized
    I$, so much of any benefit is likely to go away in this case.

    s/go away/greatly ameliorated/

    ------------------------
    ILP is a property of a program. I assume that what you mean is that
    the IPC benefits of more width have quickly diminishing returns on
    in-order machines.


    The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is affected by the processor implementation.
    I agree that ILP is more aligned with code than with program.
    {see above example where 1 instruction does the work of 5}

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jan 1 01:07:03 2026
    From Newsgroup: comp.arch

    On 12/31/2025 3:03 PM, Chris M. Thomasson wrote:
    On 12/30/2025 9:21 PM, BGB wrote:
    On 12/30/2025 4:58 PM, Chris M. Thomasson wrote:
    On 12/30/2025 11:10 AM, BGB wrote:
    [...]

    But, then again, weak model is cheaper to implement and generally
    faster, although explicit synchronization is annoying and such a
    model is incompatible with "lock free data structures" (which tend
    to implicitly assume that memory accesses occur in the same order as
    written and that any memory stores are immediately visible across
    threads).

    Fwiw, a weak memory model is totally compatible with lock-free data
    structures. A weak model tends to have the necessary memory barriers
    to make them work. Have you ever used a SPARC in RMO mode? Acquire
    membar ala std::memory_order_acquire is basically a MEMBAR #LoadStore
    | #LoadLoad. A release is MEMBAR #LoadStore | #StoreStore. Those can
    be used for the implementation of a mutex. Notice how acquire and
    release never need #StoreLoad ordering?

    The point is that once we have this flexibility, a lock/wait free
    algo can use the right membars for the job. Ideally, the weakest
    membars they can use to ensure they are correct in their logic.


    Usually IME the people writing lock-free code don't use memory
    barriers or similar though. A lot of times IME, it is people just
    using volatile or similar and trying to write things in a way that it
(hopefully) won't go terribly wrong if two threads hit the same data at
    the same time.

    Like, the sort of code that works on a PC running Windows or similar,
    but try to port it to Linux on an ARM machine, and it explodes.


    Where, say, using volatile isn't sufficient for multiple cores with a
    weak model. One would need either to use barriers (though, in my case,
    barriers will also be slow), non-cached memory accesses, or explicit
    cache-line flushing.


    In this case, this leaves it often preferable to use bulk mostly read-
    only data sharing. Or, passing along data via buffers or messages
    (with some level of basic flow control).

    So, not so much "lets have two threads share a doubly-linked list and
    hope it doesn't all turn into a train wreck", and more "will copy
    messages onto the end of a circular buffer and advance the roving
    pointers; manually flushing the lines corresponding to the parts of
    the buffer than have been updated in the process".

    Say, for example:
   void _flushbuffer(void *data, size_t sz)
   {
      char *ct, *cte;
      ct=data; cte=ct+sz;
      while(ct<cte)
        { __mem_flushaddr(ct); ct+=LINESIZE; }
   }
   void _memcpyout_flush(void *dst, void *src, size_t sz)
   {
      memcpy(dst, src, sz);
      _flushbuffer(dst, sz);
   }
   void _memcpyin_flush(void *dst, void *src, size_t sz)
   {
      _flushbuffer(src, sz);
      memcpy(dst, src, sz);
   }
   void _memcpy_flush(void *dst, void *src, size_t sz)
   {
      _flushbuffer(src, sz);
      memcpy(dst, src, sz);
      _flushbuffer(dst, sz);
   }

    Where, in this case, normal memcpy + flushing is likely to be faster
    in many cases than using non-cached memory.

    Huh? Humm... This is way out of bounds. Yikes. I am talking about
    knowing when to use the right membars in the right places. Are you at
    least familiar with std::memory_order_* in C++?


    I haven't really looked into C++ here.
    As can be noted, currently BGBCC only really supports C.
    Had started trying to add C++, but mostly fizzled out at early 1990s
    levels (and I tend to not really use C++ all that often).


    But, as can be noted:
The way I implemented stuff, my CPU doesn't really have hardware-level
support for memory barriers either.

    Basically, the only things available are:
    No-cache memory accesses;
    Explicit cache flushing.

    The cache flushing isn't automatic, rather it requires using some sort
    of loop to evict cache lines. If you use FENCE or similar, it is trap-and-emulate, but this is slow. Could have memory barriers as
    runtime calls, or as a language level feature, but they still have the
    same inherent limitations.

    Arguably, yes, kinda poor, but was cheap-ish to implement.

    Maybe it is "weak model", or maybe some sort of "extra weak" model, dunno...



    Well, can't say too much more at the moment. As part of new-year's
    thing, drank a bit of alcohol, and am kinda feeling the effects (so, not
    in the right cognitive state at the moment for much technical thought;
    makes thinking about stuff a little more difficult).

    Well, errm, Happy New Year's, everyone...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 1 18:13:13 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-12-31 12:12 p.m., MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    BGB <cr88192@gmail.com> posted:

    On 12/30/2025 1:36 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    [...]
    One usual downside is that to utilize a 16-bit ISA with a smaller
    register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.

    For designs like RISC-V C and Thumb2, there is always the option to
    use the uncompressed instruction. So you may tune your RISC-V
    compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a
compressed encoding.

Write-after-read and write-after-write do not reduce the IPC of OoO
implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And
designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.


    Though, the main places where compressed instructions are likely to
    bring meaningful benefit, is on small in-order machines.

    Coincidentally; this is exactly where a fatter-ISA wins big::
    compare::

    LDD R7,[IP,R3<<3,DISP32]

    1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against

    It is only 2 words

AUIPC Rt,lo(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,hi(DISP32)
    ADD Rt,Rt,Ri
    LDD R7,0(Rt)

    5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.

    This should be::

AUIPC Rt,hi(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,Ri
    LDD R7,lo(DISP32)(Rt)

    4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum

    An even fatter ISA (Qupls4) in theory:

    LOAD r7, disp56(ip+r3*8)

    I could have shown the DISP64 version--3-words

    1 instruction + 1 postfix = 2 words (96 bits) 1 cycle + cache hit minimum

The ISA is becoming a bit more stable now; the latest change was for
constant postfix instructions. Qupls used to have a somewhat convoluted
means of addressing constants on the cache-line. Now it's just
postfixes. The constant routing information is in the postfix now, which
uses four bits: two to select a register override, two to select the
constant quadrant. So, postfixes extend constants in the instruction (or
previous postfix) by 36 bits.

    Qupls can do
    ADD r7, r8, $64_bit_constant

    Using only two words (96 bits) and just a single cycle.

    So can My 66000, but everyone and his brother thinks 96-bits is 3 words.

I prefer to use multiply '*' rather than shift in scaled indexed
addressing, as a couple of CPUs had multiply by five and ten in addition
to 1,2,4,8. What if one wants to scale by 3?

    If you have the bits, why not.

    It is also possible to encode 128-bit constants, but the current implementation does not support them.

    Managed to get to some early synthesis trials and found the instruction dispatch to be on the critical timing path. I am a bit stumped as to how
    to improve it as it is very simple already. It just copies from one set
    of pipeline registers to another headed towards the reservation
    stations. Tools report timing good to 37 MHz, I was shooting for at
    least 40.

    Found a couple of spots where the code was simple but too slow. One in dynamic register selection. The code was packing the register selections
    to a minimum. But that was way too many logic levels.

    Those are some of the driving inputs to "An architecture is as much about
    what gets left out as what gets put in."

    It is quite an art to get something working in minimum clock cycles and
    fast clock frequency.


Any OoO machine is also likely to have a lot of RAM and a decent sized
I$, so much of any benefit is likely to go away in this case.

    s/go away/greatly ameliorated/

    ------------------------
ILP is a property of a program. I assume that what you mean is that
the IPC benefits of more width have quickly diminishing returns on
in-order machines.


    The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is affected by the processor implementation.

    I agree that ILP is more aligned with code than with program.
    {see above example where 1 instruction does the work of 5}

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jan 1 17:17:30 2026
    From Newsgroup: comp.arch

    On 1/1/2026 12:13 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-12-31 12:12 p.m., MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    BGB <cr88192@gmail.com> posted:

    On 12/30/2025 1:36 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    [...]
    One usual downside is that to utilize a 16-bit ISA with a smaller
    register space, one needs to reuse registers more frequently, which then
    reduces ILP due to register conflicts. So, smaller code at the expense
    of worse performance.

    For designs like RISC-V C and Thumb2, there is always the option to
    use the uncompressed instruction. So you may tune your RISC-V
    compiler to prefer registers r8-r15 for those pseudo-registers that
    occur in instructions where such a register allocation may lead to a
    compressed encoding.

    Write-after-read and write-after-write do not reduce the IPC of OoO
    implementations. On the contrary, write-after-read may be beneficial
    by releasing the old physical register for the register name. And
    designing a compressed CPU instruction set for in-order processing is
    not a good idea for general-purpose computing.


    Though, the main place where compressed instructions are likely to
    bring meaningful benefit is on small in-order machines.

    Coincidentally; this is exactly where a fatter-ISA wins big::
    compare::

    LDD R7,[IP,R3<<3,DISP32]

    1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against

    It is only 2 words

    AUPIC Rt,lo(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,hi(DISP32)
    ADD Rt,Rt,Ri
    LDD R7,0(Rt)

    5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.

    This should be::

    AUPIC Rt,hi(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,Ri
    LDD R7,lo(DISP32)(Rt)

    4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum

    An even fatter ISA (Qupls4) in theory:

    LOAD r7, disp56(ip+r3*8)

    I could have shown the DISP64 version--3-words


    At 64-bits, displacements cease to make sense as a displacement.
    Seems to make more sense to interpret these as [Abs64+Rb] rather than [Rb+Disp64].

    Except, then I have to debate what exactly I would do if I decide to
    allow this case in XG2/XG3.


    As noted:
    [Rb+Disp10]: Usually scaled (excluding some special-cases);
    [Rb+Disp33]: Either scaled or unscaled.
    BGBCC is typically using unscaled displacements in this case.
    Unscaled range, +/- 4GB
    DW: +/- 16GB, QW: +/- 32GB
    XG2 and XG3 effectively left 1 bit extra, which indicates scale.
    0: Scaled by element size;
    1: Unscaled.
    [Rb+Disp64]: Would be understood as unscaled.
    TBD: Scale register (more likely to be useful, breaks symmetry);
    Unscaled register, preserves symmetry, but less likely useful.
    Would be consistent with the handling of RISC-V,
    which is always unscaled in this case.
    May be moot, as plain Abs64 would be the dominant case here.
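
    As a minimal sketch of the scaled/unscaled rule in the list above
    (sz_log2 being log2 of the element size, and the one-bit selector
    being the XG2/XG3 scale bit described; illustrative, not the actual
    decoder):

      #include <stdint.h>

      static uint64_t effective_address(uint64_t rb, int64_t disp,
                                        unsigned sz_log2, int unscaled)
      {
          /* scale bit 0: displacement counts elements; 1: raw bytes */
          int64_t byte_disp = unscaled ? disp
                                       : disp * (int64_t)(1ULL << sz_log2);
          return rb + (uint64_t)byte_disp;
      }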


    1 instruction + 1 postfix = 2 words (96 bits) 1 cycle + cache hit minimum

    The ISA is becoming a bit more stable now; the latest change was for
    constant postfix instructions. Qupls used to have a somewhat convoluted
    means of addressing constants on the cache-line. Now it's just
    postfixes. The constant routing information is in the postfix now which
    uses four bits. Two to select a register override, two to select
    constant quadrant. So, postfixes extend constants in the instruction (or
    previous postfix) by 36 bits.

    Qupls can do
    ADD r7, r8, $64_bit_constant

    Using only two words (96 bits) and just a single cycle.

    So can My 66000, but everyone and his brother thinks 96-bits is 3 words.


    So can XG2 and XG3.
    And, now, can add RV+JX to this category.

    Though, I am likely to still consider 96-bit ops as an extension of JX
    (as supporting them would be a much bigger burden on a 2-wide machine
    with a 64-bit instruction fetch; would require a 2-wide machine to still support 96-bit fetch).

    Well, and then there is another issue:
    RV64GC + 96-bit encodings, reintroduces another potential problem that
    existed in XG1:
    At certain alignments, the 96-bit fetch can cross a boundary of 2
    half-line fetches with a 16B line size.

    Say, one letter per 16-bit word:
    AAAA-BBBB //Line A
    CCCC-DDDD //Line B
    Then (low 4b of PC):
    0: AAAABB
    2: AAABBB
    4: AABBBB
    6: ABBBBC //Violates two half-lines
    8: BBBBCC
    A: BBBCCC
    C: BBCCCC
    E: BCCCCD //Violates two half-lines

    Granted, the partial workaround is to fetch 144 bits internally (16-bits
    past the end of the half-line); which does technically "fix" the problem
    as far as architecturally-visible behavior is concerned.
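
    For what it's worth, the two "Violates" rows above reduce to a
    one-line predicate (assuming 8-byte half-lines and a 12-byte (96-bit)
    fetch, per the table; illustrative only):

      #include <stdint.h>

      static int spans_three_halflines(uint64_t pc)
      {
          uint64_t off = pc & 7;  /* byte offset within the 8-byte half-line */
          return off + 12 > 16;   /* 12-byte fetch runs past the next
                                     half-line too; true only at off==6
                                     for 16-bit aligned PCs */
      }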

    Or, just use the same "small brain" trick that BGBCC had used:
    If free-form variable length instructions, insert a NOP pad if we step
    on this turd;
    Or, for code sequences where this turd would be unavoidable (running
    through the WEXifier): Realign to 32 bits before entering WEX encoding (scenario can't happen if 32-bit aligned).


    Arguably, the latter scenario wouldn't have applied to RISC-V (and my JX
    encodings), except that (very recently) I did end up expanding BGBCC's
    WEXifier mechanism to cover RISC-V and XG3 (even if its role is slightly
    different in this case), but it does technically reintroduce the issue
    if targeting RV64GC.

    Though, currently, it is only enabled for RV64 if using RV64G and speed optimization.

    In this case, since RV64G and XG3 don't use explicit bundling, its role
    is instead to shuffle instructions to try to optimize how they fit in
    the pipeline.



    I prefer to use multiply "*" rather than shift in scaled indexed
    addressing as a couple of CPUs had multiply by five and ten in addition
    to 1,2,4,8. What if one wants to scale by 3?

    If you have the bits, why not.


    Higher resource cost and latency is a concern...


    It is also possible to encode 128-bit constants, but the current
    implementation does not support them.

    Managed to get to some early synthesis trials and found the instruction
    dispatch to be on the critical timing path. I am a bit stumped as to how
    to improve it as it is very simple already. It just copies from one set
    of pipeline registers to another headed towards the reservation
    stations. Tools report timing good to 37 MHz, I was shooting for at
    least 40.

    Found a couple of spots where the code was simple but too slow. One in
    dynamic register selection. The code was packing the register selections
    to a minimum. But that was way too many logic levels.

    Those are some of the driving inputs to "An architecture is as much about what gets left out as what gets put in."


    Some amount of stuff I had added has ended up getting pruned again.

    In my case, the FPGA doesn't get bigger, nor faster.
    Adding a feature in one place may mean needing to drop some other
    lesser-used feature to free up resources or improve timing.

    Sometimes, the features don't get entirely removed from the ISA, but
    instead become a sort of lesser used "secondary feature":
    May be supported in hardware;
    Or, may be supported with trap-and-emulate.


    Sometimes the features dropped are subtle edge cases, say:
    For JAL and JALR in RISC-V:
    Is Xd ever anything other than X0 or X1?
    In theory, yes.
    In practice: Not so much.
    Starts to make sense to hard-wire the hardware support to only allow X0
    and X1 and to treat other cases as trap-and-emulate.

    Or, some other stuff within RV64G:
    In general, it makes sense, but even within RV64G there is stuff where it is debatable whether it makes sense to try to support natively in hardware.


    Like, the ideal version of the RV64 ISA in HW might look like:
    RV64I:
    Limit Xd for JAL and JALR to X0 and X1;
    If not X0/X1, trap-emulate.
    Mostly, feature-set of 'I' is "mostly sane".
    Well, excluding the encoding space inefficiency of JAL/LUI/AUIPC.
    Cheaper impl, makes sense to hard-wire Bcc's Rs2 to X0;
    Absent compiler support, very bad with trap-emulate.
    M:
    MULW makes sense;
    Kinda need a DMULW/DMULWU (32-bit widening multiply)
    DIVW: Used enough to be justifiable.
    MUL/DIV/REM: 64 bit forms not used often enough to justify.
    But, also not quite rare enough for trap-and-emulate.
    A:
    Better relegated to trap-and-emulate.
    F:
    Kinda unavoidable
    Would have preferred a Zfmin+D style approach as minimal case.
    D:
    It is what it is;
    FDIV.D and FSQRT.D and similar can be turned into traps.
    Zicsr:
    Most cases can trap (except what HW actually needs to support);
    Zifence:
    Trap-and-emulate.
    ...


    Almost making sense to have a sort of system-level trap-vector-table:
    TVTB: Trap Vector Table Base;
    JTVT: Jump to Trap-Vector-Table.

    Table would primarily be used for trap-and-emulate, where it could
    become desirable to have a 32-bit encoding here. These would be assumed illegal in the user level ISA.


    One possibility for encoding could, ironically, be to overload JAL or
    similar:
    JAL Disp20, X2 => JTVT Disp20
    Branching to TVTB+Disp20 rather than to PC+Disp20 (and then stomping the
    stack with the LR, can safely assume that this case is otherwise
    invalid...). Implicit assumption that one has under 1MB of handler thunks.
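
    A sketch of the next-PC selection being proposed (TVTB/JTVT are the
    names from above; treating rd==X2 as the trigger is the described
    overload, the rest is illustrative):

      #include <stdint.h>

      static uint64_t jal_target(uint64_t pc, uint64_t tvtb,
                                 unsigned rd, int64_t disp20)
      {
          if (rd == 2)                        /* X2: otherwise invalid as a
                                                 link register, reused as JTVT */
              return tvtb + (uint64_t)disp20; /* handler thunk within ~1MB */
          return pc + (uint64_t)disp20;       /* normal PC-relative JAL */
      }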


    But, dunno.


    ...


    It is quite an art to get something working in minimum clock cycles and
    fast clock frequency.


    Any OoO machine is also likely to have a lot of RAM and a decent sized
    I$, so much of any benefit is likely to go away in this case.

    s/go away/greatly ameliorated/

    ------------------------
    ILP is a property of a program. I assume that what you mean is that
    the IPC benefits of more width have quickly diminishing returns on
    in-order machines.


    The ILP is a property of the code, yes, but how much exists, and how
    much of it is actually usable, is affected by the processor implementation.

    I agree that ILP is more aligned with code than with program.
    {see above example where 1 instruction does the work of 5}


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Jan 1 23:17:04 2026
    From Newsgroup: comp.arch

    On 2026-01-01 6:17 p.m., BGB wrote:
    On 1/1/2026 12:13 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-12-31 12:12 p.m., MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    BGB <cr88192@gmail.com> posted:

    On 12/30/2025 1:36 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    [...]
    One usual downside is that to utilize a 16-bit ISA with a smaller
    register space, one needs to reuse registers more frequently, which then
    reduces ILP due to register conflicts. So, smaller code at the expense
    of worse performance.

    For designs like RISC-V C and Thumb2, there is always the option to
    use the uncompressed instruction. So you may tune your RISC-V
    compiler to prefer registers r8-r15 for those pseudo-registers that
    occur in instructions where such a register allocation may lead to a
    compressed encoding.

    Write-after-read and write-after-write do not reduce the IPC of OoO
    implementations. On the contrary, write-after-read may be beneficial
    by releasing the old physical register for the register name. And
    designing a compressed CPU instruction set for in-order processing is
    not a good idea for general-purpose computing.


    Though, the main place where compressed instructions are likely to
    bring meaningful benefit is on small in-order machines.

    Coincidentally; this is exactly where a fatter-ISA wins big::
    compare::

          LDD    R7,[IP,R3<<3,DISP32]

    1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against

    It is only 2 words

          AUPIC  Rt,lo(DISP32)
          SLL    Ri,R3,#3
          ADD    Rt,Rt,hi(DISP32)
          ADD    Rt,Rt,Ri
          LDD    R7,0(Rt)

    5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit
    minimum.

    This should be::

          AUPIC  Rt,hi(DISP32)
          SLL    Ri,R3,#3
          ADD    Rt,Rt,Ri
          LDD    R7,lo(DISP32)(Rt)

    4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum
    An even fatter ISA (Qupls4) in theory:

    LOAD r7, disp56(ip+r3*8)

    I could have shown the DISP64 version--3-words

    At 64-bits, displacements cease to make sense as a displacement.
    Seems to make more sense to interpret these as [Abs64+Rb] rather than [Rb+Disp64].

    Except, then I have to debate what exactly I would do if I decide to
    allow this case in XG2/XG3.


    As noted:
      [Rb+Disp10]: Usually scaled (excluding some special-cases);
      [Rb+Disp33]: Either scaled or unscaled.
        BGBCC is typically using unscaled displacements in this case.
          Unscaled range, +/- 4GB
          DW: +/- 16GB, QW: +/- 32GB
        XG2 and XG3 effectively left 1 bit extra, which indicates scale.
          0: Scaled by element size;
          1: Unscaled.
      [Rb+Disp64]: Would be understood as unscaled.
        TBD: Scale register (more likely to be useful, breaks symmetry);
        Unscaled register, preserves symmetry, but less likely useful.
          Would be consistent with the handling of RISC-V,
            which is always unscaled in this case.
        May be moot, as plain Abs64 would be the dominant case here.


    1 instruction + 1 postfix = 2 words (96 bits) 1 cycle + cache hit
    minimum

    The ISA is becoming a bit more stable now; the latest change was for
    constant postfix instructions. Qupls used to have a somewhat convoluted
    means of addressing constants on the cache-line. Now it's just
    postfixes. The constant routing information is in the postfix now which
    uses four bits. Two to select a register override, two to select
    constant quadrant. So, postfixes extend constants in the instruction (or
    previous postfix) by 36 bits.

    Qupls can do
    ADD r7, r8, $64_bit_constant

    Using only two words (96 bits) and just a single cycle.

    So can My 66000, but everyone and his brother thinks 96-bits is 3 words.


    So can XG2 and XG3.
      And, now, can add RV+JX to this category.

    Though, I am likely to still consider 96-bit ops as an extension of JX
    (as supporting them would be a much bigger burden on a 2-wide machine
    with a 64-bit instruction fetch; would require a 2-wide machine to still support 96-bit fetch).

    Well, and then there is another issue:
    RV64GC + 96-bit encodings, reintroduces another potential problem that existed in XG1:
    At certain alignments, the 96-bit fetch can cross a boundary of 2 half-
    line fetches with a 16B line size.

    Say, one letter per 16-bit word:
      AAAA-BBBB  //Line A
      CCCC-DDDD  //Line B
    Then (low 4b of PC):
      0: AAAABB
      2: AAABBB
      4: AABBBB
      6: ABBBBC //Violates two half-lines
      8: BBBBCC
      A: BBBCCC
      C: BBCCCC
      E: BCCCCD //Violates two half-lines

    Granted, the partial workaround is to fetch 144 bits internally (16-bits past the end of the half-line); which does technically "fix" the problem
    as far as architecturally-visible behavior is concerned.


    I would just make the fetch wider. Then use the extra fetch capacity to
    buffer instructions in a nano-cache. 16B is narrow. Qupls4 fetches 128B.
    Max instruction spanning is 48B, so there is always guaranteed to be room.
    Mod 3 fetches could also be done similar to even/odd fetches, requires
    another cache bank though.

    Or, just use the same "small brain" trick that BGBCC had used:
    If free-form variable length instructions, insert a NOP pad if we step
    on this turd;
    Or, for code sequences where this turd would be unavoidable (running
    through the WEXifier): Realign to 32 bits before entering WEX encoding (scenario can't happen if 32-bit aligned).


    Arguably, the latter scenario wouldn't have applied to RISC-V (and my JX
    encodings), except that (very recently) I did end up expanding BGBCC's
    WEXifier mechanism to cover RISC-V and XG3 (even if its role is slightly
    different in this case), but it does technically reintroduce the issue
    if targeting RV64GC.

    Though, currently, it is only enabled for RV64 if using RV64G and speed optimization.

    In this case, since RV64G and XG3 don't use explicit bundling, its role
    is instead to shuffle instructions to try to optimize how they fit in
    the pipeline.



    I prefer to use multiply "*" rather than shift in scaled indexed
    addressing as a couple of CPUs had multiply by five and ten in addition
    to 1,2,4,8. What if one wants to scale by 3?

    If you have the bits, why not.

    Higher resource cost and latency is a concern...

    There is not much in the AGEN so the latency for a small multiplier is probably okay. I was thinking of supporting multiply by 3, handy for
    RGB888 values, and multiply by six which is the size of an instruction.
    Maybe multiply by any value from 1 to 8 using three-bit encoding. It
    would probably be okay to use another bit for scaling.

    There are hundreds of DSP slices available, and only about 30 in use.
    Been wondering how to make good use of them.


    It is also possible to encode 128-bit constants, but the current
    implementation does not support them.

    Managed to get to some early synthesis trials and found the instruction
    dispatch to be on the critical timing path. I am a bit stumped as to how >>> to improve it as it is very simple already. It just copies from one set
    of pipeline registers to another headed towards the reservation
    stations. Tools report timing good to 37 MHz, I was shooting for at
    least 40.

    Found a couple of spots where the code was simple but too slow. One in
    dynamic register selection. The code was packing the register selections >>> to a minimum. But that was way too many logic levels.

    Those are some of the driving inputs to "An architecture is as much about
    what gets left out as what gets put in."

    Some amount of stuff I had added has ended up getting pruned again.

    In my case, the FPGA doesn't get bigger, nor faster.
    Adding a feature in one place may mean needing to drop some other lesser-used feature to free up resources or improve timing.

    Sometimes, the features don't get entirely removed from the ISA, but
    instead become a sort of lesser used "secondary feature":
      May be supported in hardware;
      Or, may be supported with trap-and-emulate.


    Sometimes the features dropped are subtle edge cases, say:
      For JAL and JALR in RISC-V:
        Is Xd ever anything other than X0 or X1?
        In theory, yes.
        In practice: Not so much.
    Starts to make sense to hard-wire the hardware support to only allow X0
    and X1 and to treat other cases as trap-and-emulate.

    Or, some other stuff within RV64G:
    In general, it makes sense, but even within RV64G there is stuff where it is debatable whether it makes sense to try to support natively in hardware.


    Like, the ideal version of the RV64 ISA in HW might look like:
      RV64I:
        Limit Xd for JAL and JALR to X0 and X1;
          If not X0/X1, trap-emulate.
        Mostly, feature-set of 'I' is "mostly sane".
          Well, excluding the encoding space inefficiency of JAL/LUI/AUIPC.
        Cheaper impl, makes sense to hard-wire Bcc's Rs2 to X0;
          Absent compiler support, very bad with trap-emulate.
      M:
        MULW makes sense;
          Kinda need a DMULW/DMULWU (32-bit widening multiply)
        DIVW: Used enough to be justifiable.
        MUL/DIV/REM: 64 bit forms not used often enough to justify.
          But, also not quite rare enough for trap-and-emulate.
      A:
        Better relegated to trap-and-emulate.
      F:
        Kinda unavoidable
        Would have preferred a Zfmin+D style approach as minimal case.
      D:
        It is what it is;
        FDIV.D and FSQRT.D and similar can be turned into traps.
      Zicsr:
        Most cases can trap (except what HW actually needs to support);
      Zifence:
        Trap-and-emulate.
      ...


    Almost making sense to have a sort of system-level trap-vector-table:
      TVTB: Trap Vector Table Base;
      JTVT: Jump to Trap-Vector-Table.

    Table would primarily be used for trap-and-emulate, where it could
    become desirable to have a 32-bit encoding here. These would be assumed illegal in the user level ISA.


    One possibility for encoding could, ironically, be to overload JAL or similar:
      JAL  Disp20, X2  => JTVT Disp20
    Branching to TVTB+Disp20 rather than to PC+Disp20 (and then stomping the stack with the LR, can safely assume that this case is otherwise invalid...). Implicit assumption that one has under 1MB of handler thunks.


    But, dunno.


    ...


    It is quite an art to get something working in minimum clock cycles and
    fast clock frequency.

    Any OoO machine is also likely to have a lot of RAM and a decent sized
    I$, so much of any benefit is likely to go away in this case.

    s/go away/greatly ameliorated/

    ------------------------
    ILP is a property of a program. I assume that what you mean is that
    the IPC benefits of more width have quickly diminishing returns on
    in-order machines.


    The ILP is a property of the code, yes, but how much exists, and how
    much of it is actually usable, is affected by the processor
    implementation.

    I agree that ILP is more aligned with code than with program.
    {see above example where 1 instruction does the work of 5}



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 2 18:48:13 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 1/1/2026 12:13 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-12-31 12:12 p.m., MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    BGB <cr88192@gmail.com> posted:

    On 12/30/2025 1:36 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    [...]
    One usual downside is that to utilize a 16-bit ISA with a smaller
    register space, one needs to reuse registers more frequently, which then
    reduces ILP due to register conflicts. So, smaller code at the expense
    of worse performance.

    For designs like RISC-V C and Thumb2, there is always the option to
    use the uncompressed instruction. So you may tune your RISC-V
    compiler to prefer registers r8-r15 for those pseudo-registers that
    occur in instructions where such a register allocation may lead to a
    compressed encoding.

    Write-after-read and write-after-write do not reduce the IPC of OoO
    implementations. On the contrary, write-after-read may be beneficial
    by releasing the old physical register for the register name. And
    designing a compressed CPU instruction set for in-order processing is
    not a good idea for general-purpose computing.


    Though, the main place where compressed instructions are likely to
    bring meaningful benefit is on small in-order machines.

    Coincidentally; this is exactly where a fatter-ISA wins big::
    compare::

    LDD R7,[IP,R3<<3,DISP32]

    1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
    It is only 2 words

    AUPIC Rt,lo(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,hi(DISP32)
    ADD Rt,Rt,Ri
    LDD R7,0(Rt)

    5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.

    This should be::

    AUPIC Rt,hi(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,Ri
    LDD R7,lo(DISP32)(Rt)

    4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum
    An even fatter ISA (Qupls4) in theory:

    LOAD r7, disp56(ip+r3*8)

    I could have shown the DISP64 version--3-words


    At 64-bits, displacements cease to make sense as a displacement.
    Seems to make more sense to interpret these as [Abs64+Rb] rather than [Rb+Disp64].

    I have heard arguments in both directions::

    a) DISP64 only contains 33-bits of actual information
    b) If DISP64 is absolute do you still need Rbase ??
    when you have Rindex<<scale
    c) how can the HW KNOW ?!?

    Except, then I have to debate what exactly I would do if I decide to
    allow this case in XG2/XG3.
    ------------------
    I prefer to use multiply "*" rather than shift in scaled indexed
    addressing as a couple of CPUs had multiply by five and ten in addition
    to 1,2,4,8. What if one wants to scale by 3?

    If you have the bits, why not.


    Higher resource cost and latency is a concern...

    Yes, your design is living on the edge.

    -------------------------
    Sometimes the features dropped are subtle edge cases, say:
    For JAL and JALR in RISC-V:
    Is Xd ever anything other than X0 or X1?
    In theory, yes.
    In practice: Not so much.

    For reasons like this, I only have

    CALL DISP26<<2 // call through DECODE
    and
    CALX [*address] // call through table
    and
    CALA [address] // call through AGEN

    which prevents compiler and assembler abuse.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 2 18:51:11 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-01 6:17 p.m., BGB wrote:
    On 1/1/2026 12:13 PM, MitchAlsup wrote:
    -----------
    There is not much in the AGEN so the latency for a small multiplier is probably okay. I was thinking of supporting multiply by 3, handy for
    RGB888 values, and multiply by six which is the size of an instruction. Maybe multiply by any value from 1 to 8 using three-bit encoding. It
    would probably be okay to use another bit for scaling.

    You can probably work in a 4-bit multiplier (2 layers of 4-2 compressors)
    But you sacrifice being able to route low order bits to SRAM decoders.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Jan 2 17:08:38 2026
    From Newsgroup: comp.arch

    On 1/2/2026 12:48 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 1/1/2026 12:13 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-12-31 12:12 p.m., MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    BGB <cr88192@gmail.com> posted:

    On 12/30/2025 1:36 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 12/29/2025 12:35 PM, Anton Ertl wrote:
    [...]
    One usual downside is that to utilize a 16-bit ISA with a smaller
    register space, one needs to reuse registers more frequently, which then
    reduces ILP due to register conflicts. So, smaller code at the expense
    of worse performance.

    For designs like RISC-V C and Thumb2, there is always the option to
    use the uncompressed instruction. So you may tune your RISC-V
    compiler to prefer registers r8-r15 for those pseudo-registers that
    occur in instructions where such a register allocation may lead to a
    compressed encoding.

    Write-after-read and write-after-write do not reduce the IPC of OoO
    implementations. On the contrary, write-after-read may be beneficial
    by releasing the old physical register for the register name. And
    designing a compressed CPU instruction set for in-order processing is
    not a good idea for general-purpose computing.


    Though, the main place where compressed instructions are likely to
    bring meaningful benefit is on small in-order machines.

    Coincidentally; this is exactly where a fatter-ISA wins big::
    compare::

    LDD R7,[IP,R3<<3,DISP32]

    1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
    It is only 2 words

    AUPIC Rt,lo(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,hi(DISP32)
    ADD Rt,Rt,Ri
    LDD R7,0(Rt)

    5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.

    This should be::

    AUPIC Rt,hi(DISP32)
    SLL Ri,R3,#3
    ADD Rt,Rt,Ri
    LDD R7,lo(DISP32)(Rt)

    4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum
    An even fatter ISA (Qupls4) in theory:

    LOAD r7, disp56(ip+r3*8)

    I could have shown the DISP64 version--3-words


    At 64-bits, displacements cease to make sense as a displacement.
    Seems to make more sense to interpret these as [Abs64+Rb] rather than
    [Rb+Disp64].

    I have heard arguments in both directions::

    a) DISP64 only contains 33-bits of actual information
    b) If DISP64 is absolute do you still need Rbase ??
    when you have Rindex<<scale
    c) how can the HW KNOW ?!?


    I have Disp33 and Abs64 as different encodings here.

    For RV64+JX:
    J21I+Ld/St: Rb+Disp33
    J52I+Ld/St: Abs64+Rb
    In the typical case, Rb would be X0 (Zero).

    In the ideal case, with a 48-bit adder, which is which wouldn't really
    matter. In my case, swapping them in the decoder allows it to keep on
    using the same 33-bit AGU (I had experimented with a 48-bit AGU, but as
    noted before, it has non-zero cost and has often been harder to justify).

    With the Disp33 AGU, this does quietly mean that, when swapped, the
    base register is limited to +/- 4GB.


    Though, can note that Zba on RISC-V requires full 64-bit logic for
    SHnADD; this could be routed through the ALU. But, to support parts of
    RISC-V, it would effectively imply migrating the displacement shift and
    input-zero-extension onto the Rt port. Don't really want to do this.


    Likely, could be a 3-bit field, say:
    000: Pass value as-is
    001: Value<<1
    010: Value<<2
    011: Value<<3
    100: ZExt32(value)
    101: ZExt32(value)<<1
    110: ZExt32(value)<<2
    111: ZExt32(value)<<3
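
    As a sketch, that 3-bit field decodes to a zero-extend-then-shift on
    the operand port (illustrative C, not any core's decode logic):

      #include <stdint.h>

      static uint64_t preprocess_operand(uint64_t v, unsigned f3)
      {
          if (f3 & 4)               /* cases 100..111: .UW-style forms */
              v = (uint32_t)v;      /* ZExt32 first */
          return v << (f3 & 3);     /* then shift left by 0..3 */
      }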

    Mostly because of the way the various RISC-V extensions (and proposals)
    are going, it implies supporting shift and zero extension on a bunch of
    different units (basically sticking the SHn and .UW parts on random
    stuff). Which is annoying, as it implies needing to deal with it at the
    register port.

    Then again, I think they are following ARM's pattern, where IIRC ARM
    does have this logic directly in the pipeline.


    As-is, I partly supported Zba by routing it through the AGU, but GCC
    can't use Zba because if the extension is enabled, GCC also uses it for
    64-bit integer operations, which is prone to break. Still probably need
    to fix this issue at some point.

    But, then need to debate which is the lesser evil:
    Add special case to allow full 64-bit ADD via AGU;
    Add shift and zero-ext input logic to the ALU;
    Add this logic to the ID2 stage (Register Fetch);
    ...

    And, if I go with one of the former two, what of possible extensions
    adding ".UW" instruction variants routed to places unrelated to the AGU
    and ALU?...

    Alas...


    Except, then I have to debate what exactly I would do if I decide to
    allow this case in XG2/XG3.
    ------------------
    I prefer to use multiply "*" rather than shift in scaled indexed
    addressing as a couple of CPUs had multiply by five and ten in addition
    to 1,2,4,8. What if one wants to scale by 3?

    If you have the bits, why not.


    Higher resource cost and latency is a concern...

    Yes, your design is living on the edge.


    I am not sure how it would be pulled off for larger displacements or
    more general scales.

    Though: 3, 5, 6, 9, 10
    Could be pulled off by adding two shifted inputs.
    Allowing: 1, 2, 3, 4, 5, 6, 8, 9, 10
    7 would be harder, but could be possible if the logic could express:
    (Ri<<3)+(~Ri)+1

    Would be easier if one makes a restriction, say, that NPOT scales are
    only allowed if the displacement is less than 64K or so.
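
    A sketch of those decompositions in C (two shifted copies of the
    index, plus the (Ri<<3)+(~Ri)+1 trick for 7; purely illustrative,
    not a netlist):

      #include <stdint.h>

      static uint64_t scale_index(uint64_t ri, unsigned scale)
      {
          switch (scale) {
          case 3:  return (ri << 1) + ri;          /* 2x + x */
          case 5:  return (ri << 2) + ri;          /* 4x + x */
          case 6:  return (ri << 2) + (ri << 1);   /* 4x + 2x */
          case 9:  return (ri << 3) + ri;          /* 8x + x */
          case 10: return (ri << 3) + (ri << 1);   /* 8x + 2x */
          case 7:  return (ri << 3) + ~ri + 1;     /* 8x - x, two's complement */
          default: return ri * scale;              /* 1,2,4,8: plain shift */
          }
      }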


    While on an FPGA, one can in theory use a DSP48 slice for fast multiply
    (at least up to 17u/18s bits), the logic for a DSP48 isn't itself all
    that low latency (and not so great if the bits need to get anywhere or
    do anything else within that clock cycle).


    Timing is annoying and unforgiving sometimes...

    This is, alas, why I am mostly doing everything at 50MHz:
    Far easier than 75 or 100.

    And, at 100 MHz, one is hard-pressed to do that much beyond an ISA like
    RV32IM or similar (with small/minimal caches, like aligned-only and with
    no internal store-to-load forwarding, ...).

    Where, store-to-load forwarding is kinda steep in terms of cost, but
    also has a significant performance impact in some cases (such as LZ4
    decoding). In effect, disabling this forwarding roughly halves the
    performance of an LZ4 decoder (using solely byte operations is a 6x
    slowdown, and combining both is around a 14x slowdown).


    Say:
    void _mem_cpy16bytes(void *dst, void *src)
    {
    byte *cs, *ct;
    cs=src; ct=dst;
    ct[ 0]=cs[ 0]; ct[ 1]=cs[ 1]; ct[ 2]=cs[ 2]; ct[ 3]=cs[ 3];
    ct[ 4]=cs[ 4]; ct[ 5]=cs[ 5]; ct[ 6]=cs[ 6]; ct[ 7]=cs[ 7];
    ct[ 8]=cs[ 8]; ct[ 9]=cs[ 9]; ct[10]=cs[10]; ct[11]=cs[11];
    ct[12]=cs[12]; ct[13]=cs[13]; ct[14]=cs[14]; ct[15]=cs[15];
    }
    Is, slow...

    The store-to-load forwarding penalty being because LZ4 decompression
    often involves copying memory on top of itself, and the possible
    workarounds for this issue only offer competitive performance for blocks
    that are much longer than the typical copy (in the common case of a
    match under 20 bytes, it often being faster to just copy bytes and eat
    the cost).

    if(dist>=16)
    {
    if(len>=20)
    { more generalized/faster copy }
    else
    { just copy 20 bytes. }
    }else
    {
    if(len>=20)
    { generate pattern and fill with stride }
    else
    { copy 20 bytes over itself. }
    }
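
    For reference, the reason the "just copy bytes" arms are correct on
    overlapping matches: a strictly forward byte copy replicates the
    pattern whenever dist < len. A minimal sketch (not the actual decoder):

      #include <stddef.h>
      typedef unsigned char byte;

      static void lz4_match_copy(byte *dst, size_t dist, size_t len)
      {
          const byte *src = dst - dist;  /* source may overlap the output */
          while (len--)
              *dst++ = *src++;           /* forward order repeats the pattern */
      }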


    -------------------------
    Sometimes the features dropped are subtle edge cases, say:
    For JAL and JALR in RISC-V:
    Is Xd ever anything other than X0 or X1?
    In theory, yes.
    In practice: Not so much.

    For reasons like this, I only have

    CALL DISP26<<2 // call through DECODE
    and
    CALX [*address] // call through table
    and
    CALA [address] // call through AGEN

    which prevents compiler and assembler abuse.


    They went and defined that you can use any register as a link register,
    but in practice there is basically no reason to use alternative link
    registers. ASM programmers could do so, but I have not seen all that
    much evidence of this being a thing thus far.

    Otherwise, JAL could have had a displacement of +/- 16MB for the same
    number of bits.

    Beyond just the encoding, there is non-zero implementation cost
    associated with allowing a free choice of link register (and in my case,
    also a non-zero performance cost; as effectively the branch predictor
    can only see the normal link register).

    There is some mention of people using call/return predictor stacks
    hidden in the CPU, but I had not done so.


    Rather, the branch predictor looks through the link register. Does
    generally mean that the link register needs to be the first register
    reloaded in the epilog, such that ideally its value completes the load
    before reaching the final JALR (a load immediately followed by a JALR
    will require going the slower route).

    Not entirely sure how a call/return stack would work in the case of a
    LD+JALR, as still (absent a pipeline stall or delayed logic) it wouldn't
    be possible to see a mispredicted return until after control-flow would
    have already reached back into the caller (and there basically being no
    good way to recover from this).

    Well, unless one has a bunch of write-back delay stages, so that it
    can roll the pipeline back to where it was at the point of the JALR
    several cycles after the JALR would have otherwise reached the WB stage
    (or, you need to stall at least until the LD finishes so that it can be
    confirmed whether or not it goes back to the expected place; so say, 5
    cycles vs 2 cycles).

    Well, say, vs my approach:
    LD X1, Disp(SP); ....; JALR X0, 0(X1)
    The JALR is 1 cycle (CPU can see no in-flight modifications to LR, so it
    turns into a predicted unconditional branch).

    But:
    LD X1, -8(SP); JALR X0, 0(X1)
    Yeah, enjoy those 13 or so clock cycles.

    Could add an intermediate case, where it sees the in-flight LR
    modification and predicts the value via an internal stack, but stalls
    may still be used as needed so that the JALR can at least see whether or
    not the guess was a mispredict.



    Otherwise, finally got around to extending the WEXifier logic in BGBCC
    to cover RISC-V and XG3. In this case, there is no bundling in these
    ISAs, so its role is to instead try to shuffle instructions to improve pipeline behavior.

    General compiler behavior is, as noted:
    First, generate the machine code in the usual way;
    Make passes over the instructions, looking for swaps that could improve efficiency.

    Looks over a sliding window of 10 or 12 instructions:
    4 preceding instructions (-4 to -1);
    4 active instructions (0 to 3);
    2 or 4 following instructions (*).
    Released code currently has 2 following,
    Had increased it to 4 in the working version.

    And, at each step, calculates cost functions for any penalties due to register-dependencies within the instructions;
    Also for scenarios where some of the instructions are swapped around
    (within said 4 instruction window);
    It also checks which instructions are allowed to be swapped around (such
    as register dependencies between instructions, etc), checking for things
    like labels and relocs (instructions may not be swapped across a label,
    and are locked in place if covered by a reloc), ...

    It basically makes multiple passes, and stops once a pass no longer swaps any instructions.

    This is because a swap in one place may change the cost tradeoff in
    another place, and each set of swaps can only try to evaluate for the
    local optimum.

    Window size needs to be kept modest mostly to avoid excessive hair and
    to allow the compiler to run at reasonable speeds.

    As noted, the current permutations it evaluates are (for A,B,C,D):
    Swapping A and B (B,A,C,D);
    Rotating A,B,C,D to B,C,A,D
    Rotating A,B,C,D to C,A,B,D
    Rotating A,B,C,D to B,C,D,A
    Rotating A,B,C,D to D,A,B,C
    Swapping A and C (C,B,A,D)
    Swapping A and D (D,B,C,A)
    This is 7 out of 24 possibilities, but evaluating all 24 would be steep.
    The others can be checked indirectly by the window advancing to the
    following positions.
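
    A skeleton of that windowed search, with a toy cost function standing
    in for the real one (legality checks for labels, relocs, and
    swap-blocking dependencies are elided; the lane-1 penalty is the one
    described in the following paragraph; this shows the shape, not
    BGBCC's source):

      #include <string.h>

      typedef struct { int dst, src1, src2, lane1_only; } insn;

      static int cost(const insn w[4])
      {
          int c = 0;
          for (int i = 1; i < 4; i++) {
              if (w[i].src1 == w[i-1].dst || w[i].src2 == w[i-1].dst)
                  c += 4;                  /* dependent on previous result */
              if (w[i].lane1_only && w[i-1].lane1_only)
                  c += 1;                  /* small lane-conflict nudge */
          }
          return c;
      }

      /* the 7 of 24 permutations listed above, as index maps */
      static const int perms[7][4] = {
          {1,0,2,3}, {1,2,0,3}, {2,0,1,3}, {1,2,3,0},
          {3,0,1,2}, {2,1,0,3}, {3,1,2,0}
      };

      static int improve_window(insn w[4])  /* 1 if a permutation applied */
      {
          insn t[4], best_w[4];
          int best = cost(w), improved = 0;
          for (int p = 0; p < 7; p++) {
              for (int i = 0; i < 4; i++) t[i] = w[perms[p][i]];
              if (cost(t) < best) {
                  best = cost(t);
                  memcpy(best_w, t, sizeof t);
                  improved = 1;
              }
          }
          if (improved) memcpy(w, best_w, sizeof best_w);
          return improved;
      }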

    It mostly evaluates for register-register dependencies, but also adds a
    small cost penalty for putting two lane-1-only instructions next to each other. Had to keep this penalty pretty small though as otherwise it ends
    up hurting more than it helps.


    The logic only works with 32-bit instructions (doesn't support 16-bit
    ops, and any 64/96 bit ops are effectively anchored in place).

    It is currently disabled for RV64GC, as while I was able to make it
    coexist with RV-C, they interact poorly enough to defeat any merit (it
    ends up with the performance penalty of RV64GC combined with the code
    density of RV64G; leaving it disabled, at least, doesn't render RV-C
    useless, even if using it does still come with worse performance).

    Making this stuff work with variable-length instructions isn't really something I am inclined to face at the moment (would add a fair bit of complexity).

    As noted, compiler does this because generally the hardware can't.



    One other recent change (to address possible bug conditions) was that
    if "volatile" is seen in the area, it assumes that any memory load/store
    ops may alias (it isn't clever enough to figure out which ops are
    associated with a volatile variable, only that there is a volatile
    variable somewhere in the vicinity and it needs to be careful).

    This is because, unlike with normal memory, things like swapping the
    relative order of volatile loads may have visible effects (say, if
    accessing a no-cache address or MMIO). This edge case had not been
    addressed previously.

    Otherwise, behavior is like:
    Load/Load:
    Assume No Alias;
    Else: Check for Alias;
    Rb1==Rb2:
    Alias if both have an overlapping offset range;
    Else, no alias.
    (Rb1==SP && Rb2!=SP) || (Rb1!=SP && Rb2==SP)
    Alias if an address has been taken of a local variable;
    Else: No alias
    (Rb1==GP && Rb2!=GP) || (Rb1!=GP && Rb2==GP)
    Alias if accessed a global whose address has been taken.
    Else: No alias
    (Rb1==GP && Rb2==SP) || (Rb1==SP && Rb2==GP)
    Never alias.
    Else: Assume potential alias.
    For uncorrelated memory accesses,
    there is no way to prove alias here.
    Though, there is other (different) alias handling in earlier stages.
    Caching array or member loads/stores happens in the IR stages.
    So, this is mostly concerned with the actual CPU instructions.
    ...

    Where the alias handling here is used to infer whether it can swap the
    relative order of memory access instructions.
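
    Condensed into a predicate, those rules look roughly like this (the
    taken-address flags stand in for earlier analysis results, and the
    load/load fast path is omitted; illustrative, not the compiler's code):

      typedef struct {
          int  base;        /* base register id */
          long off, size;   /* displacement and access size in bytes */
      } memref;

      enum { REG_SP = 2, REG_GP = 3 };

      /* placeholder analysis flags (conservatively set) */
      static int local_addr_taken  = 1;
      static int global_addr_taken = 1;

      static int may_alias(const memref *a, const memref *b)
      {
          if (a->base == b->base)              /* same base: range overlap */
              return a->off < b->off + b->size && b->off < a->off + a->size;
          if ((a->base == REG_SP) != (b->base == REG_SP))
              return (a->base == REG_GP || b->base == REG_GP)
                   ? 0                         /* SP vs GP: never alias */
                   : local_addr_taken;         /* stack vs other base */
          if ((a->base == REG_GP) != (b->base == REG_GP))
              return global_addr_taken;        /* global vs other base */
          return 1;                            /* uncorrelated: assume alias */
      }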


    Ironically, does help with both RV64 and XG3 performance, but widens the
    gap between them (helps performance more with XG3 than with RISC-V for whatever reason).

    Had needed a little debugging to try to make sure it didn't break stuff.
    Very subtle issues in this sort of logic can result in bugs and crashes.



    Started working on C11 "_Atomic" handling, currently it is being handled
    sort of like a volatile type but will also need to signal no-cache
    access for pointers, etc (normal volatile will merely use normal
    load/store operations, rather than atomic or no-cache memory access).

    In RISC-V, these cases can be encoded via the AMO instructions.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jan 3 02:05:15 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 1/2/2026 12:48 PM, MitchAlsup wrote:
    -----merciful snip----------
    I have heard arguments in both directions::

    a) DISP64 only contains 33-bits of actual information
    b) If DISP64 is absolute do you still need Rbase ??
    when you have Rindex<<scale
    c) how can the HW KNOW ?!?
    ------------------
    I prefer to use multiply "*" rather than shift in scaled indexed
    addressing as a couple of CPUs had multiply by five and ten in addition
    to 1,2,4,8. What if one wants to scale by 3?

    If you have the bits, why not.


    Higher resource cost and latency is a concern...

    Yes, your design is living on the edge.

    This, BTW is a compliment--the best an architect can do is make every
    stage of the pipeline have the same delay !!

    I am not sure how it would be pulled off for larger displacements or
    more general scales.

    Better adder technology. We routinely pound an 11-gate adder into the
    delay of 8×Fan4 gate delays.
    ------------------------------------
    Say:
    void _mem_cpy16bytes(void *dst, void *src)
    {
    byte *cs, *ct;
    cs=src; ct=dst;
    ct[ 0]=cs[ 0]; ct[ 1]=cs[ 1]; ct[ 2]=cs[ 2]; ct[ 3]=cs[ 3];
    ct[ 4]=cs[ 4]; ct[ 5]=cs[ 5]; ct[ 6]=cs[ 6]; ct[ 7]=cs[ 7];
    ct[ 8]=cs[ 8]; ct[ 9]=cs[ 9]; ct[10]=cs[10]; ct[11]=cs[11];
    ct[12]=cs[12]; ct[13]=cs[13]; ct[14]=cs[14]; ct[15]=cs[15];
    }
    Is, slow...
    Better ISA::

    MM Rto,Rfrom,#16

    and let HW do all the tricky/cool stuff--just make sure if you put it
    in you fully support all the cool/tricky stuff.

    The store-to-load forwarding penalty being because LZ4 decompression
    often involves copying memory on top of itself, and the possible
    workarounds for this issue only offer competitive performance for blocks that are much longer than the typical copy (in the common case of a
    match under 20 bytes, it often being faster to just copy bytes and eat
    the cost).

    if(dist>=16)
    {
    if(len>=20)
    { more generalized/faster copy }
    else
    { just copy 20 bytes. }
    }else
    {
    if(len>=20)
    { generate pattern and fill with stride }
    else
    { copy 20 bytes over itself. }
    }

    This is a problem easier solved in HW than in source code.


    For reasons like this, I only have

    CALL DISP26<<2 // call through DECODE
    and
    CALX [*address] // call through table
    and
    CALA [address] // call through AGEN

    which prevents compiler and assembler abuse.


    They went and defined that you can use any register as a link register,

    Another case where they screwed up.....

    but in practice there is basically no reason to use alternative link registers. ASM programmer people could do so, but not seen all that much evidence of this being a thing thus far.

    In Mc88k we recognized (and made compiler follow)
    JMP R1 // return from subroutine
    JMP ~R1 // switch
    -------------------
    Well, say, vs my approach:
    LD X1, Disp(SP); ....; JALR X0, 0(X1)
    The JALR is 1 cycle (CPU can see no in-flight modifications to LR, so it turns into a predicted unconditional branch).

    But:
    LD X1, -8(SP); JALR X0, 0(X1)
    Yeah, enjoy those 13 or so clock cycles.

    CALX R0,[address]
    ....

    Address is computed in normal AGEN, but processed in ICache, where it
    FETCHes wide data (128-bits small machine, whole cache line larger
    machine), and runs the result through Instruction buffer. 4 cycles.

    -------------
    Looks over a sliding window of 10 or 12 instructions:
    4 preceding instructions (-4 to -1);
    4 new instructions on previous predicted path (0 to 3);
    4 alternate instructions on current predicted path
    // so one can decode and issue non-sequential instructions
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Jan 3 17:37:47 2026
    From Newsgroup: comp.arch

    On 2026-01-02 9:05 p.m., MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 1/2/2026 12:48 PM, MitchAlsup wrote:
    -----merciful snip----------
    I have heard arguments in both directions::

    a) DISP64 only contains 33-bits of actual information
    b) If DISP64 is absolute do you still need Rbase ??
    when you have Rindex<<scale
    c) how can the HW KNOW ?!?
    ------------------
    I prefer to use multiply "*" rather than shift in scaled indexed
    addressing as a couple of CPUs had multiply by five and ten in addition
    to 1,2,4,8. What if one wants to scale by 3?

    If you have the bits, why not.


    Higher resource cost and latency is a concern...

    Yes, your design is living on the edge.

    This, BTW is a compliment--the best an architect can do is make every
    stage of the pipeline have the same delay !!

    I am not sure how it would be pulled off for larger displacements or
    more general scales.

    Better adder technology. We routinely pound an 11-gate adder into the
    delay of 8×Fan4 gate delays.
    ------------------------------------
    Say:
    void _mem_cpy16bytes(void *dst, void *src)
    {
    byte *cs, *ct;
    cs=src; ct=dst;
    ct[ 0]=cs[ 0]; ct[ 1]=cs[ 1]; ct[ 2]=cs[ 2]; ct[ 3]=cs[ 3];
    ct[ 4]=cs[ 4]; ct[ 5]=cs[ 5]; ct[ 6]=cs[ 6]; ct[ 7]=cs[ 7];
    ct[ 8]=cs[ 8]; ct[ 9]=cs[ 9]; ct[10]=cs[10]; ct[11]=cs[11];
    ct[12]=cs[12]; ct[13]=cs[13]; ct[14]=cs[14]; ct[15]=cs[15];
    }
    Is, slow...
    Better ISA::

    MM Rto,Rfrom,#16

    and let HW do all the tricky/cool stuff--just make sure if you put it
    in you fully support all the cool/tricky stuff.

    The store-to-load forwarding penalty being because LZ4 decompression
    often involves copying memory on top of itself, and the possible
    workarounds for this issue only offer competitive performance for blocks
    that are much longer than the typical copy (in the common case of a
    match under 20 bytes, it often being faster to just copy bytes and eat
    the cost).

    if(dist>=16)
    {
    if(len>=20)
    { more generalized/faster copy }
    else
    { just copy 20 bytes. }
    }else
    {
    if(len>=20)
    { generate pattern and fill with stride }
    else
    { copy 20 bytes over itself. }
    }

    This is a problem easier solved in HW than in source code.


    For reasons like this, I only have

    CALL DISP26<<2 // call through DECODE
    and
    CALX [*address] // call through table
    and
    CALA [address] // call through AGEN

    which prevents compiler and assembler abuse.


    They went and defined that you can use any register as a link register,

    Another case where they screwed up.....

    but in practice there is basically no reason to use alternative link
    registers. ASM programmer people could do so, but not seen all that much
    evidence of this being a thing thus far.

    In Mc88k we recognized (and made compiler follow)
    JMP R1 // return from subroutine
    JMP ~R1 // switch
    -------------------
    Well, say, vs my approach:
    LD X1, Disp(SP); ....; JALR X0, 0(X1)
    The JALR is 1 cycle (CPU can see no in-flight modifications to LR, so it
    turns into a predicted unconditional branch).

    But:
    LD X1, -8(SP); JALR X0, 0(X1)
    Yeah, enjoy those 13 or so clock cycles.

    CALX R0,[address]
    ....

    Address is computed in normal AGEN, but processed in ICache, where it
    FETCHes wide data (128-bits small machine, whole cache line larger
    machine), and runs the result through Instruction buffer. 4 cycles.

    -------------
    Looks over a sliding window of 10 or 12 instructions:
    4 preceding instructions (-4 to -1);
    4 new instructions on previous predicted path (0 to 3);
    4 alternate instructions on current predicted path
    // so one can decode and issue non-sequential instructions

    They could have put which GPR(s) is the link register in a CSR, if it
    was desired to keep the paradigm of generality. I started working on
    Qupls5 which is going to use a 32-bit ISA. The extra bits used to
    specify a GPR as a link register are better used as branch displacement
    bits IMO. I would be tempted to use two bits though to specify the LR,
    as sometimes a second LR is handy.

    A choice is whether to use GPRs as link registers. Not using a GPR gives
    an extra register or two for GPR use. Using dedicated link register(s)
    works well with a dedicated RET instruction. RET should be able to
    deallocate the stack. IMO using a dedicated link register is a bit like
    using an independent PC register. Or using a GPR for the link register
    is a bit like using a GPR for the PC.

    Qupls5 is going to use instruction fusing for compare-and-branch
    instructions. A compare followed by an unconditional branch will be
    treated as one instruction. That gives a 23-bit branch displacement.
    Otherwise, with a 32-bit instruction, a 12-bit branch displacement is
    not really enough for modern software. Sure, it works 90+% of the time,
    but it adds a headache to assembling and linking programs for when it
    does not work.

    Qupls5 will use constant postfixes which extend the constant by 22-bits
    for each postfix used. To get a 64-bit constant three postfixes will be required. Not quite as clean as universal constants, but simple to
    implement in hardware.

    Stuck on synthesis for Qupls4 which keeps omitting modules from the
    design. I must have checked the module inputs and outputs dozens of
    times, and do not know why they are excluded.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Jan 3 18:02:03 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Looking at the "18 RISC-V Compressed ISA V1.9" specification

    Someone asked for more and dynamic numbers. This work contains them
    in Section 1.9:

    |Table 1.7 lists the standard RVC instructions with the most frequent
    |first, showing the individual contributions of those instructions to
    |static code size and then the running total for three experiments: the
    |SPEC benchmarks for both RV32C and RV64C for the Linux kernel. For
    |RV32, RVC reduces static code size by 24.5% on Dhrystone and 30.9% on |CoreMark. For RV64, it reduces static code size by 26.3% on SPECint,
    |25.8% on SPECfp, and 31.1% on the Linux kernel.
    |
    |Table 1.8 ranks the RVC instructions by order of typical dynamic
    |frequency. For RV32, RVC reduces dynamic bytes fetched by 29.2% on |Dhrystone and 29.3% on CoreMark. For RV64, it reduces dynamic bytes
    |fetched by 26.9% on SPECint, 22.4% on SPECfp, and 26.11% booting the
    |Linux kernel.

    If you want the tables, look at source: <https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf>

    - anton

    I came across a paper which suggests various changes to the RVC ISA
    to improve the compaction rate based on the actual usage.
    Things like noting that register a5 is used with ADDI Add Immediate
    instruction 40% of the time. You could hard code the a5 into an
    opcode and use those bits in the immediate field.
    Or hard coding the link register for JAL.

    Reduce Static Code Size and Improve RISC-V Compression 2019 https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-107.pdf

    But RVC has been getting 50%+ compression rates since the start in 2011.
    Mostly this is fiddling around the edges for an extra 5% or so.

    Improving Energy Efficiency and Reducing Code Size with
    RISC-V Compressed 2011 https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-63.pdf


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jan 3 23:09:37 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-02 9:05 p.m., MitchAlsup wrote:
    ----------merciful snip-----------

    Looks over a sliding window of 10 or 12 instructions:
    4 preceding instructions (-4 to -1);
    4 new instructions on previous predicted path (0 to 3);
    4 alternate instructions on current predicted path
    // so one can decode and issue non-sequential instructions

    They could have put which GPR(s) is the link register in a CSR, if it
    was desired to keep the paradigm of generality. I started working on
    Qupls5 which is going to use a 32-bit ISA. The extra bits used to
    specify a GPR as a link register are better used as branch displacement
    bits IMO. I would be tempted to use two bits though to specify the LR,
    as sometimes a second LR is handy.

    In my opinion, you are correct, more displacement is a lot better than
    being able to specify a GPR. RISC-V did the specification so that the
    prologue could call a register save routine and the epilogue could call
    a register reload subroutine. Instead, what they should have done is
    build a small register shuffling state machine between register file
    and cache.

    A choice is whether to use GPRs as link registers. Not using a GPR gives
    an extra register or two for GPR use. Using dedicated link register(s)
    works well with a dedicated RET instruction. RET should be able to deallocate the stack. IMO using a dedicated link register is a bit like using an independent PC register. Or using a GPR for the link register
    is a bit like using a GPR for the PC.

    I went a tad further:: EXIT restores the preserved registers and <optionally::most of the time> transfers control back following call.
    The LDD IP is performed first, then the registers are reloaded, then
    the stack frame is deallocated. {This puts one in the position to
    discard popped cache lines, saving memory BW.}

    When an EXIT is in progress, one can fetch the instructions at the RET
    address, and if a CALL follows soon after, that subroutine's ENTER
    instruction can be short-circuited because the saved registers are
    in the same place on the stack from return to new entry point !!
    This not only saves memory bandwidth, it saves cycles, too.

    Qupls5 is going to use instruction fusing for compare-and-branch
    instructions. A compare followed by an unconditional branch will be
    treated as one instruction. That gives a 23-bit branch displacement.
    Otherwise, with a 32-bit instruction, a 12-bit branch displacement is
    not really enough for modern software. Sure, it works 90+% of the
    time, but it adds a headache to assembling and linking programs for
    when it does not work.

    My 66000 has yet to run into a subroutine that needs more than its
    16-bit word-displacement in its typical BC and BB instructions. {It
    will happen; it just does not happen early in SW development.}

    My 66000 WILL fuse CMP-BC instructions, too.
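
    As a schematic illustration of the decode-time fusion being discussed,
    a minimal C sketch follows; the opcode values and field layout below
    are made up for illustration, and are not the Qupls5 or My 66000
    encodings:

      #include <stdbool.h>
      #include <stdint.h>

      typedef struct { uint32_t raw; } insn_t;

      /* hypothetical opcode tests over a made-up 32-bit encoding */
      static bool is_cmp(insn_t i)    { return (i.raw & 0x7F) == 0x21; }
      static bool is_branch(insn_t i) { return (i.raw & 0x7F) == 0x63; }

      /* If a compare is immediately followed by a branch, issue the pair
         as one compare-and-branch macro-op; the branch half no longer
         needs register fields of its own, which is where the extra
         displacement bits come from. */
      bool try_fuse_cmp_branch(insn_t a, insn_t b, uint64_t *macro_op)
      {
          if (is_cmp(a) && is_branch(b)) {
              *macro_op = ((uint64_t)b.raw << 32) | a.raw;  /* one slot */
              return true;
          }
          return false;
      }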

    Qupls5 will use constant postfixes which extend the constant by 22
    bits for each postfix used. To get a 64-bit constant, three postfixes
    will be required (3 x 22 = 66 bits, enough to cover 64). Not quite as
    clean as universal constants, but simple to implement in hardware.

    Since REV 2.0 of My 66000, I am looking at the logic needed to derive
    instruction length. In Rev 1.0 it took 32 gates and 4 gates of delay.
    In REV 2.0 (as it stands now) it takes 25 (more regularity), but if
    STs with large constant data were discarded, the instruction-length
    decoder drops to 6 total gates and 2 gates of delay, with no fan-out
    or fan-in greater than 3.

    Stuck on synthesis for Qupls4 which keeps omitting modules from the
    design. I must have checked the module inputs and outputs dozens of
    times, and do not know why they are excluded.

    Good luck.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Jan 4 13:07:38 2026
    From Newsgroup: comp.arch

    BGB wrote:
    Say:
      void _mem_cpy16bytes(void *dst, void *src)
      {
        byte *cs, *ct;
        cs=src; ct=dst;
        ct[ 0]=cs[ 0];  ct[ 1]=cs[ 1];  ct[ 2]=cs[ 2];  ct[ 3]=cs[ 3];
        ct[ 4]=cs[ 4];  ct[ 5]=cs[ 5];  ct[ 6]=cs[ 6];  ct[ 7]=cs[ 7];
        ct[ 8]=cs[ 8];  ct[ 9]=cs[ 9];  ct[10]=cs[10];  ct[11]=cs[11];
        ct[12]=cs[12];  ct[13]=cs[13];  ct[14]=cs[14];  ct[15]=cs[15];
      }
    Is, slow...

    The store-to-load forwarding penalty being because LZ4 decompression
    often involves copying memory on top of itself, and the possible
    workarounds for this issue only offer competitive performance for
    blocks that are much longer than the typical copy (in the common case
    of a match under 20 bytes, it often being faster to just copy bytes
    and eat the cost).

      if(dist>=16)
      {
        if(len>=20)
          { more generalized/faster copy }
        else
          { just copy 20 bytes. }
      }else
      {
        if(len>=20)
          { generate pattern and fill with stride }
        else
          { copy 20 bytes over itself. }
      }

    In my own LZ4 implementation I was able to beat Google's version
    specifically due to a better implementation of repeated pattern fills:
    I use SSE/AVX with a set of swizzle tables, so that I can take
    1/2/3/4..16 bytes and repeat them as many times as possible into a
    32-byte target. I.e. if the pattern length is 3 then the table to use
    will contain 0,1,2,0,1,2,0,1,2...0,1,2,0 in the first 16-byte entry
    and then 1,2,0,1,2,0...1,2,0,1,2 in the second entry.
    Alongside this I of course have the stride length (30 in the above
    example), so that for long patterns I step forward by that much
    before doing another 32-byte store.
    The rest of their code was pretty good. :-)
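
    A minimal C sketch of this kind of swizzle-table pattern fill, using
    the SSSE3 PSHUFB intrinsic (_mm_shuffle_epi8) with 16-byte steps;
    this is a single-table variant of the idea, not Terje's actual code,
    and it builds the table entry on the fly rather than using a
    precomputed set (compile with -mssse3 or similar):

      #include <stddef.h>
      #include <stdint.h>
      #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

      /* Repeat a 'plen'-byte pattern (1..16) from 'src' across 'dst'
         using 16-byte stores. Assumes at least 16 readable bytes at
         'src' and room for up to a 16-byte overshoot past 'dst+n', the
         usual slack in LZ4-style inner loops. */
      void pattern_fill16(uint8_t *dst, const uint8_t *src,
                          int plen, size_t n)
      {
          uint8_t idx[16];
          for (int i = 0; i < 16; i++)
              idx[i] = (uint8_t)(i % plen);  /* one swizzle-table entry */

          __m128i shuf = _mm_loadu_si128((const __m128i *)idx);
          __m128i pat  = _mm_shuffle_epi8(
              _mm_loadu_si128((const __m128i *)src), shuf);

          /* stride = largest multiple of plen that fits in 16 bytes, so
             each store starts in phase with the pattern */
          size_t stride = (size_t)((16 / plen) * plen);
          for (size_t off = 0; off < n; off += stride)
              _mm_storeu_si128((__m128i *)(dst + off), pat);
      }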
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Jan 4 16:11:26 2026
    From Newsgroup: comp.arch

    On 1/4/2026 6:07 AM, Terje Mathisen wrote:
    BGB wrote:
    Say:
      void _mem_cpy16bytes(void *dst, void *src)
      {
        byte *cs, *ct;
        cs=src; ct=dst;
        ct[ 0]=cs[ 0];  ct[ 1]=cs[ 1];  ct[ 2]=cs[ 2];  ct[ 3]=cs[ 3];
        ct[ 4]=cs[ 4];  ct[ 5]=cs[ 5];  ct[ 6]=cs[ 6];  ct[ 7]=cs[ 7];
        ct[ 8]=cs[ 8];  ct[ 9]=cs[ 9];  ct[10]=cs[10];  ct[11]=cs[11];
        ct[12]=cs[12];  ct[13]=cs[13];  ct[14]=cs[14];  ct[15]=cs[15];
      }
    Is, slow...

    The store-to-load forwarding penalty being because LZ4 decompression
    often involves copying memory on top of itself, and the possible
    workarounds for this issue only offer competitive performance for
    blocks that are much longer than the typical copy (in the common case
    of a match under 20 bytes, it often being faster to just copy bytes
    and eat the cost).

      if(dist>=16)
      {
        if(len>=20)
          { more generalized/faster copy }
        else
          { just copy 20 bytes. }
      }else
      {
        if(len>=20)
          { generate pattern and fill with stride }
        else
          { copy 20 bytes over itself. }
      }


    In my own LZ4 implementation I was able to beat Google's version
    specifically due to a better implementation of repeated pattern fills:

    I use SSE/AVX with a set of swizzle tables, so that I can take
    1/2/3/4..16 bytes and repeat them as many times as possible into a
    32-byte target. I.e. if the pattern length is 3 then the table to use
    will contain 0,1,2,0,1,2,0,1,2...0,1,2,0 in the first 16-byte entry
    and then 1,2,0,1,2,0...1,2,0,1,2 in the second entry.

    Alongside this I of course have the stride length (30 in the above
    example), so that for long patterns I step forward by that much
    before doing another 32-byte store.

    The rest of their code was pretty good. :-)


    OK, though in my case, neither BJX2 nor RISC-V has any sort of
    efficient byte shuffle/swizzle instruction that could be useful for
    this case.


    Can fill in patterns sorta like:
      v0=*(u64 *)cs;
      switch(dist)
      {
      case 1:
        v0=v0&0xFF;
        v0|=v0<<8;
        v0|=v0<<16;
        v0|=v0<<32;
        v1=v0;
        stride=16;
        break;
      case 2:
        v0=v0&0xFFFF;
        v0|=v0<<16;
        v0|=v0<<32;
        v1=v0;
        stride=16;
        break;
      case 3:
        v0=v0&0xFFFFFF;
        v0|=v0<<24;    /* low 48 bits: bytes 0,1,2,0,1,2 */
        v1=v0>>16;     /* v1 starts at phase 2 of the pattern */
        v0|=v0<<48;    /* v0: bytes 0,1,2,0,1,2,0,1 */
        v1|=v0<<32;    /* v1: bytes 2,0,1,2,0,1,2,0 */
        stride=15;     /* largest multiple of 3 within 16 */
        break;
      ...
      }

    Then push this pattern out to memory and advance by the stride each
    time; but for short runs this can end up slower than naive unrolled
    byte copying, and the short runs are more common than the longer runs.

    As can be noted, with LZ4 the short cases could be detected based on the
    tag byte.

    Possibly:
      if(((tag+0x11)^tag)&0x88)
      {
        /* longer copying needed; by my reading, this nibble-carry test
           fires when either 4-bit length field is 7 or 15 */
      }else
      {
        /* short copies */
      }

    With the short-copy case possibly like (from memory):
      rl=(tg>>4)&15;    /* literal run length */
      ml=(tg&15)+4;     /* match length */
      v0=((u64 *)cs)[0];
      v1=((u64 *)cs)[1];
      cs+=rl;
      md=*(u16 *)cs;    /* match distance */
      ((u64 *)ct)[0]=v0;
      ((u64 *)ct)[1]=v1;
      ct+=rl;
      if((cs>=cse) || !md)
        { /* end of data */ break; }
      cs1=ct-md;
      if(md>=24)
      {
        v0=((u64 *)cs1)[0];
        v1=((u64 *)cs1)[1];
        v2=((u64 *)cs1)[2];
        ((u64 *)ct)[0]=v0;
        ((u64 *)ct)[1]=v1;
        ((u64 *)ct)[2]=v2;
      }else
      {
        ct[0]=cs1[0];
        ct[1]=cs1[1];
        ...
      }
      ct+=ml;

    or such...


    But, yeah, can note that I did go and get GLQuake working as an XG3
    build. This did require patching up some compiler holes, mostly
    related to the XG3 target and SIMD.

    A maybe-TODO is to try to get it working with a RISC-V build (with my
    compiler), though in this case I will either need to get more of the
    runtime call fallbacks implemented, or use some of my (possibly
    controversial, *1) non-standard SIMD.

    *1: Rather than RV-V, I instead implemented a SIMD system for RV
    based around treating the F0..F31 registers as pairs of Binary32
    values and similar, rather than as scalar Binary32 values. Generally
    much easier to deal with.

    SIMD being sort of needed for the "make OpenGL not perform like dog
    crap" thing.

    I did previously build GLQuake for RV64G via GCC, but this is slow
    enough to basically be entirely unusable.


    At least in theory, BGBCC's SIMD extensions for C should work even
    without dedicated HW support, but this depends some on fallback
    wrapper functions, many of which are still missing for RV-based
    targets. Where, as noted, the basic extensions take a form more like
    GLSL but mapped to C. For TKRA-GL, there were wrappers to allow using
    either dedicated SIMD or struct-based fallbacks (or Intel/MSVC or
    GCC-style as well).

    BGBCC sorta supports GCC-style SIMD as well
    ("__attribute__((vector_size(16)))" and such), though there are
    differences (GCC's still seems to depend on native hardware support,
    though it is a little higher abstraction than the Intel/MSVC style,
    which was mostly based around builtin/intrinsic functions). A few
    things were borrowed from the latter as well, but repurposed (for
    example, "__m64" and "__m128" serve more as untyped vector types for
    casting between other types; and to the extent builtins are used,
    they don't necessarily map 1:1 with CPU instructions).
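
    For reference, a minimal example of the GCC/Clang vector extension
    being referred to (standard GCC syntax on a reasonably recent
    compiler; nothing BGBCC-specific):

      #include <stdio.h>

      /* a 16-byte vector of four floats (GCC/Clang vector extension) */
      typedef float v4sf __attribute__((vector_size(16)));

      int main(void)
      {
          v4sf a = {1.0f, 2.0f, 3.0f, 4.0f};
          v4sf b = {5.0f, 6.0f, 7.0f, 8.0f};
          v4sf c = a + b;   /* element-wise add, no intrinsics needed */
          printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
          return 0;
      }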

    One limitation is that it only supports vector sizes that map to those supported by BGBCC, say:
    4x Binary16
    4x Binary32
    2x Binary32
    2x Binary64

    Possibly, could add an 8x Binary16 vector, but I don't have an
    immediate use case, and as noted it would be non-native (though, in
    this case, non-native types would still generally be faster than
    scalar code).



    Granted, the lack of need for HW support won't be proven until it
    works on unmodified RV64G, which may need work (looks like some of
    the code isn't checking for SIMD instruction support before trying to
    use it; will need to fix this). Likewise for features which exist
    with BJX2 (or XG3) but don't exist with RV64.

    And, it still does require the target-specific part of going and
    writing out ASM for the various support functions (typically
    implemented internally using scalar instructions).

    Well, partly because I am not inclined to have different versions of
    the GL library depending on whether it is built to use BGBCC's SIMD
    or a generic fallback.



    Would be nice if there was some sort of convention here...

    Will probably need to subdivide the RV SIMD support as to whether it
    supports 128-bit vectors or only 64-bit vectors (a cost-optimized
    implementation might only want to deal with 64-bit vectors, i.e. 2x
    Binary32).




    Not sure how OpenGL worked out on early SGI systems (such as the SGI
    Iris 4D); based on my own experience, I would expect the situation to
    have been "not good".

    There are examples of GLQuake and Quake 3 and similar running on an
    Indigo 2, but these would have been much faster machines (both with
    faster clock speeds and stronger dedicated graphics hardware), so would
    not likely have had as much trouble with Quake.

    From some demo examples of other 3D on the Iris 4D, it looks like its
    3D wasn't particularly fast by modern standards. So, it probably
    couldn't have run Quake all that well.


    Ironically, when building for XG3, a few of the top places in the
    Quake profile are some of the "mathlib" functions, which in this case
    were falling back to generic C versions, mostly built around working
    with "float *" pointers.

    In the past I had ended up writing ASM versions of DotProduct and
    BoxOnPlaneSide and similar, but in this case it falls back to the
    generic versions.


    Otherwise, generally feeling kinda lonely at the moment.
    Working on stuff doesn't entirely compensate for general feelings of pointlessness.


    It is this or working on sci-fi stories of mine, where I had ended up
    partly using AI to give commentary, partly as I seemingly can't get
    any actual humans to comment on my fiction.

    Apparently, I guess as far as the AI was concerned on one of the
    stories I threw at it (one about a Moon AI):
    General ideas/themes were compared to "Asimov's Foundation Series";
    But, the story being more character-focused was more typical of soft
    sci-fi than hard sci-fi (which is primarily about world-building and
    politics);
    Seemingly doesn't pick up on things which were implied but not stated
    directly;
    Compared the ending to being more like something out of a Douglas
    Adams story;
    Also noted there being some amount of transhumanism and biopunk
    elements;
    etc.

    I am like, OK, fair enough. My influences are varied, but mostly stuff
    like "Ghost in the Shell" and "Mega Man" and similar (both of which have
    a lot trans-humanism themes). It seemingly misses some implicit things
    about how the plot fits together (like the plot relevance of a detour
    into talking about a character doing the whole E-Sports / Pro-Gamer
    thing), or why the conclusion had "Raggedy Ann" references, etc.

    Also doesn't seem to understand that most of the story isn't written
    from the perspective of "an all-knowing external observer", but
    mostly jumps between the perspectives of the characters within the
    story (so, some things would be vague or seemingly inconsistent
    because they would be based on what that character knows, rather than
    what the author or reader knows).

    Well, and it missed this when I threw in the follow-up story, where
    some vagueness about how things were described wasn't a deliberate
    change, but more because people watching from Earth would not have
    known everything that was going on in the preceding story.

    Granted, maybe I could try to make this more obvious, and maybe it is
    asking a bit much to expect Grok to understand sort of a "multiple
    perspectives" writing style (well, plus some difficulty of describing
    the thinking of the "Moon AI", which is in some sense supposed to be
    more intelligent than myself, but rather than being a singular fully
    unified consciousness instead exists by fragmenting into large
    numbers of very-similar mildly-superhuman consciousnesses which
    mostly all then work collaboratively).


    Then again, probably not that much different from the normal human
    experience, say, where one identifies as a singular identity, but may
    exist as multiple semi-independent streams of thought which might
    themselves have quirks (say for example, one part being more
    rational-oriented and skeptical; another more emotional and fanciful; a
    third which mostly likes chasing rabbit holes wherever they go; ... all
    sort of funneling their thoughts into the greater whole).

    Then again, I am not entirely sure how most people experience their
    own existence; there doesn't seem to be much description of this.



    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 4 22:30:17 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 1/4/2026 6:07 AM, Terje Mathisen wrote:
    -----merciful snip-----------

    Can fill in patterns sorta like:
    v0=*(u64 *)cs;
    switch(dist)
    {
    case 1:
    v0=v0&0xFF;
    v0|=v0<<8;
    v0|=v0<<16;
    v0|=v0<<32;

    That was 1 instruction in my Samsung GPU ISA

    SWIZ V1,V0,#[0,8,16,24]

    The immediate was broken into 4-bit fields (represented above with
    8-bit fields) and was used as 8 4-bit Mux selectors from the other
    operand. {Not using Samsung ISA names or syntax}.

    When I looked deeply into the situation, it was easier in HW to do::

    for( i = 0; i < 8; i++ )
    out[field[i]] = in[i]

    than::
    for( i = 0; i < 8; i++ )
    out[i] = in[field[i]]

    For some reason we called this swizzle not permute !?!
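
    A small C model of the two directions, to make the distinction
    concrete (function names here are mine, not the Samsung ISA's):

      #include <stdint.h>

      /* scatter-style (the direction the HW implemented): input lane i
         is routed to output lane field[i]; each output lane is driven
         by at most one input when field[] has no collisions */
      void swiz_scatter(uint8_t out[8], const uint8_t in[8],
                        const uint8_t field[8])
      {
          for (int i = 0; i < 8; i++)
              out[field[i]] = in[i];
      }

      /* gather-style permute: output lane i selects input lane
         field[i]; one input may fan out to many outputs (e.g. a splat) */
      void swiz_gather(uint8_t out[8], const uint8_t in[8],
                       const uint8_t field[8])
      {
          for (int i = 0; i < 8; i++)
              out[i] = in[field[i]];
      }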
    -------------------
    Otherwise, generally feeling kinda lonely at the moment.
    Working on stuff doesn't entirely compensate for general feelings of pointlessness.

    Seriously; get help.

    It is this or working on sci-fi stories of mine, which I had ended up
    partly going and using AI to give commentary partly as I seemingly can't
    get any actual humans to comment on my fiction.

    You are not the first purported author in this position

    Apparently, I guess as far as the AI was concerned on one of the stories
    I threw at it (one about a Moon AI):
    General ideas/theme were compared to "Asimov's Foundation Series";

    You do know that there are only 39 plots in all of literature ?!?

    <snip>
    I am like, OK, fair enough. My influences are varied, but mostly stuff
    like "Ghost in the Shell" and "Mega Man" and similar (both of which have
    a lot trans-humanism themes). It seemingly misses some implicit things
    about how the plot fits together (like the plot relevance of a detour
    into talking about a character doing the whole E-Sports / Pro-Gamer
    thing), or why the conclusion had "Raggedy Ann" references, etc.

    Hint: use fewer words and paragraphs to convey the same amount of information. Works in literature and in this NG.
    ------------------
    Then again, I am not entirely sure how most people experience their own existence, there doesn't seem to be much description of this.

    I am 100% sure it is not as conveyed by Dickens !
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Jan 5 04:09:34 2026
    From Newsgroup: comp.arch

    On 1/4/2026 4:30 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 1/4/2026 6:07 AM, Terje Mathisen wrote:
    -----merciful snip-----------

    Can fill in patterns sorta like:
    v0=*(u64 *)cs;
    switch(dist)
    {
    case 1:
    v0=v0&0xFF;
    v0|=v0<<8;
    v0|=v0<<16;
    v0|=v0<<32;

    That was 1 instruction in my Samsung GPU ISA

    SWIZ V1,V0,#[0,8,16,24]

    The immediate was broken into 4-bit fields (represented above with
    8-bit fields) and was used as 8 4-bit Mux selectors from the other
    operand. {Not using Samsung ISA names or syntax}.

    When I looked deeply into the situation, it was easier in HW to do::

    for( i = 0; i < 8; i++ )
    out[field[i]] = in[i]

    than::
    for( i = 0; i < 8; i++ )
    out[i] = in[field[i]]

    For some reason we called this swizzle not permute !?!

    Dunno.

    I had something similar for 16-bit elements, but it only has an
    immediate form, and I called it "shuffle".

    For LZ compression, it might make sense to have a 3R 8-element byte
    shuffle though. Or maybe a shuffle across register pairs (sort of
    like the SHUF ops that existed in SSE).


    Still, it mostly works; LZ4 decompression is still faster than IO
    from an SD card (still typically 12.5 MHz SPI or similar, i.e. about
    1.56 MB/s).


    -------------------
    Otherwise, generally feeling kinda lonely at the moment.
    Working on stuff doesn't entirely compensate for general feelings of
    pointlessness.

    Seriously; get help.


    As far as I understand it, this sort of feeling is pretty much
    standard though.

    Though, the usual idea is that a person can work on stuff and not
    have things feel pointless.

    Then again, I suspect usual idea is that people would have some level of
    IRL social life, or a dating life / significant other, etc. My case, not
    so much...


    Granted, git traffic has gotten a bit sparse. Typically don't get more
    than a handful of cloners per week at this point (there was a time a few
    years ago when I was seeing somewhat higher git clone traffic).


    It is this or working on sci-fi stories of mine, which I had ended up
    partly going and using AI to give commentary partly as I seemingly can't
    get any actual humans to comment on my fiction.

    You are not the first purported author in this position


    I previously wrote and published a few E-Book stories, but no one cared,
    so no sales.

    I tried giving links in various times and places, but generally no one
    offers much actual feedback.


    Getting Grok to review a story and write commentary at least works,
    it seems. Or, I can give it text and ask questions about it, and use
    this to gauge whether things are sufficiently obvious (never mind
    when Grok itself screws up in obvious ways; almost like a human in
    that it is overly confident that it is correct, even when it is
    obviously wrong).

    Usually better than the AI feature in Google though.

    Like, Grok has seemingly crossed into "smart enough to be useful",
    but not really enough that I am particularly inclined to use it for
    code. For something simple, it works, but anything non-trivial and it
    still often falls on its face.

    But, what it lacks in general smartness is sort of offset by going at
    superhuman speeds (like, it writes stuff at around 500x faster than a
    human could write).


    Though, seemingly (contrary to stereotypes), interacting with it
    still doesn't "feel" all that much like interacting with a person.
    Like, it still feels like interacting with a machine, even despite it
    otherwise generating human-like text.



    Apparently, I guess as far as the AI was concerned on one of the stories
    I threw at it (one about a Moon AI):
    General ideas/theme were compared to "Asimov's Foundation Series";

    You do know that there are only 39 plots in all of literature ?!?


    Possibly.


    I guess "AI concerned with the long-term survival of Earth-based life,
    so takes measures to preserve its existence past the eventual demise of
    the solar system; proceeds to lead towards long distance interstellar colonization efforts." had some amount of overlap with Asimov's writings...

    Then again, borrowed some ideas from things like the Mega-Man games and Neuromancer and similar which apparently also borrowed things from Asimov.


    I had not imagined the possible emergence of any sort of galactic
    empire though; basically, the formation of any sort of interstellar
    power structure is mostly precluded by speed-of-light limits.


    Like, as-is, it would take longer than the current age of human
    civilization for a single message to make it across the galaxy.
    And, within this setting, I am assuming a non-existence of FTL.

    I am assuming people use a lot of nuclear-powered plasma engines for
    interplanetary trips, though.

    In story, Moon AI develops fusion-powered spacecraft (with a hybrid of a fusion reactor and particle accelerator as an engine). But, say, normal
    humans don't have these.

    Could have gone for nuclear-thermal rockets or similar (as an
    intermediate option), but I didn't. Likely less practical for "planet
    hopping" than plasma engines, nor likely to be viable for
    relativistic travel; they would have high thrust, but to get to
    relativistic speeds one needs exhaust traveling at ~0.99c or similar
    (so, less "very hot hydrogen" and more "focused beam of nearly
    light-speed protons"; not many protons, but one can get thrust trying
    to push them up near light speed...).

    Not seen many sci-fi stories going the "use a particle accelerator as an engine" route though...



    Well, and say, for people on Mars, no real-time internet traffic to
    Earth (more just bidirectional email-like communication).

    ...



    Meanwhile, Grok had doubts that an advanced AI could rewrite organic
    life enough to use 8 base pairs and eat rocks. I would think this would
    be one of the less implausible things I have seen in a lot of Sci-Fi.


    Well, and disagreements over which sorts of gases would be easier to
    obtain on Mars (and use for "air" in a colony):
    My thinking: the air is heavily adulterated with SF6 and CF4, as N2
    is a lot more sparse. Most of what N2 it had flew off into space, and
    there is little N2 in the rocks, since most nitrogen-containing
    molecules are a result of organic processes, which are lacking on
    Mars. Whereas, say, SF6 and CF4 and similar can be synthesized from
    Martian regolith (which is comparatively rich in sulfur and fluorides
    and similar).

    Then Grok disagrees, arguing that SF6 and CF4 would require chemical synthesis, vs mechanical extraction of N2 and Ar via an air compressor.
    But, I am like, 1% of crap-all isn't much.

    Would probably need to use something like a turbomolecular pump just
    to build up enough pressure on the input side to be able to use
    something like a reciprocating pump or scroll pump, and then like 96%
    of what you are getting is CO2 (with a lot of the rest being argon,
    but one could at least use the argon).

    So, I disagree, thinking that doing chemistry magic on the soil to
    synthesize gas is likely to be able to more readily produce the volumes
    of gas needed to pressurize a Mars colony (in the absence of sending big
    N2 canisters from Earth).

    One thing Mars has a lot of is dirt and rocks.

    Lots of perchlorates, but none that really have nitrogen or similar.
    ...

    But then, say, one can be like, well, what if we have, say:
    40% O2, 20% CF4, 20% SF6, 10% N2, 10% Ar
    Then, say, 8 PSI.

    Theoretically, humans should at least be able to survive in this (40%
    of 8 PSI gives about 3.2 PSI of O2 partial pressure, close to the
    roughly 3 PSI of O2 at sea level)...

    Or, do 4 PSI at 100% O2, but this has its own drawbacks. Makes sense
    for EVAs, but ideally one wants higher pressure for extended
    occupation, and then one needs an inert filler gas.

    But, yeah...


    Contrast something like "Star Trek" which is all like "Warp Drives and
    Gravity Plating".

    And, I am sitting around being like, "Where do they get enough gas to
    keep their colonies pressurized?..."




    <snip>
    I am like, OK, fair enough. My influences are varied, but mostly stuff
    like "Ghost in the Shell" and "Mega Man" and similar (both of which have
    a lot trans-humanism themes). It seemingly misses some implicit things
    about how the plot fits together (like the plot relevance of a detour
    into talking about a character doing the whole E-Sports / Pro-Gamer
    thing), or why the conclusion had "Raggedy Ann" references, etc.

    Hint: use fewer words and paragraphs to convey the same amount of information.
    Works in literature and in this NG.

    Well, seemingly one part of my mind is not happy unless things are
    described in enough detail to avoid possible ambiguity...

    And, in interacting with LLMs, they go hopelessly off-track unless one describes things in enough detail that there is little room for ambiguity.

    I am not great at writing things that are both concise and unambiguous.


    ------------------
    Then again, I am not entirely sure how most people experience their own
    existence, there doesn't seem to be much description of this.

    I am 100% sure it is not as conveyed by Dickens !

    Looking it up, I am not sure what Dickens was going on about.



    Decided not to go into it, but in my case it seems like I can carry
    out several semi-independent chains of thought at the same time,
    though there seems to be a divide of "character": like, they are all
    part of myself, and share my same overall identity and experience,
    but they have slight differences in terms of perspective and don't
    always fully agree with each other.

    This differs some from descriptions which make it seem like a
    person's experience is "fully singular", with a single chain of
    thought that goes step-by-step, rather than, say:
    One that is more concerned with rationality and tends to be more
    skeptical of things;
    One that is more fanciful and sees reality as more flexible;
    Another that tends to go down rabbit holes and gets fussy whenever there
    is ambiguity;
    Then "I" exist more as a composite of all of them.



    Then, seemingly "spaces":
      Inner world, more abstract;
        Usually things are more static, abstract,
        mental imagery here is primarily monochrome.
        It is fully 3D though.
        Most imagery looks like light edges across a dark background.
        Like wireframes, or line-art cartoons.
        Color is rare, but exists, and seems to be organized into levels:
          Black/White, dark and light gray (primary)
          cyan, magenta, yellow
          blue, green, red (rarer)
        Seems mostly aligned with a 16-color palette.
      Sensory layers:
        My sensory processing seems to have an "onion like" structure.
          Image layer (actual image / senses);
          Edge layer (mostly edges and shapes);
          Object layer (has objects and similar).
        Seems to update slowly (around 5 Hz).
      Audio is similar:
        Tones / frequencies / basic audio features;
        Verbal sounds/etc;
        Words;
        Also a separate "echo space" that mimics the general area;
        And an audio-tactile space which mostly handles low frequencies.
      Tactile:
        Doesn't have much subdivision;
        Seems to have a lower latency compared with visual processing.

    Visual processing seems to be kinda slow for me, with a noticeable
    lag time, and sometimes the senses can get out of sync with each
    other (which is annoying).

    Usually, the tactile sense comes first, followed by audio, followed
    by vision. Like, say, one feels the effects of something happening
    before getting the visual impression of the thing that happened. A
    lot of the time the delay seems to be around 500ms or so (and when
    the delay is within 200ms my mind seems better able to pull things
    back into alignment).


    Or, for me:
      Fluid motion perceived: ~16 Hz / 62ms
      Strong awareness of low frame-rate: ~12 Hz / 83ms
      Perception of motion breaks down: ~5 Hz / 200ms

    Below 5 Hz, I tend to see a series of discrete images.
    At 8-12 Hz, I am aware of the choppy movement, but can still perceive
    motion.
    At 16-30 Hz, stuff looks like fluid motion, but gets "smoother" as
    the framerate increases.
    Much over 30, there is no longer much difference.

    Had noted though that at lower framerates there is still a visual
    artifact with movement, namely that as an object moves (more so if
    there is strong contrast), it may have an unstable shimmering edge at
    the leading and trailing edges of the movement.

    Say, for example, a white circle over a black background has a
    brighter leading edge and darker trailing edge, or sometimes a
    momentary "smearing" effect. This effect is most obvious with high
    contrast, such as black/white or red/green.

    The effect mostly goes away if there is some level of motion blur
    applied though.


    A vaguely similar effect happens at high light levels, such as direct sunlight, where any bright surfaces will leave dark trails behind them
    (and in bright conditions, particularly if things are moving around,
    enough trails can build up to make it difficult to see).

    Had noted that these effects can be greatly reduced with dark
    sunglasses (Shade 4 or 5) or welding goggles (Shade 5 works well),
    though Shade 7 or 8 is too dark and impedes visibility (also, Shade 5
    is a little dark for indoor use).

    My vision also tends to have a persistent low-level snow effect (sort of
    like TV static).

    And, sometimes (occasionally) artifacts resembling low-res raster
    patterns; grid patterns, or sometimes text (like, some sort of unstable
    mass of text-like and letter-like features).

    Like, almost like some sort of low-level hallucination, except the
    things being hallucinated are unstable grids of what resembles ASCII
    text, usually embedded within otherwise grainy or noisy backgrounds.

    Seemingly, it is vaguely similar to pareidolia, but:
    Often results in letters rather than faces;
    Most often associated with very noisy visual input;
    Can sometimes manifest unstable forms of other objects.

    Some other effects sometimes look like horizontal or vertical black
    and white bars, etc. Also seems weird. Like, being human, one would
    expect sensory glitches to be more "natural" or "organic" (as opposed
    to patterns that look almost computer-like).

    Like, it almost makes me wonder if looking at a computer enough can
    alter one's visual processing in such a way as to produce vaguely
    computer-like sensory glitches.

    Like, some I have seen before, say, some resembling the sorts of
    compression artifacts one gets with an aggressively quantized Haar
    transform, though it is unclear why any Haar-like features would
    apply to human visual processing, etc.




    I have noted that it seems hit or miss whether optical illusions work
    for me:
      Weak or little effect:
        "Hollow Face"
          Sometimes it seems like it starts to work, but then collapses.
          Mostly I just see the rotating face mask or similar.
          But, admittedly, it does still look a little weird.
        "Motion induced blindness"
        Many illusions based on lighting, shading, or implied colors.
      Medium effect:
        "Duck-Rabbit"
          But, more rapidly flickers back and forth,
          with "waves" that spread across the object,
          with the waves fighting over which it is.
          I tend not to see one or the other,
          more just rapid/unstable flickering.
          Similar effect with figure-ground illusions.
      More effective:
        Ones based on geometric line patterns.
          Such as "Kanizsa's Triangle"
          Do sort of see a flickering implied triangle.
        Implied colors from patterns of other colors (like in Bayer
        arrays).
          But, this is "weaker"; Gaussian blur may achieve the same
          effect.
        Things like mach bands and similar.
          Banding artifacts are very obvious.
          But, yeah, I think this is why we dither gradients...
        "Hermann Grid Illusion"

    Well, along with a whole lot of other weirdness.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Jan 5 13:22:15 2026
    From Newsgroup: comp.arch

    On Mon, 5 Jan 2026 04:09:34 -0600
    BGB <cr88192@gmail.com> wrote:
    On 1/4/2026 4:30 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:


    ------------------
    Then again, I am not entirely sure how most people experience
    their own existence, there doesn't seem to be much description of
    this.

    I am 100% sure it is not as conveyed by Dickens !

    Looking it up, I am not sure what Dickens was going on about.


    "A wonderful fact to reflect upon, that every human creature is
    constituted to be that profound secret and mystery to every other."
    I suppose that Sartre wanted to say something else by "L'enfer, c'est
    les autres" ("Hell is other people"). Maybe an exact opposite.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Jan 5 13:21:15 2026
    From Newsgroup: comp.arch

    On 1/5/2026 5:22 AM, Michael S wrote:
    On Mon, 5 Jan 2026 04:09:34 -0600
    BGB <cr88192@gmail.com> wrote:

    On 1/4/2026 4:30 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:


    ------------------
    Then again, I am not entirely sure how most people experience
    their own existence, there doesn't seem to be much description of
    this.

    I am 100% sure it is not as conveyed by Dickens !

    Looking it up, I am not sure what Dickens was going on about.



    "A wonderful fact to reflect upon, that every human creature is
    constituted to be that profound secret and mystery to every other."

    I suppose that Sartre wanted to say something else by "L'enfer, c'est
    les autres" ("Hell is other people"). Maybe an exact opposite.


    It gets confusing, but I will at least admit that I don't understand
    most people.


    As I can note, my mind seems to be partly divided into multiple
    sub-personas, with different properties, but it is awkward to
    describe them. Socially, people are expected to see themselves as a
    singular entity; admittedly, whatever this notion of "singularness"
    is, it isn't particularly strong in my case. I would still be
    classified as self-aware though, as I do recognize my reflection in a
    mirror, etc.

    Had been confused by mirrors a few times though, as part of my visual
    processing doesn't want to see the mirror as a reflection of the
    current space, but as a separate space that extends behind the
    mirror. At a basic level, my visual processing doesn't seem to
    recognize myself, but my upper-level thoughts do. Moving myself via a
    reflected image is sometimes annoyingly awkward.

    Though, there are enough cultural references to there being worlds
    behind mirrors, and the mirror person being a sort of doppelganger from
    that world that merely parrots ones' actions, that it is likely that
    this particular visual glitch is a fairly common experience.


    But, yeah, seemingly a few of my sub-personas being like:
      1. One is more pessimistic, reason-oriented.
         Skeptical of anything that can't be verified with evidence.
         This one tends to be one of my main outward-facing personas.
         Also the one that tends to deal the most with natural language.
         Tends actually to be doubtful of religion,
         but mostly keeps quiet about it.
         If this one expresses its doubts, another (2) gets offended.
      2. Another is more optimistic, emotions-oriented, and religious.
         Tends more often to exist in the background.
         Often more of a "leads from the rear" role.
         Is prone to say things the others don't understand,
         like, a lot of symbolism and stuff.
         This one's grasp on reality is weaker,
         and sometimes prone to New-Age style thinking.
         This one is more easily offended than the others.
      3. A third is the one that is mostly into programming.
         Likes to deep-dive into technical rabbit holes.
         Doesn't really understand emotions and is skeptical of them.
         Little interest in religious matters either way.
         Often little interest other than chasing technical rabbit holes,
         and nit-picking things the others are saying when ambiguous.
         This one really doesn't like things that are ambiguous.
         This one tends to be my secondary outward-facing persona.

    Seemingly, the first two are stronger in terms of ability to control
    outward behavior, though the third is usually more active. The first
    two seem to swap places, and usually can't both be active in the
    outward-facing persona at the same time (and these two dislike each
    other).

    So:
    1,3: Typical configuration.
    1,2: Less common.
    2,3: Rare, but has happened.

    The personas mostly hand off between each other and try to keep
    things consistent, but it is a question of whether people notice. It
    is rare to get all 3 to fully agree on anything, or to think about
    the same things. Usually, there is a 2/3 majority thing (if 2 out of
    3 can agree on something, it is the working decision).

    Then there is some amount of dealing with sensory processing, physical
    control tasks, dealing with natural language, etc.

    Stuff often isn't quite as seamless or automatic as would usually seem
    to be implied.

    It seems that it is 1 that mostly controls natural-language stuff,
    and if 1 is not active then language processing needs to be rerouted;
    usually the part of my mind that deals with programming then also has
    to deal with English and similar (normally these are separate, with
    English language and code existing in separate places).

    Well, sub-note (as 1, writing this): it is difficult to talk about
    myself in the third person this way; a lot of the external
    description of 1 is being provided by 3 (3 disagrees with the
    numbering, thinking he should be 1, but I am the one actually doing
    the writing, so I am 1 here, and 2 agrees with my numbering scheme).
    Here 3 is complaining about 1 tending to get upset and "crash out"
    sometimes, leaving him to take on 1's role (annoying as that is), so
    he is the main one left to "keep the ship moving" (as 2 would
    describe it); seemingly both 2 and 3 agree that 1 isn't always all
    that reliable here.


    Had noted before that I can seemingly partially decouple my left and
    right sides, in which case:
    Persona 1 seems to take control of right hand;
    Personas 2 and 3 have control of left hand.
    Seemingly visual space and inner-world also partly splits.
    Like, a wall forms and I am left with two mental spaces.

    But, this is weird, and usually better to leave the two sides connected
    (less awkward and more efficient).
    In my normal mode, left hand is dominant (despite persona 1's right-hand preference).


    But, yeah, probably sounds kinda crazy/weird, but, alas...

    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jan 5 22:18:43 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 1/5/2026 5:22 AM, Michael S wrote:
    On Mon, 5 Jan 2026 04:09:34 -0600
    BGB <cr88192@gmail.com> wrote:

    On 1/4/2026 4:30 PM, MitchAlsup wrote:
    ---------------------
    I am 100% sure it is not as conveyed by Dickens !

    Looking it up, I am not sure what Dickens was going on about.



    "A wonderful fact to reflect upon, that every human creature is
    constituted to be that profound secret and mystery to every other."

    I suppose that Sartre wanted to say something else by "L'enfer, c'est
    les autres" ("Hell is other people"). Maybe an exact opposite.


    It gets confusing, but alas will at least admit that I don't understand
    most people.


    As I can note, my mind seems to be partly divided into multiple
    sub-personas, with different properties, but it is awkward to
    describe them. Socially, people are expected to see themselves as a
    singular entity; admittedly, whatever this notion of "singularness"
    is, it isn't particularly strong in my case. I would still be
    classified as self-aware though, as I do recognize my reflection in a
    mirror, etc.

    Sane people think there is a large gap between sane and insane.

    We KNOW otherwise .....
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Jan 5 22:33:55 2026
    From Newsgroup: comp.arch

    On 1/5/2026 4:18 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 1/5/2026 5:22 AM, Michael S wrote:
    On Mon, 5 Jan 2026 04:09:34 -0600
    BGB <cr88192@gmail.com> wrote:

    On 1/4/2026 4:30 PM, MitchAlsup wrote:
    ---------------------
    I am 100% sure it is not as conveyed by Dickens !

    Looking it up, I am not sure what Dickens was going on about.



    "A wonderful fact to reflect upon, that every human creature is
    constituted to be that profound secret and mystery to every other."

    I suppose that Sartre wanted to say something else by "L'enfer, c'est
    les autres" ("Hell is other people"). Maybe an exact opposite.


    It gets confusing, but alas will at least admit that I don't understand
    most people.


    As I can note, my mind seems to be partly divided into multiple
    sub-personas, with different properties, but it is awkward to
    describe them. Socially, people are expected to see themselves as a
    singular entity; admittedly, whatever this notion of "singularness"
    is, it isn't particularly strong in my case. I would still be
    classified as self-aware though, as I do recognize my reflection in a
    mirror, etc.

    Sane people think there is a large gap between sane and insane.

    We KNOW otherwise .....


    Probably true, but it is seemingly neither psychopathy nor
    schizophrenia, these being the two major bad ones...


    Then again, how bad ASD is (AKA: autism) likely depends on who you ask.


    After seeing what Grok had to say about a lot of this, I got back roughly:
      Likely ASD with some ADHD-like features;
      VSS (Visual Snow Syndrome);
      Mild dissociation.

    Also descriptions:
      Personas 1 and 3 likely exist in the left hemisphere,
        Persona 1 is likely mainly in the frontal lobe;
          Strongly associated with frontal lobe functions;
        Persona 3 is likely mainly in the parietal lobe;
          Strongly associated with parietal lobe functions.
        Possibly they represent a split between the dorsal and ventral
        streams.
      Persona 2 is strongly associated with right hemisphere functions.

    There is likely anomalous behavior in the thalamus, corpus callosum,
    and occipital lobe; with some features likely tied to excessive
    computer use (apparently using a computer so much that the visual
    system starts adapting to specific patterns within the UI rather than
    to more "natural" patterns).

    Or, some of it is possibly a side effect of spending a significant
    part of one's waking lifespan looking at text editors?...


    So, seemingly this isn't quite the same as DID / MPD, in that it is more
    like brain regions and pathways starting to operate partially
    independently of each other and forming their own semi-integrated
    experiences partially separate from those of the "greater self".

    Apparently, looking into it:
      Seeing noise and other artifacts in visual perception;
      Palinopsia (seeing trails behind things, etc);
      Photosensitivity / photophobia issues;
      Tinnitus;
      etc.
    These all being associated with VSS, which seems mostly consistent
    with my experience.

    Well, and apparently both the sensory filtering (related to VSS) and
    the large-scale integration functions are handled by the thalamus (so
    possibly something has gone a little weird there).

    Not like any of this is particularly new.

    ...



    Otherwise, I got around to getting the vector-math stuff working for
    RV64G in BGBCC (and with that, got GLQuake working in BGBCC's RV64
    mode). In the basic RV64G mode, it exists mainly as scalar
    instructions and runtime calls. There is the relative crappiness of
    handling 128-bit vectors by dumping them to RAM and then reloading
    elements, doing math, storing elements back, and reloading the result
    vector on return (the GPR/FPR split makes things a pain, and in this
    case going through RAM was less of a hassle). A lot of the runtime
    functions are still missing here though (need to implement every
    operator over every vector type; not done yet, I just did the ones I
    needed for TKRA-GL).


    I am debating whether to split the 64-bit and 128-bit SIMD cases for
    BGBCC's SIMD support in RV64 mode. Allowing for a 64-bit-only
    implementation is potentially cheaper to implement, but would be more
    hassle to deal with in the compiler (though, in this case, trying to
    do a 128-bit op without 128-bit SIMD would just mean splitting the
    instruction into two 64-bit ops).

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Jan 6 01:22:55 2026
    From Newsgroup: comp.arch

    On 2026-01-05 11:33 p.m., BGB wrote:
    On 1/5/2026 4:18 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 1/5/2026 5:22 AM, Michael S wrote:
    On Mon, 5 Jan 2026 04:09:34 -0600
    BGB <cr88192@gmail.com> wrote:

    On 1/4/2026 4:30 PM, MitchAlsup wrote:
    ---------------------
    I am 100% sure it is not as conveyed by Dickens !

    Looking it up, I am not sure what Dickens was going on about.



    "A wonderful fact to reflect upon, that every human creature is
    constituted to be that profound secret and mystery to every other."

    I suppose that Sartre wanted to say something else by "L'enfer, c'est
    les autres" ("Hell is other people"). Maybe an exact opposite.


    It gets confusing, but alas will at least admit that I don't understand
    most people.


    As I can note, my mind seems to be partly divided into multiple
    sub-personas, with different properties, but it is awkward to
    describe them. Socially, people are expected to see themselves as a
    singular entity; admittedly, whatever this notion of "singularness"
    is, it isn't particularly strong in my case. I would still be
    classified as self-aware though, as I do recognize my reflection in a
    mirror, etc.

    Sane people think there is a large gap between sane and insane.

    We KNOW otherwise .....


    Probably true, but seemingly neither psychopathy nor schizophrenia,
    where these are the two major bad ones...


    Then again, how bad ASD is (AKA: autism) likely depends on who you ask.


    After seeing what Grok had to say about a lot of this, got back roughly:
      Likely ASD with some ADHD-like features;
      VSS (Visual Snow Syndrome);
      Mild dissociation.

    Also descriptions:
      Personas 1 and 3 likely exist in the left hemisphere,
        Persona 1 is likely mainly in the frontal lobe;
          Strongly associated with frontal lobe functions;
        Persona 3 is likely mainly in the parietal lobe;
          Strongly associated with parietal lobe functions.
        Possibly they represent a split between the dorsal and ventral
        streams.
      Persona 2 is strongly associated with right hemisphere functions.

    There is likely anomalous behavior in the thalamus, corpus callosum,
    and occipital lobe; with some features likely tied to excessive
    computer use (apparently using a computer so much that the visual
    system starts adapting to specific patterns within the UI rather than
    to more "natural" patterns).

    Or, some of it is possibly a side effect of spending a significant
    part of one's waking lifespan looking at text editors?...


    So, seemingly this isn't quite the same as DID / MPD, in that it is more like brain regions and pathways starting to operate partially
    independently of each other and forming their own semi-integrated experiences partially separate from those of the "greater self".

    Apparently looking into it:
      Seeing noise and other artifacts in visual perception;
      Palinopsia (seeing trails behind things, etc);
      Photosensitivity / photophobia issues;
      Tinnitus;
      etc.
    These all being associated with VSS, which seems mostly consistent
    with my experience.

    Well, and apparently both the sensory filtering (related to VSS) and large-scale integration functions are both handled by the thalamus (so possibly something has gone a little weird there).

    Not like any of this is particularly new.

    ...



    Otherwise, got around to getting the vector-math stuff working for RV64G
    in BGBCC (and with that, got GLQuake working in BGBCC's RV64 mode). In
    the basic RV64G mode, it exists mainly as scalar instructions and
    runtime calls. There is the relative crappiness of handling 128-bit
    vectors by dumping them to RAM and then reloading elements, doing math, storing elements back, and reloading the result vector on return (the GPR/FPR split makes things a pain, and in this case going through RAM
    was less of a hassle). Still a lot of the runtime functions are still missing here though (need to implement every operator over every vector type, still not done, I just did the ones I needed for TKRA-GL).


    I am debating whether to split the 64-bit and 128-bit SIMD cases for
    BGBCC's SIMD support in RV64 mode. Allowing for a 64-bit only
    implementation is potentially cheaper on the implementation, but would
    be more hassle to deal with in the compiler (though, in this case,
    trying to do a 128-bit op without 128-bit SIMD would just mean splitting
    the instruction into two 64-bit ops).

    ...


    I have my own set of mental health issues; many people do. I was told
    by an army doctor that they worry more about people being too sane
    than insane. Everybody is at least a little insane.

    Qupls4 has quasi-variable-length instructions by counting postfixes
    as part of the instruction. They get processed and executed as if
    they were part of the instruction. Since the constant-routing
    information is in the postfix and not the instruction, it is not
    possible to disable trailing decoders from the lead one. The decoders
    just treat postfixes as NOPs, and NOPs do not make it through
    dispatch.

    Working on the renamer today. A characteristic it has is that it
    delays the availability of freed register tags by 15 clock cycles so
    that register tags are not reused too quickly. There was a pipelining
    issue where a register tag was used again before the pipeline
    recognized it as free; it takes a couple of clock cycles due to
    pipelining. Not using a FIFO-based renamer, as I could not get it to
    work in a corner case: the renamer would supply too-recently-used
    tags when the FIFO was empty. There are also other issues making a
    FIFO tricky to use.

    For Qupls4, an allocation bitmap is used, with find-first-one logic,
    plus an SRL-based delay line for freed tags.
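
    A rough behavioral C sketch of the freed-tag delay line just
    described (the NTAGS value and function names are my assumptions,
    and this models only the SRL-style shift plus bitmap allocation, not
    the full renamer):

      #include <stdint.h>

      #define DELAY 15   /* cycles a freed tag is withheld, per the post */
      #define NTAGS 64   /* physical-register tag count: my assumption */

      static uint64_t alloc_map;     /* bit i set => tag i in use */
      static int delay_line[DELAY];  /* freed tags in flight; -1 = empty */

      void renamer_init(void)
      {
          alloc_map = 0;
          for (int i = 0; i < DELAY; i++)
              delay_line[i] = -1;
      }

      /* one call per clock: the tag freed DELAY cycles ago re-enters the
         free pool; this cycle's freed tag (or -1) enters the head */
      void renamer_tick(int freed_tag)
      {
          int retiring = delay_line[DELAY - 1];
          if (retiring >= 0)
              alloc_map &= ~(1ull << retiring);  /* now safe to reuse */
          for (int i = DELAY - 1; i > 0; i--)    /* shift, like an SRL */
              delay_line[i] = delay_line[i - 1];
          delay_line[0] = freed_tag;
      }

      /* allocate: first-free scan of the bitmap (find-first-one in HW) */
      int renamer_alloc(void)
      {
          for (int i = 0; i < NTAGS; i++) {
              if (!(alloc_map & (1ull << i))) {
                  alloc_map |= 1ull << i;
                  return i;
              }
          }
          return -1;   /* no free tag: rename must stall */
      }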

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Jan 6 13:20:25 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    When I looked deeply into the situation, it was easier in HW to do::

    for( i = 0; i < 8; i++ )
    out[field[i]] = in[i]

    than::
    for( i = 0; i < 8; i++ )
    out[i] = in[field[i]]


    That isn't really that surprising:

    This way the inputs are available early and in sequential order, while
    the stores can be allowed to have higher latency, right?

    For some reason we called this swizzle not permute !?!

    I'm assuming collisions would be disallowed? I.e. you can use it to
    splat a single input into all output slots, but you cannot target
    multiple inputs toward the same destination.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 17:57:28 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:
    When I looked deeply into the situation, it was easier in HW to do::

    for( i = 0; i < 8; i++ )
    out[field[i]] = in[i]

    than::
    for( i = 0; i < 8; i++ )
    out[i] = in[field[i]]


    That isn't really that surprising:

    This way the inputs are available early and in sequential order, while
    the stores can be allowed to have higher latency, right?

    For some reason we called this swizzle not permute !?!

    I'm assuming collisions would be disallowed? I.e. you can use it to
    splat a single input into all output slots, but you cannot target
    multiple inputs toward the same destination.

    The latter is why the HW logic is significantly easier.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Jan 6 12:43:34 2026
    From Newsgroup: comp.arch

    On 1/6/2026 11:57 AM, MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:
    When I looked deeply into the situation, it was easier in HW to do::

    for( i = 0; i < 8; i++ )
    out[field[i]] = in[i]

    than::
    for( i = 0; i < 8; i++ )
    out[i] = in[field[i]]


    That isn't really that surprising:

    This way the inputs are available early and in sequential order, while
    the stores can be allowed to have higher latency, right?

    For some reason we called this swizzle not permute !?!

    I'm assuming collisions would be disallowed? I.e. you can use it to
    splat a single input into all output slots, but you cannot target
    multiple inputs toward the same destination.

    The latter is why the HW logic is significantly easier.

    OK, but this does mean that the usability would be somewhat limited,
    and it couldn't be used to generate the same sorts of repeating
    pattern fills needed for LZ decompression.


    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 19:37:49 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 1/6/2026 11:57 AM, MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:
    When I looked deeply into the situation, it was easier in HW to do::

    for( i = 0; i < 8; i++ )
    out[field[i]] = in[i]

    than::
    for( i = 0; i < 8; i++ )
    out[i] = in[field[i]]


    That isn't really that surprising:

    This way the inputs are available early and in sequential order, while
    the stores can be allowed to have higher latency, right?

    For some reason we called this swizzle not permute !?!

    I'm assuming collisions would be disallowed? I.e. you can use it to
    splat a single input into all output slots, but you cannot target
    multiple inputs toward the same destination.

The latter is why the HW logic is significantly easier.

OK, but this does mean the instruction would be somewhat limited in usability, and couldn't be used to generate the sorts of repeating pattern fills needed for LZ decompression.

    Field[i] was a constant generated by the compiler.

    Do GPUs do much LZ decompression ??




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Jan 6 16:42:29 2026
    From Newsgroup: comp.arch

    On 12/31/2025 2:23 AM, Robert Finch wrote:
    <snip>

    One would argue that maybe prefixes are themselves wonky, but
    otherwise one needs:
    Instructions that can directly encode the presence of large immediate
    values, etc;
    Or, the use of suffix-encodings (which is IMHO worse than prefix
    encodings; at least prefix encodings make intuitive sense if one views
    the instruction stream as linear, whereas suffixes add weirdness and
    are effectively retro-causal, and for any fetch to be safe at the end
    of a cache line one would need to prove the non-existence of a suffix;
    so better to not go there).

    I agree with this. Prefixes seem more natural, large numbers expanding
    to the left, suffixes seem like a big-endian approach. But I use
    suffixes for large constants. I think with most VLI constant data
    follows the instruction. I find constant data easier to work with that
    way and they can be processed in the same clock cycle as a decode so
    they do not add to the dynamic instruction count. Just pass the current instruction slot plus a following area of the cache-line to the decoder.


    ID stage is likely too late.

    For PC advance, ideally this needs to be known by the IF stage so that
    we can know how to advance PC for the next clock-cycle (for the PF stage).

    Say:
PF IF ID RF E1 E2 E3 WB
   PF IF ID RF E1 E2 E3 WB
      PF IF ID RF E1 E2 E3 WB

So, each IF stage produces an updated PC that needs to reach PF within
the same clock cycle (so the SRAMs can fetch data for the correct
cache line, which is latched on a clock edge).

This may also need to mux PCs from things like the branch-predictor and branch-initiation logic, which then override the normal PC+Step handling generated from the IF->PF path (also typically at a low latency).
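As a sketch of that selection step (signal names hypothetical), the
whole chain has to settle within one cycle so the I$ SRAM latches the
right line address at the next clock edge:

    #include <stdint.h>

    /* One cycle of next-PC selection, highest priority first. */
    uint64_t next_pc(uint64_t pc, uint64_t step,  /* from IF: PC+Step */
                     int br_taken, uint64_t br_target,    /* branch init */
                     int bp_redirect, uint64_t bp_target) /* predictor */
    {
        if (br_taken)    return br_target;   /* resolved branch overrides */
        if (bp_redirect) return bp_target;   /* predicted-taken redirect */
        return pc + step;                    /* sequential: step from IF */
    }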


    In this case, the end of the IF stage also handles some amount of repacking;
  Possible:
    Right-justifying the fetched instructions;
    16 -> 32 bit repacking (for RV-C)
  Current:
    Renormalization of XG1/XG2/XG3 into the same internal scheme;
    Repacking 48-bit RISC-V ops into internal 64-bit forms;
    ...

    As a partial result of this repacking, the instruction words effectively
    gain a few extra bits (the "internal normalized format" no longer
    fitting entirely into a 32-bit word; where one could almost see it as a
    sort of "extended instruction" that includes both ISAs in a single slightly-larger virtual instruction word).


    One could go further and try to re-normalize the full instruction
    layout, but as noted XG3 and RV would still differ enough as to make
    this annoying (mostly the different encoding spaces and immed formats).

    * zzzzzzz-ooooo-mmmmm-zzz-nnnnn-yy-yyy11
    * zzzz-oooooo-mmmmmm-zzzz-nnnnnn-yy-yyPw


    With a possible normalized format (36-bit):
    * zzzzzzz-oooooo-mmmmmm-zzzz-nnnnnn-yyyyyPw
    * zzzzzzz-0ooooo-0mmmmm-yzzz-0nnnnn-1yyyy10 (RV Repack)
    * 000zzzz-oooooo-mmmmmm-zzzz-nnnnnn-0yyyyPw (XG3 Repack)

    Couldn't fully unify the encoding space within a single clock cycle
    though (within a reasonable cost budget).


    At present, the decoder handling is to essentially unify the 32-bit
    format for XG1/XG2/XG3 as XG2 with a few tag bits to disambiguate which
    ISA decoding rules should apply for the 32-bit instruction word in
    question. The other option would have been to normalize as XG3, but XG3
    loses some minor functionality from XG1 and XG2.


I also decided against allowing RV and XG3 jumbo prefixes to be mixed.
    Though, it is possible exceptions could be made.

    Wouldn't have needed J52I if XG3 prefixes could have been used with RV
    ops, but can't use XG3 prefixes in RV-C mode, which is part of why I
    ended up resorting to the J52I prefix hack. But, still doesn't fully
    address the issues that exist with hot-patching in this mode.


    Though, looking at options, the "cheapest but fastest" option at present likely being:
    Core that only does XG3, possibly dropping the RV encodings and
    re-adding WEX in its place (though, in such an XG3-Only mode, the 10/11
    modes would otherwise be identical in terms of encoding).

    Or, basically, XG3 being used in a way more like how XG2 was used.

But, don't really want to create yet more modes at the moment. XG3
used as superscalar isn't too much more expensive, and is arguably more
flexible: the compiler doesn't need to be aware of pipeline scheduling
specifics, but can still shuffle instructions around for efficiency (a
mismatch will then merely result in a small reduction in efficiency
rather than a potential inability of the code to run; though XG2 had
the feature that the CPU could fall back to scalar, or potentially
superscalar, operation in cases where the compiler's bundling was
incompatible with what the CPU allowed).

    So, it is possible that in-order superscalar may be better as a general purpose option even if not strictly the cheapest option.


    A case could maybe be made arguing for dropping back down to 32 GPRs
    (with no FPRs) for more cheapness, but as-is, trying to do 128-bit SIMD
    stuff in RV64 mode also tends to quickly run into issues with register pressure.

    Well, and I was just recently having to partly rework the mechanism for:
  v = (__vec4f) { x, y, z, w };
    To not try to load all the registers at the same time, as this was occasionally running out of free dynamic registers with the normal RV
    ABI (and 12 callee-save FPRs doesn't go quite so far when allocating
    pairs of them), which effectively causes the compiler to break.


    It is almost tempting to consider switching RV64 over to the XG3 ABI
    when using SIMD, well, and/or not use SIMD with RV64 because it kinda
    sucks worse than XG3.

    But... Comparably, for the TKRA-GL front-end (using syscalls for the back-end), using runtime calls and similar for vector operations does
    still put a big dent in the framerate for GLQuake (so, some sort of SIMD
    in RV mode may still be needed even if "kinda inferior").


    Handling suffixes at the end of a cache-line is not too bad if the cache already handles instructions spanning a cache line. Assume the maximum number of suffixes is present and ensure the cache-line is wide enough.
    Or limit the number of suffixes so they fit into the half cache-line
    used for spanning.


    Difference:
    With a prefix, you know in advance the prefix exists (the prefix is immediately visible);
With a suffix, you can only know it exists once it is visible.

    So, it poses similar issues to those in making superscalar fetch work
    across cache lines, which is made more of a challenge if one wants to
    make superscalar across line boundaries work during I$ miss handling
    rather than during instruction fetch.

    But, if the logic is able to run at I$ miss time, ideally it can see
    whether or not it is a single-wide or multi-wide instruction (say,
    because this logic runs only on a single cache line at a time).


    It is easier to handle interrupts with suffixes. The suffix can just be treated as a NOP. Adjusting the position of the hardware interrupt to
    the start of an instruction then does not have to worry about accounting
    for a prefix / suffix.


    Usual behavior is that when an interrupt occurs, SPC or similar always
    points at a valid PC, namely one from the pipeline, and usually/ideally,
    the exact instruction on which the fault occurred (though this does get
    a little fiddly with multiple instructions in the pipeline, and the CPU sometimes needs to figure out which stage corresponds to the fault in question; but this is usually more of an issue for I$ TLB misses, which often/usually trigger during a branch).



    For the most part, superscalar works the same either way, with similar
    efficiency. There is a slight efficiency boost if it would be possible
    to dynamically reshuffle ops during fetch. But, this is not currently
    a thing in my case.

This latter case would apply if, say, a MEM op is followed by
non-dependent ALU ops, which under current superscalar handling will
not co-execute, but which could in theory be swapped and allowed to
co-execute.


    ...





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 23:49:05 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    <snip>

    One would argue that maybe prefixes are themselves wonky, but otherwise one needs:
    Instructions that can directly encode the presence of large immediate values, etc;

    This is the direction of My 66000.

    The instruction stream is a linear stream of words.
    The first word of each instruction encodes its total length.
    What follows the instruction itself are merely constants used as
    operands in the instruction itself. All constants are 1 or 2
    words in length.

I would not call this scheme "prefixed" or "suffixed". Generally,
prefixes and suffixes consume bits of the prefix/suffix so that
the constant (in my case) is not equal to container size. This
leads to wonky operand/displacement sizes not equal to 2^(3+k).
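A generic C model of that property; the length-field position below is
invented for illustration, not My 66000's actual encoding:

    #include <stddef.h>
    #include <stdint.h>

    /* "First word encodes total length": trailing words are raw
       constants, so they keep full 32/64-bit width instead of losing
       bits to prefix/suffix framing. The field and its width here
       are made up: 0..3 trailing constant words. */
    size_t insn_words(uint32_t first_word) {
        return 1 + ((first_word >> 28) & 3);
    }

    /* Walking the stream needs only one word per step, never a
       peek ahead or a look behind. */
    size_t count_insns(const uint32_t *w, size_t nwords) {
        size_t n = 0, i = 0;
        while (i < nwords) {
            i += insn_words(w[i]);
            n++;
        }
        return n;
    }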

    Or, the use of suffix-encodings (which is IMHO worse than prefix encodings; at least prefix encodings make intuitive sense if one views
    the instruction stream as linear, whereas suffixes add weirdness and are effectively retro-causal, and for any fetch to be safe at the end of a cache line one would need to prove the non-existence of a suffix; so better to not go there).

    I agree with this. Prefixes seem more natural, large numbers expanding
    to the left, suffixes seem like a big-endian approach. But I use
    suffixes for large constants. I think with most VLI constant data
    follows the instruction.

    But not "self identified".

    I find constant data easier to work with that
    way and they can be processed in the same clock cycle as a decode so
    they do not add to the dynamic instruction count. Just pass the current instruction slot plus a following area of the cache-line to the decoder.

    Handling suffixes at the end of a cache-line is not too bad if the cache already handles instructions spanning a cache line. Assume the maximum number of suffixes is present and ensure the cache-line is wide enough.
    Or limit the number of suffixes so they fit into the half cache-line
    used for spanning.

    It is easier to handle interrupts with suffixes. The suffix can just be treated as a NOP. Adjusting the position of the hardware interrupt to
    the start of an instruction then does not have to worry about accounting
    for a prefix / suffix.

    I would have thought that the previous instruction (last one retired) would provide the starting point of the subsequent instruction. This way you don't have to worry about counting prefixes or suffixes.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Jan 6 20:34:53 2026
    From Newsgroup: comp.arch

    On 2026-01-06 5:42 p.m., BGB wrote:
    On 12/31/2025 2:23 AM, Robert Finch wrote:
    <snip>

    One would argue that maybe prefixes are themselves wonky, but
    otherwise one needs:
    Instructions that can directly encode the presence of large immediate
    values, etc;
    Or, the use of suffix-encodings (which is IMHO worse than prefix
    encodings; at least prefix encodings make intuitive sense if one
    views the instruction stream as linear, whereas suffixes add
    weirdness and are effectively retro-causal, and for any fetch to be
    safe at the end of a cache line one would need to prove the non-
    existence of a suffix; so better to not go there).

    I agree with this. Prefixes seem more natural, large numbers expanding
    to the left, suffixes seem like a big-endian approach. But I use
    suffixes for large constants. I think with most VLI constant data
    follows the instruction. I find constant data easier to work with that
    way and they can be processed in the same clock cycle as a decode so
    they do not add to the dynamic instruction count. Just pass the
    current instruction slot plus a following area of the cache-line to
    the decoder.


    ID stage is likely too late.

    For PC advance, ideally this needs to be known by the IF stage so that
    we can know how to advance PC for the next clock-cycle (for the PF stage).

    Say:
PF IF ID RF E1 E2 E3 WB
   PF IF ID RF E1 E2 E3 WB
      PF IF ID RF E1 E2 E3 WB

    The PC advance works okay without knowing whether there is a suffix
    present or not. The suffix is treated like a NOP instruction. There is
    no decode required at the fetch stage. The PC can land on a suffix. It
    just always advances by four (N) instructions unless there is a branch.

So, each IF stage produces an updated PC that needs to reach PF within
the same clock cycle (so the SRAMs can fetch data for the correct cache
line, which is latched on a clock edge).

    This may also need to MUX PC's from things like the branch-predictor and branch-initiation logic, which then override the normal PC+Step handling generated from the IF->PF path (also typically at a low latency).


    In this case, the end of the IF stage also handles some amount of
    repacking;
  Possible:
    Right-justifying the fetched instructions;
    16 -> 32 bit repacking (for RV-C)
  Current:
    Renormalization of XG1/XG2/XG3 into the same internal scheme;
    Repacking 48-bit RISC-V ops into internal 64-bit forms;
    ...

    As a partial result of this repacking, the instruction words effectively gain a few extra bits (the "internal normalized format" no longer
    fitting entirely into a 32-bit word; where one could almost see it as a
    sort of "extended instruction" that includes both ISAs in a single slightly-larger virtual instruction word).


    One could go further and try to re-normalize the full instruction
    layout, but as noted XG3 and RV would still differ enough as to make
    this annoying (mostly the different encoding spaces and immed formats).

    * zzzzzzz-ooooo-mmmmm-zzz-nnnnn-yy-yyy11
    * zzzz-oooooo-mmmmmm-zzzz-nnnnnn-yy-yyPw


    With a possible normalized format (36-bit):
    * zzzzzzz-oooooo-mmmmmm-zzzz-nnnnnn-yyyyyPw
    * zzzzzzz-0ooooo-0mmmmm-yzzz-0nnnnn-1yyyy10 (RV Repack)
    * 000zzzz-oooooo-mmmmmm-zzzz-nnnnnn-0yyyyPw (XG3 Repack)

    Couldn't fully unify the encoding space within a single clock cycle
    though (within a reasonable cost budget).


    At present, the decoder handling is to essentially unify the 32-bit
    format for XG1/XG2/XG3 as XG2 with a few tag bits to disambiguate which
    ISA decoding rules should apply for the 32-bit instruction word in
    question. The other option would have been to normalize as XG3, but XG3 loses some minor functionality from XG1 and XG2.


    I also went against allowing RV and XG3 jumbo prefixes to be mixed.
    Though, it is possible exceptions could be made.

    Wouldn't have needed J52I if XG3 prefixes could have been used with RV
    ops, but can't use XG3 prefixes in RV-C mode, which is part of why I
    ended up resorting to the J52I prefix hack. But, still doesn't fully
    address the issues that exist with hot-patching in this mode.


    Though, looking at options, the "cheapest but fastest" option at present likely being:
    Core that only does XG3, possibly dropping the RV encodings and re-
    adding WEX in its place (though, in such an XG3-Only mode, the 10/11
    modes would otherwise be identical in terms of encoding).

    Or, basically, XG3 being used in a way more like how XG2 was used.

    But, don't really want to create yet-more modes at the moment. XG3 being used as superscalar isn't too much more expensive, and arguably more flexible given the compiler doesn't need to be aware of pipeline
    scheduling specifics, but can still make use of this when trying to
    shuffle instructions around for efficiency (a mismatch will then merely result in a small reduction in efficiency rather than a potential
    inability of the code to run; though for XG2 there was the feature that
    the CPU could fall back to scalar or potential superscalar operation in cases where the compiler's bundling was incompatible with what the CPU allowed).

    So, it is possible that in-order superscalar may be better as a general purpose option even if not strictly the cheapest option.


    A case could maybe be made arguing for dropping back down to 32 GPRs
    (with no FPRs) for more cheapness, but as-is, trying to do 128-bit SIMD stuff in RV64 mode also tends to quickly run into issues with register pressure.

    Well, and I was just recently having to partly rework the mechanism for:
  v = (__vec4f) { x, y, z, w };
    To not try to load all the registers at the same time, as this was occasionally running out of free dynamic registers with the normal RV
    ABI (and 12 callee-save FPRs doesn't go quite so far when allocating
    pairs of them), which effectively causes the compiler to break.


    It is almost tempting to consider switching RV64 over to the XG3 ABI
    when using SIMD, well, and/or not use SIMD with RV64 because it kinda
    sucks worse than XG3.

    But... Comparably, for the TKRA-GL front-end (using syscalls for the back-end), using runtime calls and similar for vector operations does
    still put a big dent in the framerate for GLQuake (so, some sort of SIMD
    in RV mode may still be needed even if "kinda inferior").


    Handling suffixes at the end of a cache-line is not too bad if the
    cache already handles instructions spanning a cache line. Assume the
    maximum number of suffixes is present and ensure the cache-line is
    wide enough. Or limit the number of suffixes so they fit into the half
    cache-line used for spanning.


    Difference:
    With a prefix, you know in advance the prefix exists (the prefix is immediately visible);
    With a suffix, can only know it exists if it is visible.

Instead, with a prefix one only knows the instruction exists if it is
visible. I do not think it makes much difference, except that it may be
harder to decode an instruction while also looking for a prefix that
comes before it.


    So, it poses similar issues to those in making superscalar fetch work
    across cache lines, which is made more of a challenge if one wants to
    make superscalar across line boundaries work during I$ miss handling
    rather than during instruction fetch.

    But, if the logic is able to run at I$ miss time, ideally it can see
    whether or not it is a single-wide or multi-wide instruction (say,
    because this logic runs only on a single cache line at a time).


    It is easier to handle interrupts with suffixes. The suffix can just
    be treated as a NOP. Adjusting the position of the hardware interrupt
    to the start of an instruction then does not have to worry about
    accounting for a prefix / suffix.


    Usual behavior is that when an interrupt occurs, SPC or similar always points at a valid PC, namely one from the pipeline, and usually/ideally,
    the exact instruction on which the fault occurred (though this does get
    a little fiddly with multiple instructions in the pipeline, and the CPU sometimes needs to figure out which stage corresponds to the fault in question; but this is usually more of an issue for I$ TLB misses, which often/usually trigger during a branch).



    For the most part, superscalar works the same either way, with
    similar efficiency. There is a slight efficiency boost if it would be
    possible to dynamically reshuffle ops during fetch. But, this is not
    currently a thing in my case.

    This latter case would apply if, say, a MEM op is followed by non-
    dependent ALU ops, which under current superscalar handling they will
    not co-execute, but it could be possible in theory to swap the ops
    and allow them to co-execute.


    ...






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Jan 6 23:27:28 2026
    From Newsgroup: comp.arch

    On 1/6/2026 7:34 PM, Robert Finch wrote:
    On 2026-01-06 5:42 p.m., BGB wrote:
    On 12/31/2025 2:23 AM, Robert Finch wrote:
    <snip>

    One would argue that maybe prefixes are themselves wonky, but
    otherwise one needs:
    Instructions that can directly encode the presence of large
    immediate values, etc;
    Or, the use of suffix-encodings (which is IMHO worse than prefix
    encodings; at least prefix encodings make intuitive sense if one
    views the instruction stream as linear, whereas suffixes add
    weirdness and are effectively retro-causal, and for any fetch to be
    safe at the end of a cache line one would need to prove the non-
    existence of a suffix; so better to not go there).

    I agree with this. Prefixes seem more natural, large numbers
    expanding to the left, suffixes seem like a big-endian approach. But
    I use suffixes for large constants. I think with most VLI constant
    data follows the instruction. I find constant data easier to work
    with that way and they can be processed in the same clock cycle as a
    decode so they do not add to the dynamic instruction count. Just pass
    the current instruction slot plus a following area of the cache-line
    to the decoder.


    ID stage is likely too late.

    For PC advance, ideally this needs to be known by the IF stage so that
    we can know how to advance PC for the next clock-cycle (for the PF
    stage).

    Say:
PF IF ID RF E1 E2 E3 WB
   PF IF ID RF E1 E2 E3 WB
      PF IF ID RF E1 E2 E3 WB

    The PC advance works okay without knowing whether there is a suffix
    present or not. The suffix is treated like a NOP instruction. There is
    no decode required at the fetch stage. The PC can land on a suffix. It
    just always advances by four (N) instructions unless there is a branch.

    You don't want to burn extra clock cycles stepping over either prefixes
    or suffixes.


    In any case, for either prefixed or suffixed instructions, best to have
    the fetch/decode and PC advance handle the whole thing as a single
    larger instruction.

    Well, likewise for superscalar; which would be rendered moot if one
    spent multiple clock cycles stepping over instruction words.


    <snip>


    It is almost tempting to consider switching RV64 over to the XG3 ABI
    when using SIMD, well, and/or not use SIMD with RV64 because it kinda
    sucks worse than XG3.

    But... Comparably, for the TKRA-GL front-end (using syscalls for the
    back-end), using runtime calls and similar for vector operations does
    still put a big dent in the framerate for GLQuake (so, some sort of
    SIMD in RV mode may still be needed even if "kinda inferior").

    Well, anyways, have since gotten SIMD in RV64 mode working "slightly
    better" but it still kinda sucks vs SIMD in XG3.

    Though the scope is a lot more limited.
    It is RV64G FPU ops, but:
    FADD.S and similar are understood as 2x Binary32 vs 1x Binary32;
    Optional: A rounding mode can specify 4x Binary32 vectors.
    There are FPU PACK instructions:
    These pack two 32-bit low or high elements into a 64-bit result.
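A scalar C model of the lane semantics described above (the function
names are hypothetical, not the actual mnemonics):

    #include <stdint.h>
    #include <string.h>

    /* Model one 64-bit FPR as 2x Binary32 lanes. */
    typedef uint64_t fpr_t;

    static float get_lane(fpr_t r, int i) {
        uint32_t u = (uint32_t)(r >> (32 * i));
        float f;
        memcpy(&f, &u, 4);
        return f;
    }

    static fpr_t set_lane(fpr_t r, int i, float f) {
        uint32_t u;
        memcpy(&u, &f, 4);
        uint64_t m = 0xFFFFFFFFull << (32 * i);
        return (r & ~m) | ((uint64_t)u << (32 * i));
    }

    /* FADD.S reinterpreted as 2x Binary32: lane-wise add. */
    fpr_t fadd_2s(fpr_t a, fpr_t b) {
        fpr_t r = 0;
        for (int i = 0; i < 2; i++)
            r = set_lane(r, i, get_lane(a, i) + get_lane(b, i));
        return r;
    }

    /* PACK: combine the low 32-bit elements of two regs into one
       64-bit result (a's low element in the low lane). */
    fpr_t fpack_lo(fpr_t a, fpr_t b) {
        return (a & 0xFFFFFFFFull) | (b << 32);
    }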

    GLQuake .text size:
    XG3 : 570K
    RV64: 682K

    Framerate at start of "start.bsp" (50MHz):
    XG3 : 5 fps
    RV64: 3 fps
    Vs, plain RV64G:
    RV64: 1 fps (and around 800K of .text)

    Still beats a GCC build here, which rolls in at a good solid 0 fps.
Seemingly stuff can get wonky below 1 fps, though it's not entirely
obvious from the code why.


    Handling suffixes at the end of a cache-line is not too bad if the
    cache already handles instructions spanning a cache line. Assume the
    maximum number of suffixes is present and ensure the cache-line is
    wide enough. Or limit the number of suffixes so they fit into the
    half cache-line used for spanning.


    Difference:
    With a prefix, you know in advance the prefix exists (the prefix is
    immediately visible);
    With a suffix, can only know it exists if it is visible.

    Instead with a prefix one only knows the instruction exists if it is visible. I do not think it makes much difference. Except that it may be harder to decode an instruction and look for a prefix that comes before it.


    If PC somehow lands between the prefix and the instruction it applies
    to, then something has gone wrong.

    But, even then, checking if PC-4 is a jumbo prefix, or PC-8 is a J52I or similar, isn't exactly a hard task. CPU doesn't need to do this though,
    and if by some chance it does happen, "it is what it is".

    Still better than say:
    x86: Jumping into the middle of an instruction results in incoherent
    garbage from then on (automatic re-alignment will not happen);
    RV-C: Will typically re-align within 3-5 16-bit words.


    With both XG2/XG3 and RV64G+JX, branching into the middle of a prefixed instruction should resolve within 1 or 2 instruction words (32-bit).

Given that in these cases jumbo-prefixed encodings are the minority,
even starting at a random 32-bit aligned location is most likely to
land on a valid instruction boundary. Landing on the second word of a
misaligned J52I prefix will still realign within 2 words (reading one
garbage instruction, followed by the base instruction that the prefix
had been modifying; with a very low probability of reading a false
jumbo prefix).



    With XG1, it is most likely to realign within 3 16-bit words.
    It was more likely to realign quickly as '111' is more statistically
    selective than '11', and the probability of seeing multiple false-hit misaligned 32-bit instructions in a row was very low (and, the
    instruction format and encoding-space layout made decoding multiple consecutive misaligned 32-bit instructions very unlikely).

    While RV-C has a similar property (causing the instruction stream to
    realign), it is less selective due to the instruction formats so there
    is a higher probability of seeing two false 32-bit instructions in a row.

    In either case, not seeing a false 32-bit instruction would lead to it
    seeing a 16-bit instruction, which would then (almost inevitably)
    re-align the decoder.


    Within XG2 and XG3, one would not see a false jumbo-prefixed
    instruction, as the instruction stream is encoded as instruction words
    where prefixes and suffixes are mutually exclusive (so, in this way, it
    is almost more like UTF-8; where it can unambiguously re-align the
    byte-stream within a single codepoint).
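Making the UTF-8 comparison concrete: continuation bytes are all
tagged 10xxxxxx and no lead byte is, so a decoder dropped at a random
byte offset resynchronizes within one codepoint:

    #include <stddef.h>
    #include <stdint.h>

    /* Skip continuation bytes from an arbitrary offset; lands on a
       codepoint boundary within at most 3 bytes. The jumbo-prefix
       property described above is analogous, at 32-bit-word
       granularity. */
    size_t utf8_resync(const uint8_t *p, size_t n) {
        size_t i = 0;
        while (i < n && (p[i] & 0xC0) == 0x80)  /* continuation? */
            i++;
        return i;  /* offset of the next codepoint start */
    }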



    This property doesn't exist for x86 though, which would happily just
    result in reading a never-ending stream of misaligned garbage instructions.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jan 7 01:22:54 2026
    From Newsgroup: comp.arch

    On 1/6/2026 5:49 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    <snip>

    One would argue that maybe prefixes are themselves wonky, but otherwise
    one needs:
    Instructions that can directly encode the presence of large immediate
    values, etc;

    This is the direction of My 66000.

    The instruction stream is a linear stream of words.
    The first word of each instruction encodes its total length.
    What follows the instruction itself are merely constants used as
    operands in the instruction itself. All constants are 1 or 2
    words in length.

I would not call this scheme "prefixed" or "suffixed". Generally,
prefixes and suffixes consume bits of the prefix/suffix so that
the constant (in my case) is not equal to container size. This
leads to wonky operand/displacement sizes not equal to 2^(3+k).


    OK.

    As can be noted:
  XG2/3: Prefix scheme, 1/2/3 x 32-bit
    The 96-bit cases are determined by two prefixes.
    Requires looking at 2 words to know total length.
  RV64+Jx:
    Total length is known from the first instruction word:
      Base op: 32 bits;
      J21I: 64 bits
      J52I: 96 bits.
    There was a J22+J22+LUI special case,
      but I now consider this as deprecated.
      J52I+ADDI is now considered preferable.

    As for Imm/Disp sizes:
  XG1: 9/33/57
  XG2 and XG3: 10/33/64
  RV+JX: 12/33/64

For XG1, the 57-bit size was rarely used and only optionally supported,
mostly because of the great "crap-all of immediate values between 34
and 62 bits" gulf.


    Or, the use of suffix-encodings (which is IMHO worse than prefix
    encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are
effectively retro-causal, and for any fetch to be safe at the end of a
    cache line one would need to prove the non-existence of a suffix; so
    better to not go there).

    I agree with this. Prefixes seem more natural, large numbers expanding
    to the left, suffixes seem like a big-endian approach. But I use
    suffixes for large constants. I think with most VLI constant data
    follows the instruction.

    But not "self identified".


Yeah, if you can't tell whether more of the instruction follows the
first word just by looking at the first word, this is a drawback.

    Also, if you have to look at some special combination of register
    specifiers and/or a lot of other bits, this is also a problem.


    I find constant data easier to work with that
    way and they can be processed in the same clock cycle as a decode so
    they do not add to the dynamic instruction count. Just pass the current
    instruction slot plus a following area of the cache-line to the decoder.

    Handling suffixes at the end of a cache-line is not too bad if the cache
    already handles instructions spanning a cache line. Assume the maximum
    number of suffixes is present and ensure the cache-line is wide enough.
    Or limit the number of suffixes so they fit into the half cache-line
    used for spanning.

    It is easier to handle interrupts with suffixes. The suffix can just be
    treated as a NOP. Adjusting the position of the hardware interrupt to
    the start of an instruction then does not have to worry about accounting
    for a prefix / suffix.

    I would have thought that the previous instruction (last one retired) would provide the starting point of the subsequent instruction. This way you don't have to worry about counting prefixes or suffixes.


    Yeah.

    My thinking is, typical advance:
  IF figures out how much to advance;
  Next instruction gets PC+Step.

Then interrupt:
  Figure out which position in the pipeline interrupt starts from;
  Start there, flushing the rest of the pipeline;
  For a faulting instruction, this is typically the EX1 or EX2 stage.
    EX1 if it is a TRAP or SYSCALL;
    EX2 if it is a TLB miss or similar;
      Unless EX2 is not a valid spot (flush or bubble),
        then look for a spot that is not a flush or bubble.
      This case usually happens for branch-related TLB misses.

    Usually EX3 or WB is too old, as it would mean re-running previous instructions.
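A sketch of that stage-selection step (the stage encoding and struct
are hypothetical):

    #include <stdint.h>

    /* Pipeline stages, oldest to youngest. */
    enum { WB, EX3, EX2, EX1, RF, ID, IF, PF, NSTAGES };

    struct slot { int valid; int bubble; uint64_t pc; };

    /* Begin at the stage the fault nominally belongs to (EX1 for
       TRAP/SYSCALL, EX2 for a TLB miss) and walk toward younger
       stages past flushed/bubble slots, e.g. after a branch-related
       TLB miss. Returns the stage index, or -1 if nothing valid is
       in flight. */
    int restart_stage(const struct slot pipe[NSTAGES], int start) {
        for (int s = start; s < NSTAGES; s++)
            if (pipe[s].valid && !pipe[s].bubble)
                return s;
        return -1;
    }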

    Getting the exact stage-timing correct for interrupts is a little
    fiddly, but worrying about prefix/suffix/etc issues with interrupts
    isn't usually an issue, except that if somehow PC ended up pointing
    inside another instruction, I would consider this a fault.

Usually, for the sake of branch calculations in XG3 and RV, the
displacement is relative to the BasePC before the prefix in the case of
prefixed encodings. This differs from XG1 and XG2, which defined
branches relative to the PC of the following instruction.

    Though, this difference was partly due to a combination of
    implementation reasons and for consistency with RISC-V (when using a
    shared encoding space, makes sense if all the branches define PC
    displacements in a consistent way).


    Though, there is the difference that XG3's branches use a 32-bit scale
    rather than a 16-bit scale. Well, and unlike RV's displacements, they
    are not horrible confetti (*1).

    *1: One can try to write a new RV decoder, and then place bets on
    whether they will get JAL and Bcc encodings correct on the first try.
    IME, almost invariably, one will screw these up in some way on the first attempt. Like, JAL's displacement encoding is "the gift that keeps on
    giving" in this sense.

    Like, they were like:
  ADDI / Load:
    Yay, contiguous bits;
  Store:
    Well, swap the registers around and put the disp where Rd went.
  Bcc:
    Well, take the Store disp and just shuffle around a few more bits;
  JAL:
    Well, now there are some more bits, and Rd is back, ...
    Why not keep some of the bits from Bcc,
      but stick everything else in random places?...
    Well, I guess some share the relative positions as LUI, but, ...
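For the record, the actual J-type and B-type immediate shuffles,
written as C decode helpers over the 32-bit instruction word (assuming
arithmetic right shift of negative int32_t, as on mainstream targets):

    #include <stdint.h>

    /* JAL (J-type): imm[20|10:1|11|19:12] lives in inst[31:12]. */
    int32_t jal_imm(uint32_t inst) {
        int32_t imm = 0;
        imm |= (int32_t)(inst & 0x80000000) >> 11; /* imm[20], sign-ext */
        imm |= inst & 0x000FF000;                  /* imm[19:12] in place */
        imm |= (inst >> 9)  & 0x00000800;          /* imm[11] from bit 20 */
        imm |= (inst >> 20) & 0x000007FE;          /* imm[10:1] from 30:21 */
        return imm;
    }

    /* Bcc (B-type): imm[12|10:5] in inst[31:25], imm[4:1|11] in
       inst[11:7]. */
    int32_t bcc_imm(uint32_t inst) {
        int32_t imm = 0;
        imm |= (int32_t)(inst & 0x80000000) >> 19; /* imm[12], sign-ext */
        imm |= (inst << 4)  & 0x00000800;          /* imm[11] from bit 7 */
        imm |= (inst >> 20) & 0x000007E0;          /* imm[10:5] from 30:25 */
        imm |= (inst >> 7)  & 0x0000001E;          /* imm[4:1] from 11:8 */
        return imm;
    }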

    Not perfect in XG3 either, but still:
  { opw[5] ? 11'h7FF : 11'h000, opw[11:6], opw[31:16] }
    Is nowhere near the same level of nasty...


Well, never mind that the actual decoder has the reverse issue:
in the VL core and JX2VM, it was internally repacked back into XG2
form, which means a little bit of hair going on here. Also, I was
originally going to relocate it in the encoding space, but ended up
moving it back to its original location for various reasons (mostly
due to sharing the same decoder: having BRA/BSR in two different
locations would have effectively burned more encoding space than just
leaving it where it had been in XG1/XG2, even if having BRA/BSR in the
F0 block is "kinda stupid" given it is "very much not a 3R
instruction", but, ...).



    At least in most other instructions, the imm/disp bits remain
    contiguous. I instead differed by making the Rn/Rd spot be used as a
    source register by some instructions (taking on the role of Rt/Rs2), but
    IMO this is the lesser of two evils. Would rather have an Rd that is
    sometimes a source, than imm/disp fields that change chaotically from
    one instruction to another.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Jan 7 05:34:28 2026
    From Newsgroup: comp.arch

    On 2026-01-07 2:22 a.m., BGB wrote:
    On 1/6/2026 5:49 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    <snip>

One would argue that maybe prefixes are themselves wonky, but otherwise
one needs:
    Instructions that can directly encode the presence of large immediate
    values, etc;

    This is the direction of My 66000.

    The instruction stream is a linear stream of words.
    The first word of each instruction encodes its total length.
    What follows the instruction itself are merely constants used as
    operands in the instruction itself. All constants are 1 or 2
    words in length.

    I would not call this means "prefixed" or "suffixed". Generally,
    prefixes and suffixes consume bits of the prefix/suffix so that
    the constant (in my case) is not equal to container size. This
    leads to wonky operand/displacement sizes not equal 2^(3+k).


    OK.

    As can be noted:
  XG2/3: Prefix scheme, 1/2/3 x 32-bit
    The 96-bit cases are determined by two prefixes.
    Requires looking at 2 words to know total length.
  RV64+Jx:
    Total length is known from the first instruction word:
      Base op: 32 bits;
      J21I: 64 bits
      J52I: 96 bits.
    There was a J22+J22+LUI special case,
      but I now consider this as deprecated.
      J52I+ADDI is now considered preferable.

    As for Imm/Disp sizes:
  XG1: 9/33/57
  XG2 and XG3: 10/33/64
  RV+JX: 12/33/64

    For XG1, the 57-bit size was rarely used and only optionally supported, mostly because of the great "crap all of immediate values between 34 and
    62 bits" gulf.


    Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are
effectively retro-causal, and for any fetch to be safe at the end of a
cache line one would need to prove the non-existence of a suffix; so
    better to not go there).

    I agree with this. Prefixes seem more natural, large numbers expanding
    to the left, suffixes seem like a big-endian approach. But I use
    suffixes for large constants. I think with most VLI constant data
    follows the instruction.

    But not "self identified".


    Yeah, if you can't know whether or not more instruction follows after
    the first word by looking at the first word, this is a drawback.

    I do not find having to look at the second word much of a drawback.
    There is not much difference looking at either the first or second word.
    The words are all sitting available on the cache-line.
    Large constants are treated as more of the exceptional case in Qupls4.
    The immediate mode instructions can handle 28-bit constants. One suffix expands that out to 64-bits.
    By placing the constant info in the suffix, Qupls4 gains bits in the instruction that can be used for other purposes rather than handling
    large constants.
Using a suffix (or prefix) does lead to odd-sized constants, but as
long as they are large enough, so what?
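A sketch of that widening step in C; the field positions below are
invented for illustration, not Qupls4's actual layout:

    #include <stdint.h>

    /* Base instruction carries a sign-extended 28-bit immediate; one
       suffix word supplies the upper bits to widen it to 64. */
    int64_t decode_imm(uint32_t insn, int has_suffix, uint64_t suffix) {
        /* sign-extend the low 28 bits (xor/subtract idiom) */
        int64_t imm = (int64_t)((insn & 0x0FFFFFFFu) ^ 0x08000000u)
                    - 0x08000000;
        if (has_suffix)  /* suffix replaces the upper 36 bits */
            imm = (int64_t)((suffix << 28) | (insn & 0x0FFFFFFFu));
        return imm;
    }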

    Also, if you have to look at some special combination of register
    specifiers and/or a lot of other bits, this is also a problem.

I do not know. It depends on how it is handled. Qupls4 decodes r63 as
the constant zero, so it is a special register spec, like r0 in many
machines. r62 gets decoded as the IP value.
I think the constant decode is not likely on the timing critical path,
provided it is semi-sane.
Currently on the timing critical path for Qupls4 is the expansion of
instructions into multiple micro-ops. I think it needs another pipeline
stage.


I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current
instruction slot plus a following area of the cache-line to the decoder.

Handling suffixes at the end of a cache-line is not too bad if the cache
already handles instructions spanning a cache line. Assume the maximum
number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.

It is easier to handle interrupts with suffixes. The suffix can just be
treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.

I would have thought that the previous instruction (last one retired)
would provide the starting point of the subsequent instruction. This way
you don't have to worry about counting prefixes or suffixes.


    Yeah.

    My thinking is, typical advance:
  IF figures out how much to advance;
  Next instruction gets PC+Step.

Qupls4 does not bother figuring out how much to advance; it would be too
slow. It just assumes an increment. Why figure it out? If there are
instructions in a bundle, just advance to the next bundle. I found the
IP selection was on the timing critical path, so the BTB had to be
adjusted.

    Then interrupt:
  Figure out which position in the pipeline interrupt starts from;
  Start there, flushing the rest of the pipeline;
  For a faulting instruction, this is typically the EX1 or EX2 stage.
    EX1 if it is a TRAP or SYSCALL;
    EX2 if it is a TLB miss or similar;
      Unless EX2 is not a valid spot (flush or bubble),
        then look for a spot that is not a flush or bubble.
      This case usually happens for branch-related TLB misses.

    Usually EX3 or WB is too old, as it would mean re-running previous instructions.

    Getting the exact stage-timing correct for interrupts is a little
    fiddly, but worrying about prefix/suffix/etc issues with interrupts
    isn't usually an issue, except that if somehow PC ended up pointing
    inside another instruction, I would consider this a fault.

Usually, for the sake of branch calculations in XG3 and RV, the
displacement is relative to the BasePC before the prefix in the case of
prefixed encodings. This differs from XG1 and XG2, which defined
branches relative to the PC of the following instruction.

    Though, this difference was partly due to a combination of
    implementation reasons and for consistency with RISC-V (when using a
    shared encoding space, makes sense if all the branches define PC displacements in a consistent way).


    Though, there is the difference that XG3's branches use a 32-bit scale rather than a 16-bit scale. Well, and unlike RV's displacements, they
    are not horrible confetti (*1).

    *1: One can try to write a new RV decoder, and then place bets on
    whether they will get JAL and Bcc encodings correct on the first try.
    IME, almost invariably, one will screw these up in some way on the first attempt. Like, JAL's displacement encoding is "the gift that keeps on giving" in this sense.

    Like, they were like:
  ADDI / Load:
    Yay, contiguous bits;
  Store:
    Well, swap the registers around and put the disp where Rd went.
  Bcc:
    Well, take the Store disp and just shuffle around a few more bits;
  JAL:
    Well, now there are some more bits, and Rd is back, ...
    Why not keep some of the bits from Bcc,
      but stick everything else in random places?...
    Well, I guess some share the relative positions as LUI, but, ...

    Not perfect in XG3 either, but still:
  { opw[5] ? 11'h7FF : 11'h000, opw[11:6], opw[31:16] }
    Is nowhere near the same level of nasty...

Yeah, they may have gone a bit overboard trying to keep the constant
bits in the same position for RISC-V. It does make things smaller.

Well, never mind that the actual decoder has the reverse issue:
in the VL core and JX2VM, it was internally repacked back into XG2
form, which means a little bit of hair going on here. Also, I was
originally going to relocate it in the encoding space, but ended up
moving it back to its original location for various reasons (mostly
due to sharing the same decoder: having BRA/BSR in two different
locations would have effectively burned more encoding space than just
leaving it where it had been in XG1/XG2, even if having BRA/BSR in the
F0 block is "kinda stupid" given it is "very much not a 3R
instruction", but, ...).



    At least in most other instructions, the imm/disp bits remain
    contiguous. I instead differed by making the Rn/Rd spot be used as a
    source register by some instructions (taking on the role of Rt/Rs2), but
    IMO this is the lesser of two evils. Would rather have an Rd that is sometimes a source, than imm/disp fields that change chaotically from
    one instruction to another.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2