• Re: Variable-length instructions

    From BGB@cr88192@gmail.com to comp.arch on Tue Jan 6 16:42:29 2026
    From Newsgroup: comp.arch

    On 12/31/2025 2:23 AM, Robert Finch wrote:
    <snip>

    One would argue that maybe prefixes are themselves wonky, but
    otherwise one needs:
    Instructions that can directly encode the presence of large immediate
    values, etc;
    Or, the use of suffix-encodings (which is IMHO worse than prefix
    encodings; at least prefix encodings make intuitive sense if one views
    the instruction stream as linear, whereas suffixes add weirdness and
    are effectively retro-causal, and for any fetch to be safe at the end
    of a cache line one would need to prove the non-existence of a suffix;
    so better to not go there).

    I agree with this. Prefixes seem more natural, large numbers expanding
    to the left, suffixes seem like a big-endian approach. But I use
    suffixes for large constants. I think with most VLI constant data
    follows the instruction. I find constant data easier to work with that
    way and they can be processed in the same clock cycle as a decode so
    they do not add to the dynamic instruction count. Just pass the current instruction slot plus a following area of the cache-line to the decoder.


    ID stage is likely too late.

    For PC advance, ideally this needs to be known by the IF stage so that
    we can know how to advance PC for the next clock-cycle (for the PF stage).

    Say:
    PF IF ID RF E1 E2 E3 WB
       PF IF ID RF E1 E2 E3 WB
          PF IF ID RF E1 E2 E3 WB

    So, each IF stage produces an updated PC that needs to reach PF within
    the same clock-cycle (so the SRAMs can fetch data for the correct
    cache-line, which happens on a clock edge).

    This may also need to MUX PC's from things like the branch-predictor and branch-initiation logic, which then override the normal PC+Step handling generated from the IF->PF path (also typically at a low latency).
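    A minimal sketch of such a next-PC mux (the priority order and names here are assumptions, not the actual core's logic):

```python
# Hypothetical sketch of the next-PC mux feeding the PF stage.
# Priority order and names are assumptions, not the actual core's logic.
def next_fetch_pc(pc, step, predictor_redirect=None, branch_redirect=None):
    if branch_redirect is not None:      # branch-initiation logic wins
        return branch_redirect
    if predictor_redirect is not None:   # then the branch predictor
        return predictor_redirect
    return pc + step                     # normal PC+Step from IF->PF

# Sequential advance vs. an override from branch initiation:
next_fetch_pc(0x1000, 8)                          # -> 0x1008
next_fetch_pc(0x1000, 8, branch_redirect=0x2000)  # -> 0x2000
```

    The point being that all inputs to this mux need to settle within the same cycle, so whatever computes the step has to be cheap.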


    In this case, the end of the IF stage also handles some amount of repacking;
    Possible:
        Right-justifying the fetched instructions;
        16 -> 32 bit repacking (for RV-C)
    Current:
        Renormalization of XG1/XG2/XG3 into the same internal scheme;
        Repacking 48-bit RISC-V ops into internal 64-bit forms;
        ...
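    The 16 -> 32 repacking for RV-C works because RISC-V encodes instruction length in the first 16-bit parcel; a sketch of the standard RISC-V length rules (not BGB's internal format):

```python
def rv_inst_length(parcel):
    """Instruction length in bytes from the first 16-bit parcel,
    per the standard RISC-V length-encoding rules."""
    if parcel & 0b11 != 0b11:
        return 2                     # RV-C compressed (16-bit)
    if (parcel >> 2) & 0b111 != 0b111:
        return 4                     # normal 32-bit instruction
    if parcel & 0b111111 == 0b011111:
        return 6                     # 48-bit
    if parcel & 0b1111111 == 0b0111111:
        return 8                     # 64-bit
    return None                      # longer / reserved encodings

rv_inst_length(0x0013)  # -> 4  (low parcel of ADDI)
rv_inst_length(0x0001)  # -> 2  (a compressed encoding)
```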

    As a partial result of this repacking, the instruction words effectively
    gain a few extra bits (the "internal normalized format" no longer
    fitting entirely into a 32-bit word; where one could almost see it as a
    sort of "extended instruction" that includes both ISAs in a single slightly-larger virtual instruction word).


    One could go further and try to re-normalize the full instruction
    layout, but as noted XG3 and RV would still differ enough as to make
    this annoying (mostly the different encoding spaces and immed formats).

    * zzzzzzz-ooooo-mmmmm-zzz-nnnnn-yy-yyy11
    * zzzz-oooooo-mmmmmm-zzzz-nnnnnn-yy-yyPw


    With a possible normalized format (36-bit):
    * zzzzzzz-oooooo-mmmmmm-zzzz-nnnnnn-yyyyyPw
    * zzzzzzz-0ooooo-0mmmmm-yzzz-0nnnnn-1yyyy10 (RV Repack)
    * 000zzzz-oooooo-mmmmmm-zzzz-nnnnnn-0yyyyPw (XG3 Repack)

    Couldn't fully unify the encoding space within a single clock cycle
    though (within a reasonable cost budget).


    At present, the decoder handling is to essentially unify the 32-bit
    format for XG1/XG2/XG3 as XG2 with a few tag bits to disambiguate which
    ISA decoding rules should apply for the 32-bit instruction word in
    question. The other option would have been to normalize as XG3, but XG3
    loses some minor functionality from XG1 and XG2.


    I also decided against allowing RV and XG3 jumbo prefixes to be mixed.
    Though, it is possible exceptions could be made.

    Wouldn't have needed J52I if XG3 prefixes could have been used with RV
    ops, but can't use XG3 prefixes in RV-C mode, which is part of why I
    ended up resorting to the J52I prefix hack. But, still doesn't fully
    address the issues that exist with hot-patching in this mode.


    Though, looking at options, the "cheapest but fastest" option at present likely being:
    Core that only does XG3, possibly dropping the RV encodings and
    re-adding WEX in its place (though, in such an XG3-Only mode, the 10/11
    modes would otherwise be identical in terms of encoding).

    Or, basically, XG3 being used in a way more like how XG2 was used.

    But, don't really want to create yet-more modes at the moment. XG3 being
    used as superscalar isn't too much more expensive, and arguably more
    flexible given the compiler doesn't need to be aware of pipeline
    scheduling specifics, but can still make use of this when trying to
    shuffle instructions around for efficiency (a mismatch will then merely
    result in a small reduction in efficiency rather than a potential
    inability of the code to run; though for XG2 there was the feature that
    the CPU could fall back to scalar or potential superscalar operation in
    cases where the compiler's bundling was incompatible with what the CPU allowed).

    So, it is possible that in-order superscalar may be better as a general purpose option even if not strictly the cheapest option.


    A case could maybe be made arguing for dropping back down to 32 GPRs
    (with no FPRs) for more cheapness, but as-is, trying to do 128-bit SIMD
    stuff in RV64 mode also tends to quickly run into issues with register pressure.

    Well, and I was just recently having to partly rework the mechanism for:
    v = (__vec4f) { x, y, z, w };
    To not try to load all the registers at the same time, as this was occasionally running out of free dynamic registers with the normal RV
    ABI (and 12 callee-save FPRs doesn't go quite so far when allocating
    pairs of them), which effectively causes the compiler to break.


    It is almost tempting to consider switching RV64 over to the XG3 ABI
    when using SIMD, well, and/or not use SIMD with RV64 because it kinda
    sucks worse than XG3.

    But... Comparably, for the TKRA-GL front-end (using syscalls for the back-end), using runtime calls and similar for vector operations does
    still put a big dent in the framerate for GLQuake (so, some sort of SIMD
    in RV mode may still be needed even if "kinda inferior").


    Handling suffixes at the end of a cache-line is not too bad if the cache already handles instructions spanning a cache line. Assume the maximum number of suffixes is present and ensure the cache-line is wide enough.
    Or limit the number of suffixes so they fit into the half cache-line
    used for spanning.


    Difference:
    With a prefix, you know in advance the prefix exists (the prefix is immediately visible);
    With a suffix, one can only know it exists once it is visible.

    So, it poses similar issues to those in making superscalar fetch work
    across cache lines, which is made more of a challenge if one wants to
    make superscalar across line boundaries work during I$ miss handling
    rather than during instruction fetch.

    But, if the logic is able to run at I$ miss time, ideally it can see
    whether or not it is a single-wide or multi-wide instruction (say,
    because this logic runs only on a single cache line at a time).


    It is easier to handle interrupts with suffixes. The suffix can just be treated as a NOP. Adjusting the position of the hardware interrupt to
    the start of an instruction then does not have to worry about accounting
    for a prefix / suffix.


    Usual behavior is that when an interrupt occurs, SPC or similar always
    points at a valid PC, namely one from the pipeline, and usually/ideally,
    the exact instruction on which the fault occurred (though this does get
    a little fiddly with multiple instructions in the pipeline, and the CPU sometimes needs to figure out which stage corresponds to the fault in question; but this is usually more of an issue for I$ TLB misses, which often/usually trigger during a branch).



    For the most part, superscalar works the same either way, with similar
    efficiency. There is a slight efficiency boost if it would be possible
    to dynamically reshuffle ops during fetch. But, this is not currently
    a thing in my case.

    This latter case would apply if, say, a MEM op is followed by non-
    dependent ALU ops, which under current superscalar handling will not
    co-execute, but it could be possible in theory to swap the ops and
    allow them to co-execute.


    ...





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 23:49:05 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    <snip>

    One would argue that maybe prefixes are themselves wonky, but otherwise one needs:
    Instructions that can directly encode the presence of large immediate values, etc;

    This is the direction of My 66000.

    The instruction stream is a linear stream of words.
    The first word of each instruction encodes its total length.
    What follows the instruction itself are merely constants used as
    operands in the instruction itself. All constants are 1 or 2
    words in length.

    I would not call this mechanism "prefixed" or "suffixed". Generally,
    prefixes and suffixes consume bits of the prefix/suffix so that
    the constant (in my case) is not equal to container size. This
    leads to wonky operand/displacement sizes not equal to 2^(3+k).

    Or, the use of suffix-encodings (which is IMHO worse than prefix encodings; at least prefix encodings make intuitive sense if one views
    the instruction stream as linear, whereas suffixes add weirdness and are effectively retro-causal, and for any fetch to be safe at the end of a cache line one would need to prove the non-existence of a suffix; so better to not go there).

    I agree with this. Prefixes seem more natural, large numbers expanding
    to the left, suffixes seem like a big-endian approach. But I use
    suffixes for large constants. I think with most VLI constant data
    follows the instruction.

    But not "self identified".

    I find constant data easier to work with that
    way and they can be processed in the same clock cycle as a decode so
    they do not add to the dynamic instruction count. Just pass the current instruction slot plus a following area of the cache-line to the decoder.

    Handling suffixes at the end of a cache-line is not too bad if the cache already handles instructions spanning a cache line. Assume the maximum number of suffixes is present and ensure the cache-line is wide enough.
    Or limit the number of suffixes so they fit into the half cache-line
    used for spanning.

    It is easier to handle interrupts with suffixes. The suffix can just be treated as a NOP. Adjusting the position of the hardware interrupt to
    the start of an instruction then does not have to worry about accounting
    for a prefix / suffix.

    I would have thought that the previous instruction (last one retired) would provide the starting point of the subsequent instruction. This way you don't have to worry about counting prefixes or suffixes.




  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Jan 6 20:34:53 2026
    From Newsgroup: comp.arch

    On 2026-01-06 5:42 p.m., BGB wrote:
    On 12/31/2025 2:23 AM, Robert Finch wrote:
    <snip>

    One would argue that maybe prefixes are themselves wonky, but
    otherwise one needs:
    Instructions that can directly encode the presence of large immediate
    values, etc;
    Or, the use of suffix-encodings (which is IMHO worse than prefix
    encodings; at least prefix encodings make intuitive sense if one
    views the instruction stream as linear, whereas suffixes add
    weirdness and are effectively retro-causal, and for any fetch to be
    safe at the end of a cache line one would need to prove the non-
    existence of a suffix; so better to not go there).

    I agree with this. Prefixes seem more natural, large numbers expanding
    to the left, suffixes seem like a big-endian approach. But I use
    suffixes for large constants. I think with most VLI constant data
    follows the instruction. I find constant data easier to work with that
    way and they can be processed in the same clock cycle as a decode so
    they do not add to the dynamic instruction count. Just pass the
    current instruction slot plus a following area of the cache-line to
    the decoder.


    ID stage is likely too late.

    For PC advance, ideally this needs to be known by the IF stage so that
    we can know how to advance PC for the next clock-cycle (for the PF stage).

    Say:
       PF IF ID RF E1 E2 E3 WB
          PF IF ID RF E1 E2 E3 WB
             PF IF ID RF E1 E2 E3 WB

    The PC advance works okay without knowing whether there is a suffix
    present or not. The suffix is treated like a NOP instruction. There is
    no decode required at the fetch stage. The PC can land on a suffix. It
    just always advances by four (N) instructions unless there is a branch.

    So, each IF stage produces an updated PC that needs to reach PF within
    the same clock-cycle (so the SRAMs can fetch data for the correct
    cache-line, which happens on a clock edge).

    This may also need to MUX PC's from things like the branch-predictor and branch-initiation logic, which then override the normal PC+Step handling generated from the IF->PF path (also typically at a low latency).


    In this case, the end of the IF stage also handles some amount of
    repacking;
    Possible:
        Right-justifying the fetched instructions;
        16 -> 32 bit repacking (for RV-C)
    Current:
        Renormalization of XG1/XG2/XG3 into the same internal scheme;
        Repacking 48-bit RISC-V ops into internal 64-bit forms;
        ...

    As a partial result of this repacking, the instruction words effectively gain a few extra bits (the "internal normalized format" no longer
    fitting entirely into a 32-bit word; where one could almost see it as a
    sort of "extended instruction" that includes both ISAs in a single slightly-larger virtual instruction word).


    One could go further and try to re-normalize the full instruction
    layout, but as noted XG3 and RV would still differ enough as to make
    this annoying (mostly the different encoding spaces and immed formats).

    * zzzzzzz-ooooo-mmmmm-zzz-nnnnn-yy-yyy11
    * zzzz-oooooo-mmmmmm-zzzz-nnnnnn-yy-yyPw


    With a possible normalized format (36-bit):
    * zzzzzzz-oooooo-mmmmmm-zzzz-nnnnnn-yyyyyPw
    * zzzzzzz-0ooooo-0mmmmm-yzzz-0nnnnn-1yyyy10 (RV Repack)
    * 000zzzz-oooooo-mmmmmm-zzzz-nnnnnn-0yyyyPw (XG3 Repack)

    Couldn't fully unify the encoding space within a single clock cycle
    though (within a reasonable cost budget).


    At present, the decoder handling is to essentially unify the 32-bit
    format for XG1/XG2/XG3 as XG2 with a few tag bits to disambiguate which
    ISA decoding rules should apply for the 32-bit instruction word in
    question. The other option would have been to normalize as XG3, but XG3 loses some minor functionality from XG1 and XG2.


    I also went against allowing RV and XG3 jumbo prefixes to be mixed.
    Though, it is possible exceptions could be made.

    Wouldn't have needed J52I if XG3 prefixes could have been used with RV
    ops, but can't use XG3 prefixes in RV-C mode, which is part of why I
    ended up resorting to the J52I prefix hack. But, still doesn't fully
    address the issues that exist with hot-patching in this mode.


    Though, looking at options, the "cheapest but fastest" option at present likely being:
    Core that only does XG3, possibly dropping the RV encodings and re-
    adding WEX in its place (though, in such an XG3-Only mode, the 10/11
    modes would otherwise be identical in terms of encoding).

    Or, basically, XG3 being used in a way more like how XG2 was used.

    But, don't really want to create yet-more modes at the moment. XG3 being used as superscalar isn't too much more expensive, and arguably more flexible given the compiler doesn't need to be aware of pipeline
    scheduling specifics, but can still make use of this when trying to
    shuffle instructions around for efficiency (a mismatch will then merely result in a small reduction in efficiency rather than a potential
    inability of the code to run; though for XG2 there was the feature that
    the CPU could fall back to scalar or potential superscalar operation in cases where the compiler's bundling was incompatible with what the CPU allowed).

    So, it is possible that in-order superscalar may be better as a general purpose option even if not strictly the cheapest option.


    A case could maybe be made arguing for dropping back down to 32 GPRs
    (with no FPRs) for more cheapness, but as-is, trying to do 128-bit SIMD stuff in RV64 mode also tends to quickly run into issues with register pressure.

    Well, and I was just recently having to partly rework the mechanism for:
        v = (__vec4f) { x, y, z, w };
    To not try to load all the registers at the same time, as this was occasionally running out of free dynamic registers with the normal RV
    ABI (and 12 callee-save FPRs doesn't go quite so far when allocating
    pairs of them), which effectively causes the compiler to break.


    It is almost tempting to consider switching RV64 over to the XG3 ABI
    when using SIMD, well, and/or not use SIMD with RV64 because it kinda
    sucks worse than XG3.

    But... Comparably, for the TKRA-GL front-end (using syscalls for the back-end), using runtime calls and similar for vector operations does
    still put a big dent in the framerate for GLQuake (so, some sort of SIMD
    in RV mode may still be needed even if "kinda inferior").


    Handling suffixes at the end of a cache-line is not too bad if the
    cache already handles instructions spanning a cache line. Assume the
    maximum number of suffixes is present and ensure the cache-line is
    wide enough. Or limit the number of suffixes so they fit into the half
    cache-line used for spanning.


    Difference:
    With a prefix, you know in advance the prefix exists (the prefix is immediately visible);
    With a suffix, can only know it exists if it is visible.

    Instead with a prefix one only knows the instruction exists if it is
    visible. I do not think it makes much difference. Except that it may be
    harder to decode an instruction and look for a prefix that comes before it.


    So, it poses similar issues to those in making superscalar fetch work
    across cache lines, which is made more of a challenge if one wants to
    make superscalar across line boundaries work during I$ miss handling
    rather than during instruction fetch.

    But, if the logic is able to run at I$ miss time, ideally it can see
    whether or not it is a single-wide or multi-wide instruction (say,
    because this logic runs only on a single cache line at a time).


    It is easier to handle interrupts with suffixes. The suffix can just
    be treated as a NOP. Adjusting the position of the hardware interrupt
    to the start of an instruction then does not have to worry about
    accounting for a prefix / suffix.


    Usual behavior is that when an interrupt occurs, SPC or similar always points at a valid PC, namely one from the pipeline, and usually/ideally,
    the exact instruction on which the fault occurred (though this does get
    a little fiddly with multiple instructions in the pipeline, and the CPU sometimes needs to figure out which stage corresponds to the fault in question; but this is usually more of an issue for I$ TLB misses, which often/usually trigger during a branch).



    For the most part, superscalar works the same either way, with
    similar efficiency. There is a slight efficiency boost if it would be
    possible to dynamically reshuffle ops during fetch. But, this is not
    currently a thing in my case.

    This latter case would apply if, say, a MEM op is followed by non-
    dependent ALU ops, which under current superscalar handling they will
    not co-execute, but it could be possible in theory to swap the ops
    and allow them to co-execute.


    ...






  • From BGB@cr88192@gmail.com to comp.arch on Tue Jan 6 23:27:28 2026
    From Newsgroup: comp.arch

    On 1/6/2026 7:34 PM, Robert Finch wrote:
    On 2026-01-06 5:42 p.m., BGB wrote:
    On 12/31/2025 2:23 AM, Robert Finch wrote:
    <snip>

    One would argue that maybe prefixes are themselves wonky, but
    otherwise one needs:
    Instructions that can directly encode the presence of large
    immediate values, etc;
    Or, the use of suffix-encodings (which is IMHO worse than prefix
    encodings; at least prefix encodings make intuitive sense if one
    views the instruction stream as linear, whereas suffixes add
    weirdness and are effectively retro-causal, and for any fetch to be
    safe at the end of a cache line one would need to prove the non-
    existence of a suffix; so better to not go there).

    I agree with this. Prefixes seem more natural, large numbers
    expanding to the left, suffixes seem like a big-endian approach. But
    I use suffixes for large constants. I think with most VLI constant
    data follows the instruction. I find constant data easier to work
    with that way and they can be processed in the same clock cycle as a
    decode so they do not add to the dynamic instruction count. Just pass
    the current instruction slot plus a following area of the cache-line
    to the decoder.


    ID stage is likely too late.

    For PC advance, ideally this needs to be known by the IF stage so that
    we can know how to advance PC for the next clock-cycle (for the PF
    stage).

    Say:
       PF IF ID RF E1 E2 E3 WB
          PF IF ID RF E1 E2 E3 WB
             PF IF ID RF E1 E2 E3 WB

    The PC advance works okay without knowing whether there is a suffix
    present or not. The suffix is treated like a NOP instruction. There is
    no decode required at the fetch stage. The PC can land on a suffix. It
    just always advances by four (N) instructions unless there is a branch.

    You don't want to burn extra clock cycles stepping over either prefixes
    or suffixes.


    In any case, for either prefixed or suffixed instructions, best to have
    the fetch/decode and PC advance handle the whole thing as a single
    larger instruction.

    Well, likewise for superscalar; which would be rendered moot if one
    spent multiple clock cycles stepping over instruction words.


    <snip>


    It is almost tempting to consider switching RV64 over to the XG3 ABI
    when using SIMD, well, and/or not use SIMD with RV64 because it kinda
    sucks worse than XG3.

    But... Comparably, for the TKRA-GL front-end (using syscalls for the
    back-end), using runtime calls and similar for vector operations does
    still put a big dent in the framerate for GLQuake (so, some sort of
    SIMD in RV mode may still be needed even if "kinda inferior").

    Well, anyways, have since gotten SIMD in RV64 mode working "slightly
    better" but it still kinda sucks vs SIMD in XG3.

    Though the scope is a lot more limited.
    It is RV64G FPU ops, but:
        FADD.S and similar are understood as 2x Binary32 vs 1x Binary32;
        Optional: A rounding mode can specify 4x Binary32 vectors.
    There are FPU PACK instructions:
        These pack two 32-bit low or high elements into a 64-bit result.
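    The PACK semantics described here can be sketched roughly as follows (the names fpack_lo/fpack_hi and the operand order are guesses for illustration, not the actual encoding):

```python
# Hypothetical sketch of the described PACK semantics: combine two 32-bit
# elements of two 64-bit FP registers into one 64-bit result. The names
# fpack_lo/fpack_hi and the operand order are guesses, not the real ISA.
def fpack_lo(rs1_bits, rs2_bits):
    return ((rs2_bits & 0xFFFFFFFF) << 32) | (rs1_bits & 0xFFFFFFFF)

def fpack_hi(rs1_bits, rs2_bits):
    return ((rs2_bits >> 32) << 32) | (rs1_bits >> 32)

fpack_lo(0x1122334455667788, 0x99AABBCCDDEEFF00)  # -> 0xDDEEFF0055667788
fpack_hi(0x1122334455667788, 0x99AABBCCDDEEFF00)  # -> 0x99AABBCC11223344
```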

    GLQuake .text size:
    XG3 : 570K
    RV64: 682K

    Framerate at start of "start.bsp" (50MHz):
    XG3 : 5 fps
    RV64: 3 fps
    Vs, plain RV64G:
    RV64: 1 fps (and around 800K of .text)

    Still beats a GCC build here, which rolls in at a good solid 0 fps.
    Seemingly stuff can get wonky below 1 fps, but not entirely obvious from
    the code.


    Handling suffixes at the end of a cache-line is not too bad if the
    cache already handles instructions spanning a cache line. Assume the
    maximum number of suffixes is present and ensure the cache-line is
    wide enough. Or limit the number of suffixes so they fit into the
    half cache-line used for spanning.


    Difference:
    With a prefix, you know in advance the prefix exists (the prefix is
    immediately visible);
    With a suffix, can only know it exists if it is visible.

    Instead with a prefix one only knows the instruction exists if it is visible. I do not think it makes much difference. Except that it may be harder to decode an instruction and look for a prefix that comes before it.


    If PC somehow lands between the prefix and the instruction it applies
    to, then something has gone wrong.

    But, even then, checking if PC-4 is a jumbo prefix, or PC-8 is a J52I or similar, isn't exactly a hard task. CPU doesn't need to do this though,
    and if by some chance it does happen, "it is what it is".

    Still better than say:
    x86: Jumping into the middle of an instruction results in incoherent
    garbage from then on (automatic re-alignment will not happen);
    RV-C: Will typically re-align within 3-5 16-bit words.


    With both XG2/XG3 and RV64G+JX, branching into the middle of a prefixed instruction should resolve within 1 or 2 instruction words (32-bit).

    Given in these cases that jumbo-prefixed encodings are the minority,
    even starting at a random 32-bit aligned location is most likely to land
    on a valid instruction boundary. Landing on the second word of a
    misaligned J52I prefix will still align within 2 words (reading one
    garbage instruction, followed by the base-instruction that the prefix
    had been modifying; with a very low probability of reading a false jumbo prefix).



    With XG1, it is most likely to realign within 3 16-bit words.
    It was more likely to realign quickly as '111' is more statistically
    selective than '11', and the probability of seeing multiple false-hit misaligned 32-bit instructions in a row was very low (and, the
    instruction format and encoding-space layout made decoding multiple consecutive misaligned 32-bit instructions very unlikely).

    While RV-C has a similar property (causing the instruction stream to
    realign), it is less selective due to the instruction formats so there
    is a higher probability of seeing two false 32-bit instructions in a row.

    In either case, not seeing a false 32-bit instruction would lead to it
    seeing a 16-bit instruction, which would then (almost inevitably)
    re-align the decoder.


    Within XG2 and XG3, one would not see a false jumbo-prefixed
    instruction, as the instruction stream is encoded as instruction words
    where prefixes and suffixes are mutually exclusive (so, in this way, it
    is almost more like UTF-8; where it can unambiguously re-align the
    byte-stream within a single codepoint).
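    The UTF-8 analogy holds because continuation bytes are tagged 10xxxxxx, a pattern a lead byte can never have, so a decoder can always resynchronize by skipping them:

```python
def utf8_resync(buf, pos):
    """Skip UTF-8 continuation bytes (0b10xxxxxx) until pos lands on a
    codepoint boundary; a lead byte never has this bit pattern."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

buf = "a€b".encode("utf-8")   # b'a\xe2\x82\xacb'
utf8_resync(buf, 2)           # -> 4 (skips the two continuation bytes)
```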



    This property doesn't exist for x86 though, which would happily just
    result in reading a never-ending stream of misaligned garbage instructions.

    ...

  • From BGB@cr88192@gmail.com to comp.arch on Wed Jan 7 01:22:54 2026
    From Newsgroup: comp.arch

    On 1/6/2026 5:49 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    <snip>

    One would argue that maybe prefixes are themselves wonky, but otherwise
    one needs:
    Instructions that can directly encode the presence of large immediate
    values, etc;

    This is the direction of My 66000.

    The instruction stream is a linear stream of words.
    The first word of each instruction encodes its total length.
    What follows the instruction itself are merely constants used as
    operands in the instruction itself. All constants are 1 or 2
    words in length.

    I would not call this mechanism "prefixed" or "suffixed". Generally,
    prefixes and suffixes consume bits of the prefix/suffix so that
    the constant (in my case) is not equal to container size. This
    leads to wonky operand/displacement sizes not equal to 2^(3+k).


    OK.

    As can be noted:
        XG2/3: Prefix scheme, 1/2/3 x 32-bit
            The 96-bit cases are determined by two prefixes.
            Requires looking at 2 words to know total length.
        RV64+Jx:
            Total length is known from the first instruction word:
                Base op: 32 bits;
                J21I: 64 bits;
                J52I: 96 bits.
            There was a J22+J22+LUI special case,
            but I now consider this as deprecated.
            J52I+ADDI is now considered preferable.
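    The "length known from the first word" property makes total length a pure function of the leading 32-bit word. A sketch (the J21I/J52I opcode values below are hypothetical placeholders, not the real encodings):

```python
# Sketch of "length known from the first word". The J21I/J52I opcode
# values below are hypothetical placeholders, not the real encodings.
J21I_OP = 0x0B
J52I_OP = 0x2B

def rv_jx_length_bits(word32):
    op = word32 & 0x7F           # 7-bit major opcode field
    if op == J52I_OP:
        return 96                # J52I prefix + base op
    if op == J21I_OP:
        return 64                # J21I prefix + base op
    return 32                    # plain base op
```

    By contrast, the XG2/XG3 scheme would also need to examine the second word to distinguish the 96-bit (two-prefix) cases.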

    As for Imm/Disp sizes:
        XG1: 9/33/57
        XG2 and XG3: 10/33/64
        RV+JX: 12/33/64

    For XG1, the 57-bit size was rarely used and only optionally supported,
    mostly because of the great "crap all of immediate values between 34 and
    62 bits" gulf.


    Or, the use of suffix-encodings (which is IMHO worse than prefix
    encodings; at least prefix encodings make intuitive sense if one views
    the instruction stream as linear, whereas suffixes add weirdness and are
    effectively retro-causal, and for any fetch to be safe at the end of a
    cache line one would need to prove the non-existence of a suffix; so
    better to not go there).

    I agree with this. Prefixes seem more natural, large numbers expanding
    to the left, suffixes seem like a big-endian approach. But I use
    suffixes for large constants. I think with most VLI constant data
    follows the instruction.

    But not "self identified".


    Yeah, if you can't know whether or not more of the instruction follows
    after the first word by looking at the first word, this is a drawback.

    Also, if you have to look at some special combination of register
    specifiers and/or a lot of other bits, this is also a problem.


    I find constant data easier to work with that
    way and they can be processed in the same clock cycle as a decode so
    they do not add to the dynamic instruction count. Just pass the current
    instruction slot plus a following area of the cache-line to the decoder.

    Handling suffixes at the end of a cache-line is not too bad if the cache
    already handles instructions spanning a cache line. Assume the maximum
    number of suffixes is present and ensure the cache-line is wide enough.
    Or limit the number of suffixes so they fit into the half cache-line
    used for spanning.

    It is easier to handle interrupts with suffixes. The suffix can just be
    treated as a NOP. Adjusting the position of the hardware interrupt to
    the start of an instruction then does not have to worry about accounting
    for a prefix / suffix.

    I would have thought that the previous instruction (last one retired) would provide the starting point of the subsequent instruction. This way you don't have to worry about counting prefixes or suffixes.


    Yeah.

    My thinking is, typical advance:
        IF figures out how much to advance;
        Next instruction gets PC+Step.

    Then interrupt:
        Figure out which position in the pipeline the interrupt starts from;
        Start there, flushing the rest of the pipeline;
        For a faulting instruction, this is typically the EX1 or EX2 stage:
            EX1 if it is a TRAP or SYSCALL;
            EX2 if it is a TLB miss or similar;
            Unless EX2 is not a valid spot (flush or bubble),
            then look for a spot that is not a flush or bubble.
            This case usually happens for branch-related TLB misses.

    Usually EX3 or WB is too old, as it would mean re-running previous
    instructions.

    Getting the exact stage-timing correct for interrupts is a little
    fiddly, but worrying about prefix/suffix/etc issues with interrupts
    isn't usually an issue, except that if somehow PC ended up pointing
    inside another instruction, I would consider this a fault.

Usually, for the sake of branch calculations in XG3 and RV, it is relative to
the BasePC before the prefix in the case of prefixed encodings. This
differs from XG1 and XG2, which defined branches relative to the PC of
the following instruction.

    Though, this difference was partly due to a combination of
    implementation reasons and for consistency with RISC-V (when using a
    shared encoding space, makes sense if all the branches define PC
    displacements in a consistent way).


    Though, there is the difference that XG3's branches use a 32-bit scale
    rather than a 16-bit scale. Well, and unlike RV's displacements, they
    are not horrible confetti (*1).

    *1: One can try to write a new RV decoder, and then place bets on
    whether they will get JAL and Bcc encodings correct on the first try.
    IME, almost invariably, one will screw these up in some way on the first attempt. Like, JAL's displacement encoding is "the gift that keeps on
    giving" in this sense.

    Like, they were like:
    ADDI / Load:
    Yay, contiguous bits;
    Store:
    Well, swap the registers around and put the disp where Rd went.
    Bcc:
    Well, take the Store disp and just shuffle around a few more bits;
    JAL:
    Well, now there are some more bits, and Rd is back, ...
    Why not keep some of the bits from Bcc,
    but stick everything else in random places?...
    Well, I guess some share the relative positions as LUI, but, ...

    Not perfect in XG3 either, but still:
    { opw[5] ? 11'h7FF : 11'h000, opw[11:6], opw[31:16] }
    Is nowhere near the same level of nasty...
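To illustrate the "confetti" point, here is a minimal sketch (mine, not from the post) of reassembling the JAL immediate per the RISC-V spec's bit layout imm[20|10:1|11|19:12]:

```python
def decode_jal_imm(insn: int) -> int:
    """Reassemble the scattered JAL immediate fields (RV32I J-type)."""
    imm = ((insn >> 31) & 0x1) << 20      # insn[31]    -> imm[20]
    imm |= ((insn >> 21) & 0x3FF) << 1    # insn[30:21] -> imm[10:1]
    imm |= ((insn >> 20) & 0x1) << 11     # insn[20]    -> imm[11]
    imm |= ((insn >> 12) & 0xFF) << 12    # insn[19:12] -> imm[19:12]
    if imm & (1 << 20):                   # sign-extend from bit 20
        imm -= 1 << 21
    return imm
```

Four separate shift-and-mask steps just to recover one displacement; compare the single concatenation needed for the XG3 form below.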


Well, nevermind if the actual decoder has the reverse issue:
In the VL core and JX2VM, it was internally repacked back into XG2 form,
which means a little bit of hair going on here. Also, I was originally
going to relocate it in the encoding space, but ended up moving it back
to its original location, as (mostly due to sharing the same decoder)
having BRA/BSR in two different locations would have effectively burned
more encoding space than just leaving it where it had been in XG1/XG2
(even if having BRA/BSR in the F0 block is "kinda stupid" given it is
"very much not a 3R instruction", but, ...).



    At least in most other instructions, the imm/disp bits remain
    contiguous. I instead differed by making the Rn/Rd spot be used as a
    source register by some instructions (taking on the role of Rt/Rs2), but
    IMO this is the lesser of two evils. Would rather have an Rd that is
    sometimes a source, than imm/disp fields that change chaotically from
    one instruction to another.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Jan 7 05:34:28 2026
    From Newsgroup: comp.arch

    On 2026-01-07 2:22 a.m., BGB wrote:
    On 1/6/2026 5:49 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    <snip>

One would argue that maybe prefixes are themselves wonky, but otherwise one needs:
    Instructions that can directly encode the presence of large immediate
    values, etc;

    This is the direction of My 66000.

    The instruction stream is a linear stream of words.
    The first word of each instruction encodes its total length.
    What follows the instruction itself are merely constants used as
    operands in the instruction itself. All constants are 1 or 2
    words in length.

I would not call this "prefixed" or "suffixed". Generally,
prefixes and suffixes consume bits of the prefix/suffix so that
the constant (in my case) is not equal to the container size. This
leads to wonky operand/displacement sizes not equal to 2^(3+k).


    OK.

    As can be noted:
  XG2/3: Prefix scheme, 1/2/3 x 32-bit
    The 96-bit cases are determined by two prefixes.
    Requires looking at 2 words to know total length.
  RV64+Jx:
    Total length is known from the first instruction word:
      Base op: 32 bits;
      J21I: 64 bits;
      J52I: 96 bits.
    There was a J22+J22+LUI special case,
      but I now consider this as deprecated.
      J52I+ADDI is now considered preferable.

As for Imm/Disp sizes:
  XG1: 9/33/57
  XG2 and XG3: 10/33/64
  RV+JX: 12/33/64

For XG1, the 57-bit size was rarely used and only optionally supported,
mostly because of the great "crap-all immediate values between 34 and
62 bits" gulf.


Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are
effectively retro-causal, and for any fetch to be safe at the end of a
cache line one would need to prove the non-existence of a suffix; so
better to not go there).

    I agree with this. Prefixes seem more natural, large numbers expanding
    to the left, suffixes seem like a big-endian approach. But I use
    suffixes for large constants. I think with most VLI constant data
    follows the instruction.

    But not "self identified".


Yeah, if you can't know whether or not more of the instruction follows
after the first word by looking at the first word, this is a drawback.

    I do not find having to look at the second word much of a drawback.
    There is not much difference looking at either the first or second word.
    The words are all sitting available on the cache-line.
Large constants are treated as more of the exceptional case in Qupls4.
The immediate-mode instructions can handle 28-bit constants. One suffix
expands that out to 64 bits.
By placing the constant info in the suffix, Qupls4 gains bits in the
instruction that can be used for other purposes rather than handling
large constants.
Using a suffix (or prefix) does lead to odd-sized constants, but as long
as they are large enough, so what.
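As a toy illustration of a suffix expanding a short immediate: the 28-bit figure is from the post, but the exact field split below (suffix supplying the upper 36 bits) is an invented assumption, not the actual Qupls4 encoding:

```python
def build_constant(imm28, suffix36=None):
    """Hypothetical layout: base instruction carries imm[27:0]; one
    optional suffix word supplies imm[63:28]."""
    v = imm28 & ((1 << 28) - 1)
    if suffix36 is None:
        # no suffix: sign-extend the 28-bit immediate
        return v - (1 << 28) if v & (1 << 27) else v
    # suffix present: concatenate to a full 64-bit constant
    return ((suffix36 & ((1 << 36) - 1)) << 28) | v
```

The point is only that the concatenation is a wire-level rename, cheap enough to happen alongside decode in the same cycle.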

    Also, if you have to look at some special combination of register
    specifiers and/or a lot of other bits, this is also a problem.

I do not know. It depends how it is handled. Qupls4 decodes r63 as the
constant zero, so it is a special register spec, like r0 on many
machines. r62 gets decoded as the IP value.
I think the constant decode is not likely on the timing-critical path,
provided it is semi-sane.
Currently, what is on the timing path for Qupls4 is expanding
instructions out into multiple micro-ops. I think it needs another
pipeline stage.
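The r63/r62 convention above amounts to a small mux in front of the register file; a sketch (function name and interface invented for illustration):

```c
#include <stdint.h>

/* Special register-specifier decode, per the post's description of
   Qupls4: r63 reads as the constant zero, r62 as the IP value. */
static uint64_t read_operand(unsigned spec, const uint64_t regs[64],
                             uint64_t ip) {
    if (spec == 63) return 0;    /* r63: hardwired zero */
    if (spec == 62) return ip;   /* r62: current IP value */
    return regs[spec];           /* ordinary register read */
}
```

Since the mux selects are just a compare on the specifier bits, this sits in parallel with the register-file read rather than after it.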


I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current
instruction slot plus a following area of the cache-line to the decoder.

Handling suffixes at the end of a cache-line is not too bad if the cache
already handles instructions spanning a cache line. Assume the maximum
number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.

It is easier to handle interrupts with suffixes. The suffix can just be
treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.

I would have thought that the previous instruction (last one retired)
would provide the starting point of the subsequent instruction. This way
you don't have to worry about counting prefixes or suffixes.


    Yeah.

    My thinking is, typical advance:
  IF figures out how much to advance;
  Next instruction gets PC+Step.

Qupls4 does not bother figuring out how much to advance; it would be too
slow. It just assumes an increment. Why figure it out? If there are
instructions in a bundle, just advance to the next bundle. I found the
IP selection was on the timing-critical path; the BTB had to be adjusted.

    Then interrupt:
  Figure out which position in the pipeline interrupt starts from;
  Start there, flushing the rest of the pipeline;
  For a faulting instruction, this is typically the EX1 or EX2 stage.
    EX1 if it is a TRAP or SYSCALL;
    EX2 if it is a TLB miss or similar;
      Unless EX2 is not a valid spot (flush or bubble),
        then look for a spot that is not a flush or bubble.
      This case usually happens for branch-related TLB misses.

    Usually EX3 or WB is too old, as it would mean re-running previous instructions.

    Getting the exact stage-timing correct for interrupts is a little
    fiddly, but worrying about prefix/suffix/etc issues with interrupts
    isn't usually an issue, except that if somehow PC ended up pointing
    inside another instruction, I would consider this a fault.

Usually, for the sake of branch calculations in XG3 and RV, it is relative to
the BasePC before the prefix in the case of prefixed encodings. This
differs from XG1 and XG2, which defined branches relative to the PC of
the following instruction.

    Though, this difference was partly due to a combination of
    implementation reasons and for consistency with RISC-V (when using a
    shared encoding space, makes sense if all the branches define PC displacements in a consistent way).


    Though, there is the difference that XG3's branches use a 32-bit scale rather than a 16-bit scale. Well, and unlike RV's displacements, they
    are not horrible confetti (*1).

    *1: One can try to write a new RV decoder, and then place bets on
    whether they will get JAL and Bcc encodings correct on the first try.
    IME, almost invariably, one will screw these up in some way on the first attempt. Like, JAL's displacement encoding is "the gift that keeps on giving" in this sense.

    Like, they were like:
  ADDI / Load:
    Yay, contiguous bits;
  Store:
    Well, swap the registers around and put the disp where Rd went.
  Bcc:
    Well, take the Store disp and just shuffle around a few more bits;
  JAL:
    Well, now there are some more bits, and Rd is back, ...
    Why not keep some of the bits from Bcc,
      but stick everything else in random places?...
    Well, I guess some share the relative positions as LUI, but, ...

    Not perfect in XG3 either, but still:
  { opw[5] ? 11'h7FF : 11'h000, opw[11:6], opw[31:16] }
    Is nowhere near the same level of nasty...

Yeah, they may have gone a bit overboard trying to keep the constant
bits in the same position for RISC-V. It does make things smaller.

Well, nevermind if the actual decoder has the reverse issue:
In the VL core and JX2VM, it was internally repacked back into XG2 form,
which means a little bit of hair going on here. Also, I was originally
going to relocate it in the encoding space, but ended up moving it back
to its original location, as (mostly due to sharing the same decoder)
having BRA/BSR in two different locations would have effectively burned
more encoding space than just leaving it where it had been in XG1/XG2
(even if having BRA/BSR in the F0 block is "kinda stupid" given it is
"very much not a 3R instruction", but, ...).



    At least in most other instructions, the imm/disp bits remain
    contiguous. I instead differed by making the Rn/Rd spot be used as a
    source register by some instructions (taking on the role of Rt/Rs2), but
    IMO this is the lesser of two evils. Would rather have an Rd that is sometimes a source, than imm/disp fields that change chaotically from
    one instruction to another.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sat Jan 31 18:14:31 2026
    From Newsgroup: comp.arch

    I attempted to construct an architecture with ordinary variable-length instructions, based on the Concertina II architecture, but with register
    banks of 16 registers instead of 32. However, it didn't work out; there
    wasn't enough opcode space for the 16-bit register-to-register
    instructions. This surprised me; maybe I can figure something out.

    John Savard

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 4 20:36:52 2026
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    I attempted to construct an architecture with ordinary variable-length instructions, based on the Concertina II architecture, but with register banks of 16 registers instead of 32. However, it didn't work out; there wasn't enough opcode space for the 16-bit register-to-register
    instructions. This surprised me; maybe I can figure something out.

    In my honest opinion, to get 16-bit instructions to work/help/add-value
    you have to sacrifice much elsewhere in your ISA.

    Whether you can get it to work is a semblance of your balance.

    John Savard

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Wed Feb 4 21:32:58 2026
    From Newsgroup: comp.arch

    On 12/28/25 6:53 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/28/2025 2:04 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
    On 12/21/2025 1:21 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}

For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

I just can't see how to make these run reasonably fast within the constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken
    here...

when I read this, I thought that there was a standard technique for
doing stuff like that in a GBOoO machine.

There is::: it is called "load 'em up, pass 'em through". That is no different than any other calculation, except that no mangling of the bits is going on.

Just break down all the fancy instructions into RISC-style pseudo-ops.
But apparently, since you would know all about that, there must be a
reason why it doesn't apply in these cases.

x86 has short/small MOV instructions; not so with RISCs.

Does your EMS use a so-called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc..

The 2-operand+displacement LD/STs have a lock bit in the instruction--
that is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.

Oh, and it's ESM not EMS. Exotic Synchronization Method.

    In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
    {A.K.A. a prefix}.

    Thanks for the clarification.

    On x86/x64 LOCK XADD is a loopless wait free operation.

I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless impl. If we are on another system and that LOCK XADD is some sort of LL/SC "style" loop, well, that causes damage to my loopless claim... ;^o

    So, can your system get wait free semantics for RMW atomics?

    A::

    ATOMIC-to-Memory-size [address]
    ADD Rd,--,#1

Will attempt an ATOMIC add to L1 cache. If the line is writeable, the ADD is
performed and the line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable state, the add is performed and
the previous value returned. If [address] is not writeable, the cache
line is invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified}, which is typical.}

When [address] reaches the Memory-Controller it is scheduled in arrival
order; other caches system-wide will receive CI, and modified lines
will be pushed back to the DRAM-Controller. When the CI is "performed", MC/
DRC will perform add #1 to [address] and the previous value is returned
as its result.

    {{That is the ADD is performed where the data is found in the
    memory hierarchy, and the previous value is returned as result;
    with all cache-effects and coherence considered.}}
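The search-outward behaviour described above can be sketched as a toy model (the class and function names are invented, and this ignores coherence messaging; it is not the real My 66000 protocol):

```python
class Line:
    """One cached copy of an address: its value and writeability."""
    def __init__(self, value, writeable):
        self.value = value
        self.writeable = writeable

def atomic_add(hierarchy, memory, addr, n):
    """Perform the add at the first level holding the line writeable,
    invalidating non-writeable copies on the way out; fall back to the
    memory controller. Returns the previous value."""
    for level in hierarchy:            # search outward: L1, L2, ...
        line = level.get(addr)
        if line is not None:
            if line.writeable:         # op executes at this level
                old = line.value
                line.value += n
                return old             # previous value is the result
            del level[addr]            # invalidate, keep searching
    old = memory[addr]                 # reached the memory controller
    memory[addr] = old + n
    return old
```

Note the data never migrates toward the requesting core; only the small request and the old value travel.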

    A HW guy would not call this wait free--since the CPU is waiting
    until all the nuances get sorted out, but SW will consider this
    wait free since SW does not see the waiting time unless it uses
    a high precision timer to measure delay.

    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
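For reference, the portable C11 spelling of the operation: on x86-64 this typically compiles to a single LOCK XADD, while on LL/SC machines the same source lowers to a load-linked/store-conditional retry loop, which is exactly the disassembly outcome being objected to here:

```c
#include <stdatomic.h>

/* C11 fetch-and-add: returns the previous value, like LOCK XADD. */
static int fetch_add_prev(atomic_int *p, int n) {
    return atomic_fetch_add(p, n);
}
```

Whether this is wait-free or merely lock-free is thus a property of the target ISA, not of the C source.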

If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under
contention. So you DON'T DO IT LIKE THAT.

    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    I wonder if there is an issue of communicating intention to the
    computer. Using atomic-to-memory may be intended to communicate
    that the operation is expected to be under contention or that
    moderating the impact under high contention is more important
    than having a fast "happy path".

    This seems to be similar to branch hints and predication in that
    urging the computer to handle the task in a specific way may not
    be optimal for the goal of the user/programmer. A programmer
    might use predication to avoid a branch that is expected to be
    poorly predicted or to have more consistent execution time. The
    former could be inappropriate for the computer to obey if the
    branch predictor became effective for that branch. If prediction
    is accurate, predicate prediction could improve performance but
    would break execution time consistency. Even reducing execution
    time when the predicate is known early might go against the
    programmer's intent by leaking information.

    If the requesting core has the cache block in exclusive or
    modified state, remote execution might be less efficient. Yet it
    may also be possible that the block is expected to be moved to
    another core such that this pushing the data manipulation to a
    "centralized" location would improve performance as was the
    programmer intent (rather than avoiding contention overhead). (I
    suppose in My 66000, a programmer would use a push/"prefetch"
    instruction to move the cache block to a more central location,
    but even that might be sub-optimal if the hardware can predict
    the next data consumer such that centrally located would be
    slower than shifting the data closer to the consumer.)

    If the contention is from false sharing (having multiple atomic
    data in a cache block seems to be considered bad programming
    practice, so this should not be common unless cache block size
    grows), hardware could theoretically provide special word caches
    (or "lossy" block compression where part of the block is dropped)
    for moderating the impact of false sharing. This would change the
    optimization preferences for the program (more compact data might
    be preferred if false sharing is less of a problem).

    I do not know what the best interface would be, but it seems that
    some care should be taken to account for differing intent when a
    programmer suggests a specific mechanism. This also gets into the distinction/spectrum/space between a hint and a directive. Both
    hints and directives can have unexpected performance changes
    under different microarchitectures or different usage.

    Clever microarchitecture can make some optimizations sub-optimal
    as well as cause programmers to ask "why didn't you do it the way
    I told you to do it?!"

    (I think I may not be communicating well. I am kind of tired
    right now and this topic is complicated.)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 5 18:49:39 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 12/28/25 6:53 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/28/2025 2:04 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
    On 12/21/2025 1:21 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <quadibloc@invalid.invalid> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}

For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

I just can't see how to make these run reasonably fast within the constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken
    here...

when I read this, I thought that there was a standard technique for
doing stuff like that in a GBOoO machine.

There is::: it is called "load 'em up, pass 'em through". That is no different than any other calculation, except that no mangling of the bits is going on.

Just break down all the fancy instructions into RISC-style pseudo-ops.
But apparently, since you would know all about that, there must be a
reason why it doesn't apply in these cases.

x86 has short/small MOV instructions; not so with RISCs.

Does your EMS use a so-called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc..

The 2-operand+displacement LD/STs have a lock bit in the instruction--
that is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.

Oh, and it's ESM not EMS. Exotic Synchronization Method.

    In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
    {A.K.A. a prefix}.

    Thanks for the clarification.

    On x86/x64 LOCK XADD is a loopless wait free operation.

I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless impl. If we are on another system and that LOCK XADD is some sort of LL/SC "style" loop, well, that causes damage to my loopless claim... ;^o

    So, can your system get wait free semantics for RMW atomics?

    A::

    ATOMIC-to-Memory-size [address]
    ADD Rd,--,#1

Will attempt an ATOMIC add to L1 cache. If the line is writeable, the ADD is
performed and the line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable state, the add is performed and
the previous value returned. If [address] is not writeable, the cache
line is invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified}, which is typical.}

When [address] reaches the Memory-Controller it is scheduled in arrival
order; other caches system-wide will receive CI, and modified lines
will be pushed back to the DRAM-Controller. When the CI is "performed", MC/
DRC will perform add #1 to [address] and the previous value is returned
as its result.

    {{That is the ADD is performed where the data is found in the
    memory hierarchy, and the previous value is returned as result;
    with all cache-effects and coherence considered.}}

    A HW guy would not call this wait free--since the CPU is waiting
    until all the nuances get sorted out, but SW will consider this
    wait free since SW does not see the waiting time unless it uses
    a high precision timer to measure delay.

    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)

If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under
contention. So you DON'T DO IT LIKE THAT.

    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    I wonder if there is an issue of communicating intention to the
    computer. Using atomic-to-memory may be intended to communicate
    that the operation is expected to be under contention or that
    moderating the impact under high contention is more important
    than having a fast "happy path".

    There is a speed of light problem here. Communicating across a
    computer <system> is a microsecond time problem, whereas executing
    instructions is a nanosecond time problem.

And this is exactly where Add-to-Memory gains over Interferable
ATOMIC events--you only pay the latency once. Now, while the latency
is higher than possible with LL-SC, it is WAY LOWER than the worst case
with LL-SC under serious contention.

    This seems to be similar to branch hints and predication in that
    urging the computer to handle the task in a specific way may not
    be optimal for the goal of the user/programmer.
    Explain
    A programmer
    might use predication to avoid a branch that is expected to be
    poorly predicted or to have more consistent execution time. The

    My 66000 predication can avoid 2 branches--it operates under the
    notion that if FETCH reaches the join point before condition is
    known, then predication is always faster than branching.

    former could be inappropriate for the computer to obey if the
    branch predictor became effective for that branch. If prediction
    is accurate, predicate prediction could improve performance but
    would break execution time consistency. Even reducing execution
    time when the predicate is known early might go against the
    programmer's intent by leaking information.

    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    If the requesting core has the cache block in exclusive or
    modified state, remote execution might be less efficient. Yet it
    may also be possible that the block is expected to be moved to
    another core such that this pushing the data manipulation to a
    "centralized" location would improve performance as was the
    programmer intent (rather than avoiding contention overhead). (I
    suppose in My 66000, a programmer would use a push/"prefetch"
    instruction to move the cache block to a more central location,
    but even that might be sub-optimal if the hardware can predict
    the next data consumer such that centrally located would be
    slower than shifting the data closer to the consumer.)

    I have been betting that (in the general case) software will
    remain incapable of doing such predictions, for quite some time.

    If the contention is from false sharing (having multiple atomic
    data in a cache block seems to be considered bad programming
    practice, so this should not be common unless cache block size
    grows), hardware could theoretically provide special word caches
    (or "lossy" block compression where part of the block is dropped)
    for moderating the impact of false sharing. This would change the optimization preferences for the program (more compact data might
    be preferred if false sharing is less of a problem).

    I do not know what the best interface would be, but it seems that
    some care should be taken to account for differing intent when a
    programmer suggests a specific mechanism. This also gets into the distinction/spectrum/space between a hint and a directive. Both
    hints and directives can have unexpected performance changes
    under different microarchitectures or different usage.

    Clever microarchitecture can make some optimizations sub-optimal
    as well as cause programmers to ask "why didn't you do it the way
    I told you to do it?!"

Instruction scheduling is, in effect, an optimization for one implementation
that may not hold for other implementations. In-order pipelines need
instruction scheduling, OoO do not, and GBOoO generally performs better
if/when instructions are not (or less) scheduled.

Loop unrolling is a case where, if your machine has vVM, the code
runs faster when not {unrolled and executed with branch prediction}
than when it is.

    (I think I may not be communicating well. I am kind of tired
    right now and this topic is complicated.)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 5 21:27:51 2026
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    Paul Clayton <paaronclayton@gmail.com> posted:

    -----------------
    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)

If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under
contention. So you DON'T DO IT LIKE THAT.

    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    I wonder if there is an issue of communicating intention to the
    computer. Using atomic-to-memory may be intended to communicate
    that the operation is expected to be under contention or that
    moderating the impact under high contention is more important
    than having a fast "happy path".

    There is a speed of light problem here. Communicating across a
    computer <system> is a microsecond time problem, whereas executing instructions is a nanosecond time problem.

And this is exactly where Add-to-Memory gains over Interferable
ATOMIC events--you only pay the latency once. Now, while the latency
is higher than possible with LL-SC, it is WAY LOWER than the worst case
with LL-SC under serious contention.

    This seems to be similar to branch hints and predication in that
    urging the computer to handle the task in a specific way may not
    be optimal for the goal of the user/programmer.
    Explain
    A programmer
    might use predication to avoid a branch that is expected to be
    poorly predicted or to have more consistent execution time. The

    My 66000 predication can avoid 2 branches--it operates under the
    notion that if FETCH reaches the join point before condition is
    known, then predication is always faster than branching.

    former could be inappropriate for the computer to obey if the
    branch predictor became effective for that branch. If prediction
    is accurate, predicate prediction could improve performance but
    would break execution time consistency. Even reducing execution
    time when the predicate is known early might go against the
    programmer's intent by leaking information.

    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    If the requesting core has the cache block in exclusive or
    modified state, remote execution might be less efficient. Yet it

    In My 66000 architecture, an atomic operation on memory will have
    the property of being executed where the cache line in a modifiable
    state happens to reside. Here you don't want to "bring data in"
    nor do you want to "push data out" you want the memory hierarchy
    to be perturbed as little as possible WHILE performing the ATOMIC
    event.

    The cost is that each cache will have an adder--with the sizes of
    cache these days, said adder is a trifling in area and seldom used
    so it is low power.

    may also be possible that the block is expected to be moved to
    another core such that pushing the data manipulation to a
    "centralized" location would improve performance, as was the
    programmer's intent (rather than avoiding contention overhead). (I
    suppose in My 66000, a programmer would use a push/"prefetch"
    instruction to move the cache block to a more central location,
    but even that might be sub-optimal if the hardware can predict
    the next data consumer such that centrally located would be
    slower than shifting the data closer to the consumer.)

    I have been betting that (in the general case) software will
    remain incapable of doing such predictions, for quite some time.

    If the contention is from false sharing (having multiple atomic
    data in a cache block seems to be considered bad programming
    practice, so this should not be common unless cache block size
    grows), hardware could theoretically provide special word caches
    (or "lossy" block compression where part of the block is dropped)
    for moderating the impact of false sharing. This would change the
    optimization preferences for the program (more compact data might
    be preferred if false sharing is less of a problem).

    I do not know what the best interface would be, but it seems that
    some care should be taken to account for differing intent when a
    programmer suggests a specific mechanism. This also gets into the
    distinction/spectrum/space between a hint and a directive. Both
    hints and directives can have unexpected performance changes
    under different microarchitectures or different usage.

    Clever microarchitecture can make some optimizations sub-optimal
    as well as cause programmers to ask "why didn't you do it the way
    I told you to do it?!"

    Instruction scheduling is in effect an optimization for one
    implementation that may not hold for other implementations. In-order
    pipelines need instruction scheduling, OoO do not, and GBOoO generally
    perform better if/when instructions are not (or less) scheduled.

    Loop unrolling is a case where if your machine has vVM the code
    runs faster when not {unrolled and executed with branch prediction}
    than when {}.

    (I think I may not be communicating well. I am kind of tired
    right now and this topic is complicated.)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 16:42:56 2026
    From Newsgroup: comp.arch

    On 12/19/25 12:41 PM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
    Or in other words, if you can decode K-instructions per cycle, you'd
    better be able to execute K-instructions per cycle--or you have a
    serious blockage in your pipeline.

    No.

    If you flipped "decode" and "execute" in that sentence above, I would
    100% agree. And maybe this _is_ just a typo.

    But if you actually did mean that sentence exactly as written, I would
    disagree. This is why: I regard executing instructions as 'doing the
    actual work' and decoding instructions as... some unfortunate trivial
    overhead that can't be avoided.

    It does not matter what "the actual work" is and what isn't. What
    matters is how expensive it is to make a particular part wider, and
    how paying that cost benefits the IPC. At every step you add width to
    the part with the best benefit/cost ratio.

    And looking at recent cores, we see that, e.g., Skymont can decode
    3x3=9 instructions per cycle, rename 8 per cycle, has 26 ports to
    functional units (i.e., can execute 26 uops in one cycle); I don't
    know how many instructions it can retire per cycle, but I expect that
    it is more than 8 per cycle.

    So the renamer is the bottleneck, and that's also the idea behind
    top-down microarchitecture analysis (TMA) for determining how
    software interacts with the microarchitecture. That idea is coming out of
    Intel, but if Intel is finding it hard to make wider renamers rather
    than wider other parts, I expect that the rest of the industry also
    finds that hard (especially for architectures where decoding is
    cheaper), and (looking at ARM A64) where instructions with more
    demands on the renamer exist.

    It is not clear to me that the renamer is clearly the bottleneck
    merely because it is narrower. Wider rename might not increase
    performance that much. The 9-instruction decode is a consequence
    of using three similar 3-wide decoders with predicting three
    targets (so taken branches can be handled). If the target
    predictor for straightline code is for each three instructions,
    then all the decoders should be the same width. Avoiding 4-wide
    decoders makes sense both to support more taken branches and to
    avoid the extra instruction length determination complexity.

    (Since rename is 8 uops and decode is 9 _instructions_, the width
    difference is greater than it might seem from the numbers. I
    think x86 designs still have more instruction cracking than
    instruction fusion.)

    A component can be small because it is expensive to make it
    larger or because there is less benefit to making it larger. I
    would consider the former a bottleneck for the designers, but
    the later seems to imply that other constraints should be
    loosened before more design effort is applied to that component.

    I have not been paying attention to achieved IPC, but I got the
    impression that for messy "integer" code achieved IPC was not up
    to 4. While the front end needs to process more instructions
    than are committed due to branch mispredictions, it seems at
    least plausible that wider rename is not that helpful generally.

    Of course, supporting bursts makes sense -- otherwise execute
    would not be so wide nor retirement -- but more order-constrained
    front-end may not benefit as much from enabling bursts of high
    activity.

    With banking and replication a lot of RAT read and write ports
    could be supported and if rename-width internal forwarding
    avoids RAT access (instead transferring from the free list) the
    port count could be further reduced. Merging reused source names
    could also reduce RAT port count.
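
    The port-reduction tricks above can be sketched very loosely in C:
    sources written earlier in the same rename group are forwarded from
    the group's own allocations instead of reading the RAT. Every size,
    structure, and name here is illustrative, not any real machine's:

    ```c
    #include <string.h>

    /* Minimal map-table (RAT) rename sketch for one group. */
    enum { NARCH = 8 };

    typedef struct { int src1, src2, dst; } Inst;    /* arch reg numbers */
    typedef struct { int p1, p2, pd; int rat_reads; } Renamed;

    static int rat[NARCH];      /* arch reg -> physical reg */
    static int next_phys;       /* stand-in for the free list */

    static void rename_group(const Inst *in, Renamed *out, int n)
    {
        int written[NARCH];     /* -1, or index of group inst that wrote it */
        memset(written, -1, sizeof written);
        for (int i = 0; i < n; i++) {
            out[i].rat_reads = 0;
            /* source lookup: forward within the group when possible,
               saving a RAT read port */
            if (written[in[i].src1] >= 0)
                out[i].p1 = out[written[in[i].src1]].pd;
            else { out[i].p1 = rat[in[i].src1]; out[i].rat_reads++; }
            if (written[in[i].src2] >= 0)
                out[i].p2 = out[written[in[i].src2]].pd;
            else { out[i].p2 = rat[in[i].src2]; out[i].rat_reads++; }
            /* destination: allocate a fresh physical register */
            out[i].pd = next_phys++;
            written[in[i].dst] = i;
        }
        /* only each arch reg's last writer updates the RAT */
        for (int r = 0; r < NARCH; r++)
            if (written[r] >= 0) rat[r] = out[written[r]].pd;
    }
    ```

    Counting rat_reads per group shows how dependent chains inside a
    group trade RAT ports for forwarding comparators.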

    I am not suggesting that increasing rename width is trivial; it
    seems somewhat similar to the dependency tracking for in-order
    superscalar execution, which is part of what motivated VLIW.

    (In theory, dependent operations could have higher rename
    latency because they cannot execute for at least one cycle after
    the source value providing operation. I do not recall reading
    any proposal to exploit such and it would increase the
    complexity of filling the scheduler. Perhaps such variable
    rename latency might be useful in a banked RAT where bank
    conflicts could delay renaming. Preferring non-dependent (and
    older) operations might reduce the impact of bank contention.)

    I *suspect* that there is some architectural and
    microarchitectural opportunity to reduce the overhead of
    renaming, checkpointing, and other out-of-order execution
    mechanisms. Caching dependency information in a decoded
    instruction cache might be helpful, e.g.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Feb 8 17:48:13 2026
    From Newsgroup: comp.arch

    On 2/5/26 4:27 PM, MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:

    Paul Clayton <paaronclayton@gmail.com> posted:

    [snip]
    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    When would one want to decouple LL and SC into function calls
    away from the computation? Perhaps for in-place software
    instrumentation?

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not
    Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    I wonder if there is an issue of communicating intention to the
    computer. Using atomic-to-memory may be intended to communicate
    that the operation is expected to be under contention or that
    moderating the impact under high contention is more important
    than having a fast "happy path".

    There is a speed of light problem here. Communicating across a
    computer <system> is a microsecond time problem, whereas executing
    instructions is a nanosecond time problem.

    And this is exactly where Add-to-Memory gains over Interferable
    ATOMIC events--you only pay the latency once, now while the latency
    is higher than possible with LL-SC, it is WAY LOWER than worst case
    with LL-SC under serious contention.

    Yes. Even a single adder would have higher throughput than ping-
    ponging a cache block. One might even support a three-or-more
    input, two-or-more result adder (or perhaps exploit usually
    smaller addends) to increase throughput,
    though I suspect there would practically never be a case where a
    simple adder would have insufficient throughput.

    This seems to be similar to branch hints and predication in that
    urging the computer to handle the task in a specific way may not
    be optimal for the goal of the user/programmer.

    Explain

    Branch hints can be intended to reduce branch predictor aliasing
    (i.e., assume the static hint is used instead of a dynamic
    predictor), to provide agree information, to prefer one path
    even if it is less likely, to provide an initialization of the
    (per-address component only?) branch predictor, or for some
    other motive. The interface/architecture might not be specific
    about how such information will be used, especially if it is a
    hint (and programmers might disagree about what the best
    interface would be). If the interface is not very specific, a
    microarchitecture might violate a programmer's
    desire/expectation by ignoring the hint or using it in a
    different way.

    Similarly predication can be motivated to avoid fetch
    redirection (initial ARM and My 66000), to facilitate constant
    time execution, to avoid the performance cost of branch
    mispredictions, or perhaps for some reason that does not come to
    mind. Predicate prediction would foil constant time execution
    and might reduce performance (or merely introduce weird
    performance variation). Even the fetch optimization might be
    undone if the hardware discovers that the condition is extremely
    biased and folds out the rarely used instructions; which would
    be good for performance if the bias continues, but if the bias
    changes just frequently enough it could hurt performance.

    [snip]
    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    Except that prediction could violate the time constancy assumed
    by the programmer.

    If the requesting core has the cache block in exclusive or
    modified state, remote execution might be less efficient. Yet it

    In My 66000 architecture, an atomic operation on memory will have
    the property of being executed where the cache line in a modifiable
    state happens to reside. Here you don't want to "bring data in"
    nor do you want to "push data out" you want the memory hierarchy
    to be perturbed as little as possible WHILE performing the ATOMIC
    event.

    Okay, that makes sense. The earlier statement implied (to me)
    that the operation was always "centralized".

    The cost is that each cache will have an adder--with the sizes of
    cache these days, said adder is a trifling in area and seldom used
    so it is low power.

    may also be possible that the block is expected to be moved to
    another core such that pushing the data manipulation to a
    "centralized" location would improve performance, as was the
    programmer's intent (rather than avoiding contention overhead). (I
    suppose in My 66000, a programmer would use a push/"prefetch"
    instruction to move the cache block to a more central location,
    but even that might be sub-optimal if the hardware can predict
    the next data consumer such that centrally located would be
    slower than shifting the data closer to the consumer.)

    I have been betting that (in the general case) software will
    remain incapable of doing such predictions, for quite some time.

    A bet with very good odds in general, but I am sure there are
    still more than one "Mel" around who could optimize data
    movement.

    In cases like simple pipelines, the data communication pattern
    is obvious. For an embedded system, communicating stores might
    go directly to another core's local memory. With more general
    purpose compute, thread migration would be more common (even
    with more cores than runnable threads) and abstracting a
    communication interface might be more challenging.

    [snip]
    Clever microarchitecture can make some optimizations sub-optimal
    as well as cause programmers to ask "why didn't you do it the way
    I told you to do it?!"

    Instruction scheduling is in effect an optimization for one
    implementation that may not hold for other implementations. In-order
    pipelines need instruction scheduling, OoO do not, and GBOoO generally
    perform better if/when instructions are not (or less) scheduled.

    Even OoO implementations can benefit from facilitating
    instruction fusion and perhaps even scheduler allocation to
    reduce communication overhead. A RAT-based renamer might also
    benefit from having duplicate register names in the same rename
    chunk. With in-order renaming, in theory a compiler could also
    optimize RAT bank conflicts. This might not be considered
    scheduling, but when the instruction stream is serial placement
    is effectively scheduling.

    Loop unrolling is a case where if your machine has vVM the code
    runs faster when not {unrolled and executed with branch prediction}
    than when {}.

    Yeah.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Feb 9 19:28:25 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 2/5/26 4:27 PM, MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:

    Paul Clayton <paaronclayton@gmail.com> posted:

    [snip]
    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    When would one want to decouple LL and SC into function calls
    away from the computation? Perhaps for in-place software
    instrumentation?

    Write, in pure K&R C, the functionality for LoadLocked and
    StoreConditional.

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not
    Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    I wonder if there is an issue of communicating intention to the
    computer. Using atomic-to-memory may be intended to communicate
    that the operation is expected to be under contention or that
    moderating the impact under high contention is more important
    than having a fast "happy path".

    There is a speed of light problem here. Communicating across a
    computer <system> is a microsecond time problem, whereas executing
    instructions is a nanosecond time problem.

    And this is exactly where Add-to-Memory gains over Interferable
    ATOMIC events--you only pay the latency once, now while the latency
    is higher than possible with LL-SC, it is WAY LOWER than worst case
    with LL-SC under serious contention.

    Yes. Even a single adder would have higher throughput than ping-
    ponging a cache block. One might even support a three-or-more
    input, two-or-more result adder (or perhaps exploit usually
    smaller addends) to increase throughput,
    though I suspect there would practically never be a case where a
    simple adder would have insufficient throughput.

    This seems to be similar to branch hints and predication in that
    urging the computer to handle the task in a specific way may not
    be optimal for the goal of the user/programmer.

    Explain

    Branch hints can be intended to reduce branch predictor aliasing
    (i.e., assume the static hint is used instead of a dynamic
    predictor), to provide agree information, to prefer one path
    even if it is less likely, to provide an initialization of the
    (per-address component only?) branch predictor, or for some
    other motive. The interface/architecture might not be specific
    about how such information will be used, especially if it is a
    hint (and programmers might disagree about what the best
    interface would be). If the interface is not very specific, a microarchitecture might violate a programmer's
    desire/expectation by ignoring the hint or using it in a
    different way.

    Similarly predication can be motivated to avoid fetch
    redirection (initial ARM and My 66000), to facilitate constant
    time execution, to avoid the performance cost of branch
    mispredictions, or perhaps for some reason that does not come to
    mind. Predicate prediction would foil constant time execution
    and might reduce performance (or merely introduce weird
    performance variation). Even the fetch optimization might be
    undone if the hardware discovers that the condition is extremely
    biased and folds out the rarely used instructions; which would
    be good for performance if the bias continues, but if the bias
    changes just frequently enough it could hurt performance.

    [snip]
    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    Except that prediction could violate the time constancy assumed
    by the programmer.

    Time constancy is provided by executing both the then-clause and the
    else-clause and using CMOV to decide on flow.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Mon Feb 9 20:36:54 2026
    From Newsgroup: comp.arch

    On 2/9/26 2:28 PM, MitchAlsup wrote:

    Paul Clayton <paaronclayton@gmail.com> posted:

    On 2/5/26 4:27 PM, MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:

    Paul Clayton <paaronclayton@gmail.com> posted:

    [snip]
    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    When would one want to decouple LL and SC into function calls
    away from the computation? Perhaps for in-place software
    instrumentation?

    Write, in pure K&R C, the functionality for LoadLocked and
    StoreConditional.

    Then why would the compiler not inline such? It seems reasonable
    (to me) to blame poor scaling performance in that case on the
    compiler which did not inline a one instruction function (or on
    the developer who intentionally disabled such optimization).

    [snip]

    [snip]
    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    Except that prediction could violate the time constancy assumed
    by the programmer.

    Time constancy is provided by executing both the then-clause and the
    else-clause and using CMOV to decide on flow.

    This means ordinary My 66000 predication cannot be used for
    such. I have no idea whether writers of cryptographic software
    would be upset that a mechanism which seems like it would
    provide constant time is documented not to do so. (I would
    _guess_ that most of the embedded strong guarantee real time
    software might select hardware that never does predict
    predication since caches and similar general purpose
    optimizations tend to increase the difficulty of providing
    strong timing guarantees.)


    Is My 66000 going to have CMOV? What about something like some
    of the AArch64 simple conditional operations? (Yes, such gets
    un-RISCy both from count of defined instructions and instruction
    complexity, but I have read that the supported operations are
    somewhat common for single instruction hammock branches and are
    easier to fit microarchitecturally and to fit within a 32-bit
    instruction.) The code bloat (8 bytes vs. 4 bytes) and indirect
    encoding (i.e., vagaries of idiom recognition -- though an
    encoded instruction can be implemented with performance that
    breaks expectations) are disadvantages of just using PRED, but
    PRED also removes artificial restrictions like only incrementing
    by 1 (such that idiom recognition could be implemented to fast
    path any conditional add_immediate).

    One interesting case for predication is code like the following:
    if (a < 25) {a += 8;}

    The predicate is known to be available at least early enough to
    nullify the addition executed in parallel with the compare
    because the register operand is the same and the only other
    operands are immediates. It is not clear if this presents a
    useful optimization opportunity, but it seems an interesting
    case.
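
    A minimal C sketch of the shapes this hammock can take; with_select
    is roughly what a CMOV-converting compiler emits, and with_mask is
    the branch-free form constant-time code tends to use. Function names
    are illustrative:

    ```c
    /* Original hammock: a conditional branch over one add. */
    static int with_branch(int a)
    {
        if (a < 25) { a += 8; }
        return a;
    }

    /* Predication/CMOV style: always compute a+8, keep it only when
       the predicate holds (what a PRED-shadowed ADD would do). */
    static int with_select(int a)
    {
        int t = a + 8;
        return (a < 25) ? t : a;
    }

    /* Branch-free arithmetic variant using an all-ones/all-zeros
       mask: -(a < 25) is -1 when true, 0 when false. */
    static int with_mask(int a)
    {
        int m = -(a < 25);
        return a + (8 & m);
    }
    ```

    All three compute the same function; they differ only in whether
    control flow, a select, or pure arithmetic carries the condition.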

    I am also curious about what you view as the trade-offs of
    conditional move compared with conditional select. Obviously
    conditional select potentially includes a register copy and for
    typical OoO implementations the mechanism is similar because a
    conditional move renames the destination and reads the old value
    and the alternative value. Encoding the condition source may be
    harder for conditional select (e.g., test register A is/is not
    zero, based on test place register B or register C in register
    D, which requires four register names while conditional move
    exploits that one of the sources is the same as the
    destination); for My 66000 a generalized condition after a
    comparison would, I think, require a 6-bit condition bit
    specifier and a register source for the condition, which just
    fits in a 32-bit instruction (6-bit opcode, 6-bit condition bit
    specifier, four 5-bit register names).
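
    The naming-cost difference can be sketched in C: a select consumes
    four names (condition, both sources, destination), while a move
    reuses the destination as the implicit else-source and so encodes
    only three. A hypothetical sketch, not any real ISA's semantics:

    ```c
    /* Conditional select: d = cond ? b : c -- four register names. */
    static int csel(int cond, int b, int c)
    {
        return cond ? b : c;
    }

    /* Conditional move: d = cond ? src : d -- three names, because
       the destination doubles as the not-taken source. */
    static void cmov(int cond, int src, int *d)
    {
        if (cond) { *d = src; }    /* *d keeps its old value otherwise */
    }
    ```

    An OoO renamer sees both the same way: the new physical destination
    depends on the condition and on both candidate values.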
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Feb 10 15:55:05 2026
    From Newsgroup: comp.arch

    Paul Clayton <paaronclayton@gmail.com> writes:
    On 2/9/26 2:28 PM, MitchAlsup wrote:


    This means ordinary My 66000 predication cannot be used for
    such. I have no idea whether writers of cryptographic software
    would be upset that a mechanism which seems like it would
    provide constant time is documented not to do so. (I would
    _guess_ that most of the embedded strong guarantee real time
    software might select hardware that never does predict
    predication since caches and similar general purpose
    optimizations tend to increase the difficulty of providing
    strong timing guarantees.)

    ARMv8 includes a "DIT" feature. Data Independent Timing. If
    that feature is present and enabled, software can be
    assured that a DIT capable instruction that processes
    data completes in the same time, regardless of the
    data being processed. Specifically applies to instructions that
    manipulate arithmetic data (loads, stores, integer, floating point,
    conditional compares, crypto instructions, etc).

    https://developer.arm.com/documentation/ddi0601/2025-12/AArch64-Registers/DIT--Data-Independent-Timing?lang=en
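
    The classic software pattern that depends on such a guarantee is a
    comparison that touches every byte regardless of where a mismatch
    occurs. A minimal sketch, assuming (as DIT promises) that the
    underlying loads and logical ops are data-independent in time:

    ```c
    #include <stddef.h>

    /* Constant-time equality check: accumulates differences over the
       whole buffer instead of returning at the first mismatch, so the
       running time does not depend on the data values. */
    static int ct_equal(const unsigned char *a, const unsigned char *b,
                        size_t n)
    {
        unsigned char acc = 0;
        for (size_t i = 0; i < n; i++)
            acc |= (unsigned char)(a[i] ^ b[i]);
        return acc == 0;
    }
    ```

    A data-dependent early exit here is exactly the timing leak that
    crypto code, and hence a DIT-style mode, is trying to rule out.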

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Feb 10 19:02:09 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 2/9/26 2:28 PM, MitchAlsup wrote:

    Paul Clayton <paaronclayton@gmail.com> posted:

    On 2/5/26 4:27 PM, MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:

    Paul Clayton <paaronclayton@gmail.com> posted:

    [snip]
    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    When would one want to decouple LL and SC into function calls
    away from the computation? Perhaps for in-place software
    instrumentation?

    Write, in pure K&R C, the functionality for LoadLocked and StoreConditional.

    Then why would the compiler not inline such? It seems reasonable
    (to me) to blame poor scaling performance in that case on the
    compiler which did not inline a one instruction function (or on
    the developer who intentionally disabled such optimization).

    Because in pure K&R C there is no concept of atomic things, and
    thus one has to resort to ASM--beyond this, there is no concept
    of inlining.

    [snip]

    [snip]
    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    Except that prediction could violate the time constancy assumed
    by the programmer.

    Time constancy is provided by executing both the then-clause and the
    else-clause and using CMOV to decide on flow.

    This means ordinary My 66000 predication cannot be used for
    such. I have no idea whether writers of cryptographic software
    would be upset that a mechanism which seems like it would
    provide constant time is documented not to do so.

    I neither stated such nor implied such.

    (I would
    _guess_ that most of the embedded strong guarantee real time
    software might select hardware that never does predict
    predication since caches and similar general purpose
    optimizations tend to increase the difficulty of providing
    strong timing guarantees.)


    Is My 66000 going to have CMOV?

    Has had it for years, it also has multi-instruction predication
    since forever.

    What about something like some
    of the AArch64 simple conditional operations? (Yes, such gets
    un-RISCy both from count of defined instructions and instruction
    complexity, but I have read that the supported operations are
    somewhat common for single instruction hammock branches and are
    easier to fit microarchitecturally and to fit within a 32-bit
    instruction.) The code bloat (8 bytes vs. 4 bytes) and indirect
    encoding (i.e., vagaries of idiom recognition -- though an
    encoded instruction can be implemented with performance that
    breaks expectations) are disadvantages of just using PRED, but
    PRED also removes artificial restrictions like only incrementing
    by 1 (such that idiom recognition could be implemented to fast
    path any conditional add_immediate).

    One interesting case for predication is code like the following:
    if (a < 25) {a += 8;}

    The predicate is known to be available at least early enough to
    nullify the addition executed in parallel with the compare
    because the register operand is the same and the only other
    operands are immediates. It is not clear if this presents a
    useful optimization opportunity, but it seems an interesting
    case.

    I am also curious about what you view as the trade-offs of
    conditional move compared with conditional select. Obviously
    conditional select potentially includes a register copy and for
    typical OoO implementations the mechanism is similar because a
    conditional move renames the destination and reads the old value
    and the alternative value. Encoding the condition source may be
    harder for conditional select (e.g., test register A is/is not
    zero, based on test place register B or register C in register
    D, which requires four register names while conditional move
    exploits that one of the sources is the same as the
    destination); for My 66000 a generalized condition after a
    comparison would, I think, require a 6-bit condition bit
    specifier and a register source for the condition, which just
    fits in a 32-bit instruction (6-bit opcode, 6-bit condition bit
    specifier, four 5-bit register names).

    Clang goes out of its way to convert things like the above into
    CMOV form. Often this pre-optimization takes more instructions
    and runs slower than pure predication. Almost always, when the
    then-clause or else-clause is not already in a register, the CMOV
    version is longer and slower.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Feb 10 20:23:12 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Paul Clayton <paaronclayton@gmail.com> posted:

    On 2/9/26 2:28 PM, MitchAlsup wrote:
    <snip>


    Write, in pure K&R C, the functionality for LoadLocked and
    StoreConditional.

    Then why would the compiler not inline such? It seems reasonable
    (to me) to blame poor scaling performance in that case on the
    compiler which did not inline a one instruction function (or on
    the developer who intentionally disabled such optimization).

    Because in pure K&R C there is no concept of atomic things, and
    thus one has to resort to ASM--beyond this, there is no concept
    of inlining.

    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).
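
    The byte-array trick can still be demonstrated today, though a modern
    OS makes you request an executable page first — this sketch assumes
    x86-64 and POSIX mmap, neither of which existed in the K&R era, so it
    is an illustration of the idea rather than period-accurate code:

    ```c
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        /* x86-64 object code for:  mov eax, 42 ; ret */
        static const unsigned char code[] =
            { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

        /* On a K&R-era machine with writable text you could jump straight
           at the array; today we copy it into an explicitly executable
           mapping first. */
        void *page = mmap(NULL, sizeof code,
                          PROT_READ | PROT_WRITE | PROT_EXEC,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            return 1;
        memcpy(page, code, sizeof code);

        int (*fn)(void) = (int (*)(void))page;  /* cast to function pointer */
        printf("%d\n", fn());                   /* calls the generated code */
        return 0;
    }
    ```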
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Tue Feb 10 20:39:11 2026
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).

    Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
    /44, user programs had separate instruction and data mappings. The
    smaller ones, the 11/34, /40, and /60, didn't.

    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Feb 10 21:15:08 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Scott Lurndal <slp53@pacbell.net>:
    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).

    Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
    /44, user programs had separate instuction and data mappings. The
    smaller ones, the 11/34, /40, and /60 didn't.

    The 11/34 was what I had been using in the late 70's (Unix V6)
    and a bit of RSTS/E. That was after several years of TSS8.24
    where self-modifying code was de rigueur.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Feb 11 09:19:14 2026
    From Newsgroup: comp.arch

    John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).

    Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
    /44, user programs had separate instuction and data mappings. The
    smaller ones, the 11/34, /40, and /60 didn't.

    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was trivially comparable.

    The final option was to extract a short prefix (8 bytes) of the key data
    and accept that the full sort would only deliver a partially sorted
    result, so that I had to do a final sort pass over each chunk of equal
    prefix.

    This could well be faster than a complicated sort operator, even if it
    was run-time generated, particularly since that runtime code would need
    to be called.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 11 22:11:09 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).

    Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
    /44, user programs had separate instuction and data mappings. The
    smaller ones, the 11/34, /40, and /60 didn't.

    We knew it was possible to do runtime code generation, since it's an ancient technique widely used by sort programs to speed up the inner comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was trivially comparable.

    40 years ago, Duff's device was considered a reasonable way to improve performance. It is now considered a means to take over applications
    and machines (i.e., a virus).

    The final option was to extract a short prefix (8-bytes) of the key data
    and accept that the full sort would only deliver a partially sorted
    results, so that I had to do a final sort pass over each chunk of equal prefix.

    This could well be faster than a complicated sort operator, even if it
    was run-time generated, particularly since that runtime code would need
    to be called.

    Terje

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Feb 11 22:12:06 2026
    From Newsgroup: comp.arch

    According to Terje Mathisen <terje.mathisen@tmsw.no>:
    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was
    trivially comparable.

    The final option was to extract a short prefix (8-bytes) of the key data
    and accept that the full sort would only deliver a partially sorted
    results, so that I had to do a final sort pass over each chunk of equal
    prefix.

    I'm pretty sure you'll find all of those in sort programs from the 1960s.
    Sorting used more computer time than anything else so they put a great
    deal of work into making it fast.

    See, for example:

    https://bitsavers.org/pdf/ibm/360/os/R01-08/C28-6543-3_Sort_Merge_Feb67.pdf
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Feb 12 14:23:50 2026
    From Newsgroup: comp.arch

    John Levine wrote:
    According to Terje Mathisen <terje.mathisen@tmsw.no>:
    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was
    trivially comparable.

    The final option was to extract a short prefix (8-bytes) of the key data
    and accept that the full sort would only deliver a partially sorted
    results, so that I had to do a final sort pass over each chunk of equal
    prefix.

    I'm pretty sure you'll find all of those in sort programs from the 1960s. Sorting used more computer time than anything else so they put a great
    deal of work into making it fast.

    See, for example:

    https://bitsavers.org/pdf/ibm/360/os/R01-08/C28-6543-3_Sort_Merge_Feb67.pdf

    My "Algorithms and Data Structures" professor told us that IBM had
    patented a zero-time sort chip, which they then promptly buried since
    their internal stats said that over all IBM mainframes, 60% of the CPU
    hours were spent in various forms of sorting.

    Terje

    PS. The chip was a ladder of comparators, so that you could use a
    channel to stream data into it, then when you were done (or the chip was
    full) you would run the channel transfer in the opposite direction to
    retrieve the sorted data.
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Feb 12 15:47:01 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Terje Mathisen <terje.mathisen@tmsw.no>:
    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was
    trivially comparable.

    The final option was to extract a short prefix (8-bytes) of the key data
    and accept that the full sort would only deliver a partially sorted
    results, so that I had to do a final sort pass over each chunk of equal
    prefix.

    I'm pretty sure you'll find all of those in sort programs from the 1960s.
    Sorting used more computer time than anything else so they put a great
    deal of work into making it fast.

    Indeed, the sort subsystem on the Burroughs systems was, perhaps,
    even more important than the MCP itself to a lot of customers.

    Although large sorts in those days were done using magnetic tapes.

    (watching a large sort-merge running on a dozen 9-track drives
    was pretty impressive).
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Feb 12 19:25:03 2026
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    I'm pretty sure you'll find all of those in sort programs from the 1960s.
    Sorting used more computer time than anything else so they put a great
    deal of work into making it fast.

    Indeed, the sort subsystem on the Burroughs systems was, perhaps,
    even more important than the MCP itself to a lot of customers.

    Although large sorts in those days were done using magnetic tapes.

    (watching a large sort-merge running on a dozen 9-track drives
    was pretty impressive).

    Yup. I suspect that the read backward feature on tape drives was
    invented to speed up sorting, when someone realized that if you'd
    written a set of tapes, you could read them backward and merge into the
    next set of tapes in reverse order, repeat until done. You might need
    an extra pass at the end if the final result was an even-numbered pass
    written in reverse order, but you still won overall because you didn't
    have to rewind between all the intermediate passes.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Feb 12 20:01:38 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Scott Lurndal <slp53@pacbell.net>:
    I'm pretty sure you'll find all of those in sort programs from the 1960s.
    Sorting used more computer time than anything else so they put a great
    deal of work into making it fast.

    Indeed, the sort subsystem on the Burroughs systems was, perhaps,
    even more important that the MCP itself to a lot of customers.

    Although large sorts in those days were done using magnetic tapes.

    (watching a large sort-merge running on a dozen 9-track drives
    was pretty impressive).

    Yup. I suspect that the read backward feature on tape drives was
    invented to speed up sorting,

    Indeed. The Burroughs SORT guru[*] was quite proud of his
    speed enhancements when read-reverse became available.

    His SORT intrinsics (SORT.(60s) and SORT:(70s-90s)) were
    extremely efficient (hand written assembler, for the most part)
    and used by all languages that had sorting capabilities (COBOL68/74/85,
    BPL, SPRITE).

    [*] RIP Bernie Buller, 2023/09/11.

    (He was also a fan of folk music and Oldsmobiles, and a tribute is up on youtube).
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Feb 12 23:07:16 2026
    From Newsgroup: comp.arch

    It appears that Terje Mathisen <terje.mathisen@tmsw.no> said:
    My "Algorithms and Data Structures" professor told us that IBM had
    patented a zero-time sort chip, which they then promptly buried since
    their internal stats said that over all IBM mainframes, 60% of the CPU
    hours were spent in various forms of sorting.

    Terje

    PS. The chip was a ladder of comparators, so that you could use a
    channel to stream data into it, then when you were done (or the chip was
    full) you would run the channel transfer in the opposite direction to
    retrieve the sorted data.

    Sounds like an urban legend to me. For one thing, on machines of that
    era no useful sort fit in core so it spent most of its time doing disk
    and tape I/O anyway.

    For another, if it really did make a difference, they would have
    called it the Improved Sort Performance Facility and sold it as an
    extra cost option.

    These days z/Series has the optional Enhanced-Sort Facility added in
    September 2020 which adds a complex SORT LISTS instruction that does
    in-memory sorts or merges, presumably in microcode, so it can run at full
    memory speed.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Feb 12 23:13:23 2026
    From Newsgroup: comp.arch

    According to John Dallman <jgd@cix.co.uk>:
    In article <10ml9ef$264n$1@gal.iecc.com>, johnl@taugh.com (John Levine)
    wrote:

    I suspect that the read backward feature on tape drives was
    invented to speed up sorting, when someone realized that if
    you'd written a set of tapes, you could read them backward
    and merge into the next set of tapes in reverse order, repeat
    until done.

    Emerson W Pugh's _IBM's 360 & Early 370 Systems_ says that was the reason.
    <https://www.amazon.co.uk/dp/0262517205>

    So it does, on page 229. I have a paper copy of that book that I read long ago, so I guess I can't claim the idea was original.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Feb 13 00:04:24 2026
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    It appears that Terje Mathisen <terje.mathisen@tmsw.no> said:
    My "Algorithms and Data Structures" professor told us that IBM had >patented a zero-time sort chip, which they then promptly buried since >their internal stats said that over all IBM mainframes, 60% of the CPU >hours were spent in various forms of sorting.

    Terje

    PS. The chip was a ladder of comparators, so that you could use a
    channel to stream data into it, then when you were done (or the chip was
    full) you would run the channel transfer in the opposite direction to
    retrieve the sorted data.

    Sounds like an urban legend to me. For one thing, on machines of that
    era no useful sort fit in core so it spent most of its time doing disk
    and tape I/O anyway.

    I remember hearing about the chip (circa 1986-8).
    I agree that the number of entries would not have been enough to justify.
    I also agree that Interrupt response time after the I/O would have
    prevented any useful performance advantage.

    By the mid-1990s the entry count could have been large enough to justify.

    For another, if it really did make a difference, they would have
    called it the Improved Sort Performance Facility and sold it as an
    extra cost option.

    These days z/Series has the optional Enhanced-Sort Facility added in
    September 2020 which adds a complex SORT LISTS instruction that does
    in-memory sorts or merges presumably in microcode so it can run at full
    memory speed.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Feb 12 23:34:10 2026
    From Newsgroup: comp.arch

    On 2/11/2026 2:12 PM, John Levine wrote:
    According to Terje Mathisen <terje.mathisen@tmsw.no>:
    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was
    trivially comparable.

    The final option was to extract a short prefix (8-bytes) of the key data
    and accept that the full sort would only deliver a partially sorted
    results, so that I had to do a final sort pass over each chunk of equal
    prefix.

    I'm pretty sure you'll find all of those in sort programs from the 1960s. Sorting used more computer time than anything else so they put a great
    deal of work into making it fast.

    See, for example:

    https://bitsavers.org/pdf/ibm/360/os/R01-08/C28-6543-3_Sort_Merge_Feb67.pdf

    Yup. And note that having a faster sort program than the standard IBM
    one made the company Syncsort a lot of money.

    https://en.wikipedia.org/wiki/Precisely_(company)
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Feb 13 11:54:15 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).

    Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
    /44, user programs had separate instuction and data mappings. The
    smaller ones, the 11/34, /40, and /60 didn't.

    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was
    trivially comparable.

    40 years ago, Duff's device was considered a reasonable way to improve performance. It is now considered a means to take over applications

    Was it ever that?

    Yes, it was a very compact way to express the same logic you would use
    in asm to jump into the correct spot of an unrolled loop, but did it
    actually save any time compared to a plain unroll followed by a final fixup?

    and machines (i.e., a virus).

    Duff would never compile into anything strange, but as a way to
    obfuscate it could work I guess? Absolutely no need to flag it as
    malware/virus!

    My own code which would jump into the immediate constant part of another instruction, treating data as code, could reasonably trigger some warnings.

    OTOH, "The Story of Mel" will forwever be the shining star for us
    oldtimers, right?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Feb 13 11:36:05 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    John Levine <johnl@taugh.com> posted:

    It appears that Terje Mathisen <terje.mathisen@tmsw.no> said:
    My "Algorithms and Data Structures" professor told us that IBM had
    patented a zero-time sort chip, which they then promptly buried since
    their internal stats said that over all IBM mainframes, 60% of the CPU
    hours were spent in various forms of sorting.

    Terje

    PS. The chip was a ladder of comparators, so that you could use a
    channel to stream data into it, then when you were done (or the chip was
    full) you would run the channel transfer in the opposite direction to
    retrieve the sorted data.
    Sounds like an urban legend to me. For one thing, on machines of that
    era no useful sort fit in core so it spent most of its time doing disk
    and tape I/O anyway.

    I remember hearing about the chip (circa 1986-8).
    I agree that the number of entries would not have been enough to justify.
    I also agree that Interrupt response time after the I/O would have
    prevented any useful performance advantage.

    By the mid-1990s the entry count could have been large enough to justify.

    For another, if it really did make a difference, they would have
    called it the Improved Sort Performance Facility and sold it as an
    extra cost option.

    These days z/Series has the optional Enhanced-Sort Facility added in
    September 2020 which adds a complex SORT LISTS instruction that does
    in-memory sorts or merges presuably in microcode so it can run at full
    memory speed.


    A "Zero-Time" VLSI Sorter, 1983, IBM Journal of Research and Development
    https://scholar.archive.org/work/jnapdo24jbfaxhnxbiybonbqxu/access/wayback/http://pdfs.semanticscholar.org/e72e/903f94fb7c121efca13a717a1e713b3321f1.pdf

    "A hardware sorter suitable for VLSI implementation is proposed.
    It operates in a parallel and pipelined fashion, with the actual
    sorting time absorbed by the input/output time. A detailed VLSI
    implementation is described which has a very favorable device
    count compared to existing static RAM."



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Feb 13 18:53:59 2026
    From Newsgroup: comp.arch

    According to Terje Mathisen <terje.mathisen@tmsw.no>:
    40 years ago, Duff's device was considered a reasonable way to improve
    performance. It is now considered a means to take over applications

    Was it ever that?

    Yes, it was a very compact way to express the same logic you would use
    in asm to jump into the correct spot of an unrolled loop, but did it
    actually save any time compared to a plain unroll followed by a final
    fixup?

    Time, no, but on a 64K machine space was precious and it got much of the benefit of unrolling without as much space.
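
    For readers who haven't seen it, the device being discussed can be
    sketched as a byte copy (Duff's original streamed to a single
    memory-mapped output location; this is the commonly quoted array
    variant, assuming count > 0):

    ```c
    #include <stdio.h>

    /* Duff's device: the switch jumps into the middle of an unrolled
       copy loop, so the leftover (count % 8) bytes are handled by the
       same code as the full groups of 8. */
    static void duff_copy(char *to, const char *from, int count) {
        int n = (count + 7) / 8;        /* number of loop passes */
        switch (count % 8) {
        case 0: do { *to++ = *from++;
        case 7:      *to++ = *from++;
        case 6:      *to++ = *from++;
        case 5:      *to++ = *from++;
        case 4:      *to++ = *from++;
        case 3:      *to++ = *from++;
        case 2:      *to++ = *from++;
        case 1:      *to++ = *from++;
                } while (--n > 0);
        }
    }

    int main(void) {
        const char src[] = "hello, duff";
        char dst[sizeof src] = {0};
        duff_copy(dst, src, (int)sizeof src);   /* copies the NUL too */
        printf("%s\n", dst);
        return 0;
    }
    ```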

    Duff would never compile into anything strange, but as a way to
    obfuscate it could work I guess? Absolutely no need to flag it as
    malware/virus!

    Agreed, it was ugly but quite legit.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Sat Feb 14 10:29:35 2026
    From Newsgroup: comp.arch

    On Wed, 11 Feb 2026 22:11:09 GMT, MitchAlsup
    <user5857@newsgrouper.org.invalid> wrote:


    40 years ago, Duff's device was considered a reasonable way to improve
    performance. It is now considered a means to take over applications
    and machines (i.e., a virus).

    I don't know that it ever really improved performance, but it still is
    useful. I use the switch() form for backing out of situations having complicated setup where processing isn't possible unless all the setup
    steps succeed.

    I keep track of how many setup steps successfully complete, and then,
    whatever the exit situation, I jump into a Duff switch that tears it
    all down in reverse order of the setup.
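
    That teardown pattern can be sketched like this (hypothetical setup
    steps; each case falls through, so jumping in at the count of
    completed steps unwinds everything below it in reverse order):

    ```c
    #include <stdio.h>

    int main(void) {
        int done = 2;   /* say setup steps 1 and 2 succeeded, step 3 failed */

        switch (done) { /* enter at the last completed step, unwind in reverse */
        case 3: puts("undo step 3"); /* fall through */
        case 2: puts("undo step 2"); /* fall through */
        case 1: puts("undo step 1"); /* fall through */
        case 0: break;  /* nothing was set up, nothing to undo */
        }
        return 0;
    }
    ```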
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From jm@jm@bourguet.org to comp.arch on Sat Feb 14 21:04:05 2026
    From Newsgroup: comp.arch

    George Neuner <gneuner2@comcast.net> writes:

    On Wed, 11 Feb 2026 22:11:09 GMT, MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    40 years ago, Duff's device was considered a reasonable way to improve
    performance. It is now considered a means to take over applications
    and machines (i.e., a virus).

    I don't know that it ever really improved performance, but it still is useful. I use the switch() form for backing out of situations having complicated setup where processing isn't possible unless all the setup
    steps succeed.

    I keep track of how many setup steps successfully complete, and then, whatever the exit situation, I jump into a Duff switch that tears it
    all down in reverse order of the setup.

    This seems to be missing an important aspect of the Duff switch: the
    possibility of putting case labels inside control structures. Duff's
    device uses a while starting before the first label, but you can do
    other strange things like jumping into an if:

    switch (i) {
    case 1:
        if (j == 3) {
    case 2:
            printf("i == 2 || (i == 1 && j == 3)\n");
        }
    }

    or is your teardown complex enough to use such things?

    Yours,
    --
    Jean-Marc
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Feb 14 20:17:05 2026
    From Newsgroup: comp.arch


    jm <jm@bourguet.org> posted:

    George Neuner <gneuner2@comcast.net> writes:

    On Wed, 11 Feb 2026 22:11:09 GMT, MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    40 years ago, Duff's device was considered a reasonable way to improve
    performance. It is now considered a means to take over applications
    and machines (i.e., a virus).

    I don't know that it ever really improved performance, but it still is useful. I use the switch() form for backing out of situations having complicated setup where processing isn't possible unless all the setup steps succeed.

    I keep track of how many setup steps successfully complete, and then, whatever the exit situation, I jump into a Duff switch that tears it
    all down in reverse order of the setup.

    This seems to be missing an important aspect of the Duff switch: the
    possibility of putting case labels inside control structures. Duff's
    device uses a while starting before the first label, but you can do
    other strange things like jumping into an if:

    switch (i) {
    case 1:
        if (j == 3) {
    case 2:
            printf("i == 2 || (i == 1 && j == 3)\n");
        }
    }
    ----------------------------------------------------------------
    if (i == 2 || (i == 1 && j == 3))
        printf("i == 2 || (i == 1 && j == 3)\n");

    is going to be fewer instructions.

    or is your teardown complex enough to use such things?

    Yours,

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Sat Feb 14 17:58:00 2026
    From Newsgroup: comp.arch

    On Sat, 14 Feb 2026 21:04:05 +0100, jm <jm@bourguet.org> wrote:

    George Neuner <gneuner2@comcast.net> writes:

    On Wed, 11 Feb 2026 22:11:09 GMT, MitchAlsup
    <user5857@newsgrouper.org.invalid> wrote:


    40 years ago, Duff's device was considered a reasonable way to improve
    performance. It is now considered a means to take over applications
    and machines (i.e., a virus).

    I don't know that it ever really improved performance, but it still is
    useful. I use the switch() form for backing out of situations having
    complicated setup where processing isn't possible unless all the setup
    steps succeed.

    I keep track of how many setup steps successfully complete, and then,
    whatever the exit situation, I jump into a Duff switch that tears it
    all down in reverse order of the setup.

    This seems to be missing an important aspect of the Duff switch: the
    possibility of putting case labels inside control structures. Duff's
    device uses a while starting before the first label, but you can do
    other strange things like jumping into an if:

    switch (i) {
    case 1:
        if (j == 3) {
    case 2:
            printf("i == 2 || (i == 1 && j == 3)\n");
        }
    }

    or is your teardown complex enough to use such things?

    Yours,

    I've never had to do anything like your example. I realize that it
    would work, but in fact I've actually never seen anything like that
    done in real life.

    For me, the utility of Duff's switch is that it provides multiple
    entry points into ... what arguably could be considered a single block
    of related code.

    Of course, C permits any case in a switch to fall through. What makes
    a particular case run a "Duff" (to me) is that the run includes at
    least 2 cases which lead to separate[*] code, and every case in the
    run falls through except the last one which exits the switch.

    [*] separate. not necessarily different or unique.

    I try not to put multiple "Duffs" in the same switch. I have had
    occasion to do it, but only very rarely.


    Certainly Duffs can do more than I use them for. I use them only for
    what I need.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Feb 15 09:36:18 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    These days z/Series has the optional Enhanced-Sort Facility added in
    September 2020 which adds a complex SORT LISTS instruction that does
    in-memory sorts or merges, presumably in microcode, so it can run at full
    memory speed.

    If the microarchitecture is worth anything, microcode does not run any
    faster than architectural code, when doing the same operations.

    One possible reason for the additional instruction is that there is a
    hardware feature that provides a speedup; in that case the hardware
    feature, not the microcode, is the reason for the speedup. Is there
    such a hardware feature for SORT LISTS, and if so, what is it?

    For such hardware features the question is what should be exposed at
    the architectural level. One can use it internally in a high-level
    instruction like SORT LISTS, or expose it more directly in a
    lower-level instruction (e.g., Intel has AESENC, which just performs
    one round of AES encryption).

    The advantage of the lower-level instruction X is that it may have
    other uses, and no microcode for SORT LISTS is necessary (software is
    more malleable); the disadvantage is that the specific functionality
    of the hardware feature is now architectural, i.e., cast in stone, and
    if a different hardware feature in the future would be better or
    cheaper for the SORT LISTS functionality, one still has to provide the
    X functionality in some way, and then decide whether to also implement
    the other functionality, or just live with X.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Feb 15 14:34:11 2026
    From Newsgroup: comp.arch

    John Levine wrote:
    It appears that Terje Mathisen <terje.mathisen@tmsw.no> said:
    My "Algorithms and Data Structures" professor told us that IBM had
    patented a zero-time sort chip, which they then promptly buried since
    their internal stats said that over all IBM mainframes, 60% of the CPU
    hours were spent in various forms of sorting.

    Terje

    PS. The chip was a ladder of comparators, so that you could use a
    channel to stream data into it, then when you were done (or the chip was
    full) you would run the channel transfer in the opposite direction to
    retrieve the sorted data.

    Sounds like an urban legend to me. For one thing, on machines of that
    era no useful sort fit in core so it spent most of its time doing disk
    and tape I/O anyway.

    Professor Halaas (he had written the textbook we used and later on got
    FAST to actually produce a search chip) quoted an actual patent number
    afair.

    And that was probably the main (but not only!) reason for not actually
    creating the hardware. Besides, patents were already a way to gain
    bonuses/salary increases, right?

    For another, if it really did make a difference, they would have
    called it the Improved Sort Performance Facility and sold it as an
    extra cost option.

    These days z/Series has the optional Enhanced-Sort Facility added in
    September 2020 which adds a complex SORT LISTS instruction that does
    in-memory sorts or merges, presumably in microcode, so it can run at full
    memory speed.


    As soon as a sort is large enough that it needs to start in physical RAM
    and also end up there, the best you can do is to minimize the number of
    cache and TLB misses: The actual branch misses probably don't matter?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Mon Feb 16 20:55:06 2026
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    These days z/Series has the optional Enhanced-Sort Facility added in
    September 2020 which adds a complex SORT LISTS instruction that does
    in-memory sorts or merges, presumably in microcode, so it can run at full
    memory speed.

    If the microarchitecture is worth anything, microcode does not run any
    faster than architectural code, when doing the same operations.

    One possible reason for the additional instruction is that there is a
    hardware feature that provides a speedup; in that case the hardware
    feature, not the microcode, is the reason for the speedup. Is there
    such a hardware feature for SORT LISTS, and if so, what is it?

    It says here it's a combination of hardware and microcode, but the
    references are all broken or paywalled. It also says that the instruction
    can run for quite a while, so it might be that the microcode can run the
    memory at full speed better than ordinary instructions can.

    https://blog.share.org/Technology-Article/peeking-under-the-hood-of-sort-acceleration-on-z15

    This isn't their first sort speedup. A while ago they added instructions that do
    the inner loop of heapsort. Not sure whether there's hardware for that.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Feb 16 22:16:11 2026
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    These days z/Series has the optional Enhanced-Sort Facility added in
    September 2020 which adds a complex SORT LISTS instruction that does
    in-memory sorts or merges, presumably in microcode, so it can run at full
    memory speed.

    If the microarchitecture is worth anything, microcode does not run any
    faster than architectural code, when doing the same operations.

    One possible reason for the additional instruction is that there is a
    hardware feature that provides a speedup; in that case the hardware
    feature, not the microcode, is the reason for the speedup. Is there
    such a hardware feature for SORT LISTS, and if so, what is it?

    It says here it's a combination of hardware and microcode, but the
    references are all broken or paywalled. It also says that the instruction
    can run for quite a while, so it might be that the microcode can run the
    memory at full speed better than ordinary instructions can.

    Sounds like a co-processor or an attached-processor using µcode for
    communication {setup and tear down}. Basically, the core CPU ships a
    request to Sorter(tm); the sorter does its job and gets back at lower
    latency than an interrupt. Core caches will contain the post-sorted data.

    https://blog.share.org/Technology-Article/peeking-under-the-hood-of-sort-acceleration-on-z15

    This isn't their first sort speedup. A while ago they added instructions that do
    the inner loop of heapsort. Not sure whether there's hardware for that.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Tue Feb 17 01:19:58 2026
    From Newsgroup: comp.arch

    According to John Levine <johnl@taugh.com>:
    One possible reason for the additional instruction is that there is a
    hardware feature that provides a speedup; in that case the hardware
    feature, not the microcode, is the reason for the speedup. Is there
    such a hardware feature for SORT LISTS, and if so, what is it?

    I found a 2020 IBM JR&D article that describes it in some detail.
    It's some fairly simple hardware that keeps pointers to the keys
    in fast SRAM so it can do what IBM calls a looser tree merge sort
    and everyone else seems to call a loser tree merge sort.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Thu Feb 12 20:24:00 2026
    From Newsgroup: comp.arch

    In article <10ml9ef$264n$1@gal.iecc.com>, johnl@taugh.com (John Levine)
    wrote:

    I suspect that the read backward feature on tape drives was
    invented to speed up sorting, when someone realized that if
    you'd written a set of tapes, you could read them backward
    and merge into the next set of tapes in reverse order, repeat
    until done.

    Emerson W Pugh's _IBM's 360 & Early 370 Systems_ says that was the reason. <https://www.amazon.co.uk/dp/0262517205>

    John
    --- Synchronet 3.21d-Linux NewsLink 1.2