• A typical non-loop use case for SIMD

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Dec 26 21:57:01 2025
    From Newsgroup: comp.arch

    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    You load the values into two 512-bit = 64-byte registers A and B, into positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.

    You do a SIMD 8-bit "min" operation. The target register C contains
    the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>), ..., min(A<504:511>,B<504:511>). C need not be distinct from A or B.
    You then have 64 values.

    You then move the upper values of C into register D (which need not
    be distinct from A or B), giving you D<0:7> = min(A<256:263>,B<256:263>),
    etc.

    You then do the min operation with half the length of the registers,
    giving you 32 values.

    And so on, until you have the single value, which is reached in
    seven steps.
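
    In C with x86 intrinsics, that ladder looks roughly like the sketch
    below (my illustration, not part of the original post; it assumes
    unsigned bytes and AVX-512BW plus AVX2/SSE2):

        #include <immintrin.h>
        #include <stdint.h>

        /* minimum of 128 consecutive unsigned bytes, in seven min steps */
        uint8_t min128(const uint8_t *p)
        {
            __m512i a = _mm512_loadu_si512(p);        /* bytes 0..63   */
            __m512i b = _mm512_loadu_si512(p + 64);   /* bytes 64..127 */
            __m512i c = _mm512_min_epu8(a, b);        /* 64 candidates */

            __m256i m256 = _mm256_min_epu8(           /* 64 -> 32      */
                _mm512_extracti64x4_epi64(c, 0),
                _mm512_extracti64x4_epi64(c, 1));
            __m128i m = _mm_min_epu8(                 /* 32 -> 16      */
                _mm256_extracti128_si256(m256, 0),
                _mm256_extracti128_si256(m256, 1));

            m = _mm_min_epu8(m, _mm_srli_si128(m, 8));  /* 16 -> 8     */
            m = _mm_min_epu8(m, _mm_srli_si128(m, 4));  /*  8 -> 4     */
            m = _mm_min_epu8(m, _mm_srli_si128(m, 2));  /*  4 -> 2     */
            m = _mm_min_epu8(m, _mm_srli_si128(m, 1));  /*  2 -> 1     */
            return (uint8_t)_mm_cvtsi128_si32(m);
        }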

    This kind of thing is, AFAIK, used in high-performance code such
    as JSON parsers or regexp matchers. An example (in principle)
    can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .

    Just wondering... is this something that VVM or similar could
    also do? Or does this actually require SIMD and the necessary
    shift, rotate or permute instructions?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 26 22:11:24 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    I use the term "treeification" for this. It is useful in both SW and
    HW applications.

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    You load the values into two 512-bit = 64-byte registers A and B, into positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.

    Since B is A+512 and 512 = ½ of the size, it is akin to how Cooley
    and Tukey turned DFT into FFT--AND THIS is the key to many large-scale reduction processes. It also happens to be the way to vectorize FFTs
    for even faster performance on machines with vector capabilities.

    You do a SIMD 8-bit "min" operation. The target register C contains
    the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>), ..., min(A<504:511>,B<504:511>). C need not be distinct from A or B.
    You then have 64 values.

    You then move the upper values of C into register D (which need not
    be distinct from A or B), giving you D<0:7> = min(A<256:263>,B<256:263>),
    etc.

    You then do the min operation with half the length of the registers,
    giving you 32 values.

    And so on, until you have the single value, which is reached in
    seven steps.

    This kind of thing is, AFAIK, used in high-performance code such
    as JSON parsers or regexp matchers. An example (in principle)
    can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .

    Just wondering... is this something that VVM or similar could
    also do? Or does this actually require SIMD and the necessary
    shift, rotate or permute instructions?

    I will get back to you on this.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Dec 26 17:59:21 2025
    From Newsgroup: comp.arch

    Just wondering... is this something that VVM or similar could
    also do?

    Assuming we start from code like

        while (beg < end) {
            char *a = beg;
            char *b = beg + (end - beg) / 2;
            end = b;
            while (a < b) {
                *a = min (*a, *b);
                a++; b++;
            }
        }

    the VVM loop would be the inner `while`. It seems not fundamentally
    more difficult than recognizing the usual independent "per byte"
    operations and having VVM turn them into "as wide as you can"
    processing, and IIUC Mitch has claimed that it can do that at least for
    some cases, so I guess here as well.

    This said, AFAIU, the first iteration of VVM is "no faster" than without
    VVM, so once we get to the point where beg...end fits into just two
    registers (i.e. the whole inner `while` loop is performed by a single
    SIMD instruction), I think VVM will find it difficult to be competitive.
    And when beg...end is larger than two registers, we wouldn't use a tree
    but a simple linear loop applying a "SIMD min" operation on SIMD-sized
    entities (and then finish with the tree-shaped in-register reduction).

    So, I guess for VVM to be competitive it would need a bit more magic
    than what I remember seeing mentioned in this newsgroup. E.g.
    starting with simpler code like:

    char minchar = 255;
    while (beg < end)
        minchar = min (minchar, *beg++);

    The CPU (i.e. VVM) could in theory recognize the pattern as a reduction
    and `min` as associative and do the "linear large steps followed by
    a tree in the end" for you (while still being able to recover the
    precise state in case of an exception in the middle). This seems wildly
    fancy to me, but then again the kind of "OoO processing with 0-cycle
    jumps, hundreds of instructions in flight, instruction fusion, etc" we
    see in everyday CPUs used to be in the "batshit complexity" category
    some years ago.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Dec 26 18:35:21 2025
    From Newsgroup: comp.arch

    On 2025-12-26 4:57 p.m., Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    You load the values into two 512-bit = 64-byte registers A and B, into positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.

    You do a SIMD 8-bit "min" operation. The target register C contains
    the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>), ..., min(A<504:511>,B<504:511>). C need not be distinct from A or B.
    You then have 64 values.

    You then move the upper values of C into register D (which need not
    be distinct from A or B), giving you D<0:7> = min(A<256:263>,B<256:263>),
    etc.

    You then do the min operation with half the length of the registers,
    giving you 32 values.

    And so on, until you have the single value, which is reached in
    seven steps.

    This kind of thing is, AFAIK, used in high-performance code such
    as JSON parsers or regexp matchers. An example (in principle)
    can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .

    Just wondering... is this something that VVM or similar could
    also do? Or does this actually require SIMD and the necessary
    shift, rotate or permute instructions?

    RISC-V IIRC has several reduction operations, including a min that finds
    the minimum of all the values in a vector register, so I think it does not
    need a tree. It should only take two steps. I wonder if other
    architectures have the same thing.
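
    For illustration, a sketch with the RISC-V vector C intrinsics
    (intrinsic names as I recall them from the riscv-c-intrinsics spec,
    so verify against a current toolchain; it also assumes VLEN >= 128
    so that all 128 bytes fit in one e8/m8 register group):

        #include <riscv_vector.h>
        #include <stdint.h>

        uint8_t min128(const uint8_t *p)
        {
            size_t vl = __riscv_vsetvl_e8m8(128);              /* vl = 128     */
            vuint8m8_t v = __riscv_vle8_v_u8m8(p, vl);         /* step 1: load */
            vuint8m1_t s = __riscv_vmv_s_x_u8m1(UINT8_MAX, 1); /* scalar seed  */
            vuint8m1_t m = __riscv_vredminu_vs_u8m8_u8m1(v, s, vl); /* step 2  */
            return __riscv_vmv_x_s_u8m1_u8(m);                 /* element 0    */
        }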



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Dec 26 17:45:18 2025
    From Newsgroup: comp.arch

    On 12/26/2025 3:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).


    For me, the dominant use cases for SIMD are often one of:
    3D or 4D XYZ or XYZW vectors;
    RGB or RGBA vectors;
    Quaternions;
    As a row or column for a 4x4 matrix;
    ...

    Which are also:
    Not vectorizing a loop;
    Very neatly suited to 4-element vectors.


    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    You load the values into two 512-bit = 64-byte registers A and B, into positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.

    You do a SIMD 8-bit "min" operation. The target register C contains
    the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>), ..., min(A<504:511>,B<504:511>). C need not be distinct from A or B.
    You then have 64 values.

    You then move the upper values of C into register D (which need not
    be distinct from A or B), giving you D<0:7> = min(A<256:263>,B<256:263>),
    etc.

    You then do the min operation with half the length of the registers,
    giving you 32 values.

    And so on, until you have the single value, which is reached in
    seven steps.

    This kind of thing is, AFAIK, used in high-performance code such
    as JSON parsers or regexp matchers. An example (in principle)
    can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .

    Just wondering... is this something that VVM or similar could
    also do? Or does this actually require SIMD and the necessary
    shift, rotate or permute instructions?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Dec 27 07:46:33 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    RISC-V IIRC has several reduction operations, including a min that finds the minimum of all the values in a vector register

    That's very appropriate for a vector extension that allows different
    SIMD widths, such as the RISC-V vector extension and ARM SVE. It also
    reduces the amount of code at the end of reduction loops and may
    reduce the latency (depending on the implementation).

    so I think it does not
    need a tree.

    A reduction loop may have to perform several independent reduction
    recurrences in parallel to have a good utilization of the SIMD units.
    E.g., if the microarchitecture has an FP addition latency of 3 cycles
    and can perform 2 SIMD-width FP additions per cycle (and has enough
    other resources, e.g., 2 SIMD-width loads per cycle), 3*2=6
    parallel strands of reduction operations are necessary. And then
    in the end you combine these 6 strands with a tree to minimize the
    latency, and then you can combine the SIMD result with a reduction
    instruction.
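
    A minimal sketch of such a strand arrangement (GNU C vector
    extensions; the v8df type and the fragment's variables are my
    assumptions, not measured code):

        typedef double v8df __attribute__ ((vector_size (64)));

        /* six independent recurrences hide a 3-cycle FP-add latency
           on a machine with two 8-wide FP adders */
        v8df r0={0}, r1={0}, r2={0}, r3={0}, r4={0}, r5={0};
        for (i = 0; i + 6 <= n; i += 6) {
            r0 += a[i+0]; r1 += a[i+1]; r2 += a[i+2];
            r3 += a[i+3]; r4 += a[i+4]; r5 += a[i+5];
        }
        r0 += r1; r2 += r3; r4 += r5;   /* combine strands with a tree */
        r0 += r2; r0 += r4;
        /* then reduce the SIMD register r0 to a scalar */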

    The approach outlined above makes the code somewhat microarchitecture-dependent; catering for more parallel strands than
    necessary does not hurt much, mainly in needing more registers, but if
    you have too few strands, the result will be slower than the CPU is
    capable of.

    An alternative is to use a tree reduction (to SIMD width) inside the
    loop at every step. E.g., for an FP addition reduction of 8 SIMD
    widths per loop iteration:

    /* all r variables and array elements are SIMD width,
       see GNU C vector extensions */
    r0 = 0;
    for (...) {
        r1 = a[i+0];
        r1 += a[i+1];
        r2 = a[i+2];
        r2 += a[i+3];
        r1 += r2;
        r2 = a[i+4];
        r2 += a[i+5];
        r3 = a[i+6];
        r3 += a[i+7];
        r2 += r3;
        r1 += r2;
        r0 += r1; /* the only recurrence in this loop */
    }
    ... /* now reduce r0 to a scalar */

    Note that fewer register names are needed than for an 8-strand
    reduction loop that offers the same amount of instruction-level
    parallelism (ILP).

    You can easily increase this to 32 SIMD widths per iteration, which
    should be good enough for CPUs in the next decade or two.
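
    (To complete the fragment: with a v8df vector type as in the earlier
    sketch, the final reduction of r0 to a scalar could itself be a small
    tree, using GNU C's vector subscripting -- my illustration:)

        double s = ((r0[0] + r0[1]) + (r0[2] + r0[3]))
                 + ((r0[4] + r0[5]) + (r0[6] + r0[7]));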

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 27 21:07:10 2025
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    Thomas Koenig <tkoenig@netcologne.de> posted:

    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    I use the term "treeification" for this. It is useful in both SW and
    HW applications.

    Without the logarithmic attempts, the general word used to describe
    these things is "reduction".

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    You load the values into two 512-bit = 64-byte registers A and B, into positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.

    Since B is A+512 and 512 = ½ of the size, it is akin to how Cooley
    and Tukey turned DFT into FFT--AND THIS is the key to many large-scale reduction processes. It also happens to be the way to vectorize FFTs
    for even faster performance on machines with vector capabilities.

    You do a SIMD 8-bit "min" operation. The target register C contains
    the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>), ..., min(A<504:511>,B<504:511>). C need not be distinct from A or B.
    You then have 64 values.

    You then move the upper values of C into register D (which need not
    be distinct from A or B), giving you D<0:7> = min(A<256:263>,B<256:263>),
    etc.

    You then do the min operation with half the length of the registers,
    giving you 32 values.

    And so on, until you have the single value, which is reached in
    seven steps.

    This kind of thing is, AFAIK, used in high-performance code such
    as JSON parsers or regexp matchers. An example (in principle)
    can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .

    Just wondering... is this something that VVM or similar could
    also do? Or does this actually require SIMD and the necessary
    shift, rotate or permute instructions?

    I will get back to you on this.

    After considerable thought::

    VVM has the ability to choose execution width (based on HW resources
    and based on data recurrence). In the past I have given examples
    where VVM is executing at "width" and then because of a memory
    "alias" has to drop back to 1-wide until the pointers cross before
    reverting back to full width.

    This algorithm (reduction) is a modification to dynamic width control,
    where width is constant until the final K iterations and then decreases
    by ½ each iteration thereafter. So, fundamentally, VVM does not have
    a problem with reductions "expressed right".

    However, the given problem of 512 bits (64 bytes) might not find much,
    if any, speedup, due to initialization and a potential stutter step
    on each DIV-2 iteration.

    It might be better to allow the HW to recognize that some instructions have
    certain properties and integrate those into VVM recognition so
    that VVM performs a wide calculation; roughly akin to the
    following:

    for(...)
        local_minimum[ k>>3 ] = MIN( a[k,k+7] ); k+=8;
    for(...)
        global_minimum = MIN( local_minimum[i,i+7] ); i+=8;
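
    (In plain scalar C, I read that pseudocode roughly as the following,
    with N and the element type being my assumptions, and the second
    level collapsed into a linear pass for brevity:)

        uint8_t local_minimum[N / 8];
        for (int k = 0; k < N; k += 8) {         /* per-8 local minima */
            uint8_t m = a[k];
            for (int j = 1; j < 8; j++)
                if (a[k + j] < m) m = a[k + j];
            local_minimum[k >> 3] = m;
        }
        uint8_t global_minimum = local_minimum[0];
        for (int i = 0; i < N / 8; i++)          /* global pass */
            if (local_minimum[i] < global_minimum)
                global_minimum = local_minimum[i];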

    For sizes as small as 512-bits, VVM might not have an advantage. On the
    other hand, if HW knew certain things about some instructions, the top
    loop might be performed simultaneously with the bottom loop--more or
    less like having an adder that performs {8×u64, 16×u32, 32×u16, 64×u8} calculations simultaneously in reduction form {Exact for integer,
    Single rounding for 2^n FPs} and this wide adder feeds the second
    calculation 1 K×u reduction per cycle in a single merged loop.

    Needs more thought. {Known problem}
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Dec 28 10:24:43 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Single rounding for 2^n FPs

    Given that usual FP arithmetic does not satisfy the associative law,
    one cannot transform easily-written reductions like

        double r = 0.0;
        for (...)
            r += a[i];

    into any more efficient form (e.g., one with several computation
    strands or one that uses a complete tree evaluation). Compilers use
    the allowance given by -ffast-math to reassociate the computations and
    to vectorize such code. Of course, if you implement FP operations in
    general without inexact results, your FP operations are associative,
    but I fail to see how that can be achieved in general.

    In the above, the programmer could use "accumulator" instead of
    "double" for the type of r, indicating to the compiler that all the
    additions are exact, and only when r is assigned to a double does the
    rounding happen. The compiler would compile the + to an
    add-to-accumulator instruction, and the VVM hardware, upon seeing such instructions, would know that they follow the associative law and can
    be reassociated.

    However, my impression has been that you only intend to have one
    physical accumulator, which results in the questions of how several
    array elements per cycle could be added to this accumulator. Also, if
    several strands of recurrences are appropriate, how is that done?

    Also, what happens if the loop performs two reductions that would need
    the accumulator?

    Finally, most code will not be my66000-specific and will use the type
    "double" in code as above; ok, this code will be compiled with
    -ffast-math if the programmer intends it to be vectorized, but how is
    that compiler flag encoded in the machine code such that VVM knows
    that it can reassociate these additions?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 28 14:19:53 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    [...]

    Without the logarithmic attempts, the general word used to describe
    these things is "reduction".

    Ah, yes, of course.

    [...]


    VVM has the ability to choose execution width (based on HW resources
    and based on data recurrence). In the past I have given examples
    where VVM is executing at "width" and then because of a memory
    "alias" has to drop back to 1-wide until the pointers cross before
    reverting back to full width.

    This algorithm (reduction) is a modification to dynamic width control,
    where width is constant until the final K iterations and then decreases
    by ½ each iteration thereafter. So, fundamentally, VVM does not have
    a problem with reductions "expressed right".

    How would this be expressed? As straightforward serialized code in
    a VVM loop? This would suffer from needing memory accesses to
    store and reload intermediate results.

    [...]

    However, the given problem of 512 bits (64 bytes) might not find much,
    if any, speedup, due to initialization and a potential stutter step
    on each DIV-2 iteration.


    It might be better to allow the HW to recognize that some instructions have
    certain properties and integrate those into VVM recognition so
    that VVM performs a wide calculation; roughly akin to the
    following:

    for(...)
        local_minimum[ k>>3 ] = MIN( a[k,k+7] ); k+=8;
    for(...)
        global_minimum = MIN( local_minimum[i,i+7] ); i+=8;

    An issue here could be the "as if" serial nature of VVM, I think -
    if an interrupt occurs, you would have to restore the state.

    For sizes as small as 512-bits, VVM might not have an advantage.

    Do you mean over SIMD or straightforward serial code? It would be
    good if VVM came within shouting distance of SIMD.

    On the
    other hand, if HW knew certain things about some instructions, the top
    loop might be performed simultaneously with the bottom loop--more or
    less like having an adder that performs {8×u64, 16×u32, 32×u16, 64×u8} calculations simultaneously in reduction form {Exact for integer,
    Single rounding for 2^n FPs} and this wide adder feeds the second
    calculation 1 K×u reduction per cycle in a single merged loop.

    Needs more thought. {Known problem}

    I would guess so...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 17:48:02 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Single rounding for 2^n FPs

    Given that usual FP arithmetics does not satisfy the associative law,
    one cannot transform easily-written reductions like

    double r = 0.0;
    for (...)
        r += a[i];

    Anton, I am surprised you have not heard of Quires!!! where the
    accumulator has the full exponent and fraction width. Posits have
    it, and I have it on authority that IEEE 754 is doing something along
    these lines for next time.

    Given a Quire, one can accumulate any number of terms and get the
    right answer from a single rounding of all terms. Accumulations
    into a quire DO satisfy associativity.

    But you are right, given IEEE 754 AS IS you cannot. And this is the
    problem.
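
    To make the idea concrete, here is a toy software quire for doubles
    (a sketch of my own, not Mitch's hardware; it uses the GCC/Clang
    __int128 extension, and NaN/Inf handling plus the final rounding
    back to double are omitted):

        #include <math.h>
        #include <stdint.h>

        /* one wide two's-complement fixed-point accumulator; LSB weight
           2^-1126 makes every finite double representable, and
           36*64 = 2304 bits leaves ample carry headroom */
        #define LIMBS 36
        typedef struct { uint64_t l[LIMBS]; } quire;

        /* add a signed 128-bit chunk at limb index 'limb',
           sign-extended upward, with carry propagation */
        static void add_chunk(quire *q, int limb, __int128 v)
        {
            uint64_t part[2] = { (uint64_t)v,
                                 (uint64_t)((unsigned __int128)v >> 64) };
            uint64_t ext = (v < 0) ? ~(uint64_t)0 : 0;
            unsigned carry = 0;
            for (int i = limb; i < LIMBS; i++) {
                uint64_t add = (i - limb < 2) ? part[i - limb] : ext;
                uint64_t t = q->l[i] + add;
                uint64_t r = t + carry;
                carry = (t < add) | (r < t);
                q->l[i] = r;
            }
        }

        void quire_add(quire *q, double x)
        {
            int e;
            if (x == 0.0 || !isfinite(x)) return;    /* sketch: skip specials */
            double m = frexp(x, &e);                 /* x = m * 2^e           */
            __int128 frac = (long long)ldexp(m, 53); /* signed 53-bit integer */
            int shift = (e - 53) + 1126;             /* bit position of LSB   */
            /* shift done on the unsigned representation to avoid UB */
            add_chunk(q, shift / 64,
                      (__int128)((unsigned __int128)frac << (shift % 64)));
        }

    Every quire_add is exact, so the accumulation really is associative
    and commutative: any summation order gives bit-identical contents.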

    into any more efficient form (e.g., one with several computation
    strands or one that uses a complete tree evaluation). Compilers use
    the allowance given by -ffast-math to reassociate the computations and
    to vectorize such code. Of course, if you implement FP operations in
    general without inexact results, your FP operations are associative,
    but I fail to see how that can be achieved in general.

    Also, given a Quire, the compiler becomes FREE to reorder arithmetic
    terms in a reduction.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Dec 28 18:38:48 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Anton, I am surprised you have not heard of Quires !!! where the
    accumulator is the full exponent and fraction width.
    ...
    Also, given a Quire, the compiler becomes FREE to reorder arithmetic
    terms in a reduction.

    Am I surprised that you did not read all of what I wrote?
    Unfortunately not.

    Anyway, maybe this time you will read it, and I will spell it out more
    clearly:

    With two 8-wide SIMD DP FP adders and three cycles of latency, you can
    use six strands of SIMD addition (48 strands of scalar addition) to
    make full use of the SIMD units.

    How many accumulators or (if you really need to use that term) quires
    do you have in your architecture and in your microarchitecture; how
    many DP FP additions to one accumulator/quire can be started per
    cycle? And how long does the next bunch of additions then have to
    wait before being started?

    If the end result cannot compete with SIMD units on the machines where
    the programs run, I fear that the accumulator/quire will stay a
    feature for a certain niche of users, while another group of users
    will ignore it for performance reasons.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 19:01:58 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Anton, I am surprised you have not heard of Quires!!! where the
    accumulator has the full exponent and fraction width.
    ...
    Also, given a Quire, the compiler becomes FREE to reorder arithmetic
    terms in a reduction.

    Am I surprised that you did not read all of what I wrote?
    Unfortunately not.

    Anyway, maybe this time you will read it, and I will spell it out more clearly:

    With two 8-wide SIMD DP FP adders and three cycles of latency, you can
    use six strands of SIMD addition (48 strands of scalar addition) to
    make full use of the SIMD units.

    How many accumulators or (if you really need to use that term) quires
    do you have in your architecture and in your microarchitecture; how
    many DP FP additions to one accumulator/quire can be started per
    cycle?

    In a non-CPU design I am working on (AI) we start 32 FMAs (F32) per cycle
    that get reduced into a quire every cycle. 32 FMAs per cycle is the
    BW of the HBMs we are going to use. So, every 24 cycles we compute
    768 FMAs into a single result. 1024 bits from DRAM per cycle at 4 GHz (0.5 TB/s).

    Note: this is not a "programmable unit"; it is more like a fixed GPU FU
    dedicated to an AI HNSW workload.

    And how long does the next bunch of additions then have to
    wait before being started?

    1-cycle after the final reduction begins, you can start a new FADD or FACC.

    If the end result cannot compete with SIMD units on the machines where
    the programs run, I fear that the accumulator/quire will stay a
    feature for a certain niche of users, while another group of users
    will ignore it for performance reasons.

    I will let you reason that one out for yourself.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Dec 28 22:12:30 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Anton, I am surprised you have not heard of Quires !!! where the
    accumulator is the full exponent and fraction width.
    ...
    Also, given a Quire, the compiler becomes FREE to reorder arithmetic
    terms in a reduction.

    Am I surprised that you did not read all of what I wrote?
    Unfortunately not.

    Anyway, maybe this time you will read it, and I will spell it out more clearly:

    With two 8-wide SIMD DP FP adders and three cycles of latency, you can
    use six strands of SIMD addition (48 strands of scalar addition) to
    make full use of the SIMD units.

    How many accumulators or (if you really need to use that term) quires
    do you have in your architecture and in your microarchitecture; how
    many DP FP additions to one accumulator/quire can be started per
    cycle? And how long does the next bunch of additions then have to
    wait before being started?

    If the end result cannot compete with SIMD units on the machines where
    the programs run, I fear that the accumulator/quire will stay a
    feature for a certain niche of users, while another group of users
    will ignore it for performance reasons.

    Mitch has already returned with a unit which can do 32 float aggregations/cycle, so obviously quite fast.

    If I had to design such a beast, I would be very tempted to use a
    carry-save format, i.e., each of the ~1080 bits in the accumulator is
    stored as a bit value plus a possible carry from the bit below.

    This way each double or float value to be aggregated just needs to be
    aligned (per exponent), then a per-bit full adder would require about 5
    gates and just 2 gate delays?

    This should be sufficient to handle at least 6-8 new values per cycle,
    right?
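
    In C, that 3:2 compressor on 64-bit slices of the accumulator would
    look something like this (my sketch; a real quire must forward the
    bit shifted out at each slice boundary into the next slice):

        #include <stdint.h>

        /* carry-save accumulate: acc_s/acc_c are the sum and carry
           bit-planes of one 64-bit slice; invariant: the slice's
           value equals *acc_s + *acc_c */
        static inline void csa_accumulate(uint64_t *acc_s, uint64_t *acc_c,
                                          uint64_t x)
        {
            uint64_t s = *acc_s ^ *acc_c ^ x;    /* per-bit sum */
            uint64_t c = (*acc_s & *acc_c) | (*acc_s & x) | (*acc_c & x);
            *acc_s = s;
            *acc_c = c << 1;                     /* carries move up one bit */
        }

    No carry ever travels more than one bit position per step, which is
    what makes several accumulations per cycle plausible.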

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Dec 28 18:51:31 2025
    From Newsgroup: comp.arch

    On 12/26/2025 1:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    Thomas, this is an excellent "test case" as it brings out at least two
    issues. There has been discussion in this thread about the "reduction"
    problem. Let me start on the other problem, which I call ALU
    underutilization. It is caused by requiring lots of simple operations
    on small data elements. For this example, I assume a four-wide My 66000.

    Let's look at just the first pass. I think the simplest coding would
    have the VVM loop consisting of two load instructions, two add
    instructions to increment the addresses, and a min instruction. Letting
    VVM do its magic, this would generate 4 byte-min operations at a time
    (one per ALU), and thus the loop would be executed 64/4 = 16 times. I
    don't know how your hypothetical SIMD machine would do this, but it
    might do all 64 min operations in a single operation, or perhaps 2.
    This puts VVM at a substantial performance disadvantage.

    I have a possible suggestion to help this. I don't claim it is the best solution.

    The problem stems from using only 8 bits of the 64-bit integer ALU for
    each operation, leading to more operations. So one possible solution
    would be to add a new instruction modifier that tells the system that
    any relevant operations under its mask will do a whole register's worth
    of operations using the size already specified in the operation.
    Since the min instruction would already have specified bytes, with the
    modification the instruction would do 8 byte-min operations at once,
    thus reducing the loop count by a factor of 8. Of course, this
    generalizes to half words and words as well, and to similar "simple"
    instructions such as add/subtract, etc. Note that this already "fits"
    in the existing 64-bit ALUs, with the addition of a little logic to
    suppress carries, etc., to allow the simultaneous use of all the ALU bits.
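
    For reference, carry clipping is also the classic SWAR trick in
    software; e.g. a bytewise subtract with borrows suppressed at byte
    boundaries can be done in one 64-bit operation (a standard identity,
    shown here only as an illustration of the principle):

        #include <stdint.h>

        /* z = x - y in each byte, no borrow crossing byte boundaries:
           the top bit of every byte is forced to 1 so each byte's
           borrow is absorbed locally, then the true top bit is
           patched back in with XOR */
        static inline uint64_t sub_bytes(uint64_t x, uint64_t y)
        {
            const uint64_t H = 0x8080808080808080ULL;
            return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H);
        }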

    Comments?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 29 19:59:38 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/26/2025 1:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    Thomas, this is an excellent "test case" as it brings out at least two
    issues. There has been discussion in this thread about the "reduction"
    problem. Let me start on the other problem, which I call ALU
    underutilization. It is caused by requiring lots of simple operations
    on small data elements. For this example, I assume a four-wide My 66000.

    Let's look at just the first pass. I think the simplest coding would
    have the VVM loop consisting of two load instructions, two add
    instructions to increment the addresses, and a min instruction. Letting
    VVM do its magic, this would generate 4 byte-min operations at a time
    (one per ALU), and thus the loop would be executed 64/4 = 16 times. I
    don't know how your hypothetical SIMD machine would do this, but it
    might do all 64 min operations in a single operation, or perhaps 2.
    This puts VVM at a substantial performance disadvantage.

    I have a possible suggestion to help this. I don't claim it is the best solution.

    The problem stems from using only 8 bits of the 64-bit integer ALU for
    each operation, leading to more operations. So one possible solution
    would be to add a new instruction modifier that tells the system that
    any relevant operations under its mask will do a whole register's worth
    of operations using the size already specified in the operation.

    This is exactly what VVM does, BTW. Smaller-than-register widths are
    SIMDed into single "units of work" up to register width and performed
    with the carry-chains clipped.

    Since the min instruction would already have specified bytes,

    It is the memory instruction that specifies data width.

    with the modification the instruction would do 8 byte-min operations at once,
    thus reducing the loop count by a factor of 8.

    Yes, exactly.

    Of course, this
    generalizes to half words and words as well, and to similar "simple" instructions such as add/subtract, etc.

    All that is modified is where the carry-chains get clipped.

    Note that this already "fits"
    in the existing 64 bit ALUs, with the addition of a little logic to
    suppress carries, etc. to allow the simultaneous use of all the ALU bits.

    Comments?



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Dec 29 12:14:26 2025
    From Newsgroup: comp.arch

    On 12/29/2025 11:59 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/26/2025 1:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    Thomas, this is an excellent "test case" as it brings out at least two
    issues. There has been discussion in this thread about the "reduction"
    problem. Let me start on the other problem, which I call ALU
    underutilization. It is caused by requiring lots of simple operations
    on small data elements. For this example, I assume a four-wide My 66000.

    Let's look at just the first pass. I think the simplest coding would
    have the VVM loop consisting of two load instructions, two add
    instructions to increment the addresses, and a min instruction. Letting
    VVM do its magic, this would generate 4 byte-min operations at a time
    (one per ALU), and thus the loop would be executed 64/4 = 16 times. I
    don't know how your hypothetical SIMD machine would do this, but it
    might do all 64 min operations in a single operation, or perhaps 2.
    This puts VVM at a substantial performance disadvantage.

    I have a possible suggestion to help this. I don't claim it is the best
    solution.

    The problem stems from using only 8 bits of the 64-bit integer ALU for
    each operation, leading to more operations. So one possible solution
    would be to add a new instruction modifier that tells the system that
    any relevant operations under its mask will do a whole register's worth
    of operations using the size already specified in the operation.

    This is exactly what VVM does, BTW. Smaller-than-register widths are
    SIMDed into single "units of work" up to register width and performed
    with the carry-chains clipped.

    Oh, I didn't realize that VVM already did that. Bravo!



    Since the min instruction would already have specified bytes,

    It is the memory instruction that specifies data width.

    I thought with your latest modifications to the ISA that instructions
    like min specified a data width. But using the width specified in the
    memory reference instruction seems fine. I can't think of a useful case
    where the two would be different.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 29 21:17:53 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/29/2025 11:59 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/26/2025 1:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    Thomas, this is an excellent "test case" as it brings out at least two
    issues. There has been discussion in this thread about the "reduction"
    problem. Let me start on the other problem, which I call ALU
    underutilization. It is caused by requiring lots of simple operations
    on small data elements. For this example, I assume a four-wide My 66000.
    Let's look at just the first pass. I think the simplest coding would
    have the VVM loop consisting of two load instructions, two add
    instructions to increment the addresses, and a min instruction. Letting
    VVM do its magic, this would generate 4 byte-min operations at a time
    (one per ALU), and thus the loop would be executed 64/4 = 16 times. I
    don't know how your hypothetical SIMD machine would do this, but it
    might do all 64 min operations in a single operation, or perhaps 2.
    This puts VVM at a substantial performance disadvantage.

    I have a possible suggestion to help this. I don't claim it is the best solution.

    The problem stems from using only 8 bits of the 64-bit integer ALU for
    each operation, leading to more operations. So one possible solution
    would be to add a new instruction modifier that tells the system that
    any relevant operations under its mask will do a whole register's worth
    of operations using the size already specified in the operation.

    This is exactly what VVM does, BTW. Smaller-than-register widths are
    SIMDed into single "units of work" up to register width and performed
    with the carry-chains clipped.

    Oh, I didn't realize that VVM already did that. Bravo!

    It is how the HW works::

    An integer adder is comprised of 8×9-bit sections. You feed 8 data
    bits into each section, and you feed into the 9th bit::
        00 if you want the carry clipped,
        01 if you want the carry propagated,
        11 if you want a carry generated for the next section;

    An adder comprised of 9-bit sections is no more gates of delay than one comprised of 8-bit sections.

    Since the min instruction would already have specified bytes,

    It is the memory instruction that specifies data width.

    I thought with your latest modifications to the ISA that instructions
    like min specified a data width. But using the width specified in the memory reference instruction seems fine. I can't think of a useful case where the two would be different.

    Whereas the current ISA does have size/calculation, I have not had the time
    to go back and examine VVM to the extent necessary to make any statements
    on how VVM would work with these.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Jan 1 09:51:17 2026
    From Newsgroup: comp.arch

    On 12/29/2025 1:17 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/29/2025 11:59 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/26/2025 1:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    Thomas, this is an excellent "test case" as it brings out at least two
    issues. There has been discussion in this thread about the "reduction"
    problem. Let me start on the other problem, which I call ALU
    underutilization. It is caused by requiring lots of simple operations
    on small data elements. For this example, I assume a four-wide My 66000.

    Let's look at just the first pass. I think the simplest coding would
    have the VVM loop consisting of two load instructions, two add
    instructions to increment the addresses, and a min instruction. Letting
    VVM do its magic, this would generate 4 byte-min operations at a time
    (one per ALU), and thus the loop would be executed 64/4 = 16 times. I
    don't know how your hypothetical SIMD machine would do this, but it
    might do all 64 min operations in a single operation, or perhaps 2.
    This puts VVM at a substantial performance disadvantage.

    I have a possible suggestion to help this. I don't claim it is the best
    solution.

    The problem stems from using only 8 bits of the 64-bit integer ALU for
    each operation, leading to more operations. So one possible solution
    would be to add a new instruction modifier that tells the system that
    any relevant operations under its mask will do a whole register's worth
    of operations using the size already specified in the operation.

    This is exactly what VVM does, BTW. Smaller-than-register widths are
    SIMDed into single "units of work" up to register width and performed
    with the carry-chains clipped.

    Oh, I didn't realize that VVM already did that. Bravo!

    It is how the HW works::

    An integer adder is comprised of 8×9-bit sections. You feed 8 data
    bits into each section, and you feed into the 9th bit::
        00 if you want the carry clipped,
        01 if you want the carry propagated,
        11 if you want a carry generated for the next section;

    An adder comprised of 9-bit sections is no more gates of delay than one comprised of 8-bit sections.

    I understand that can be done fairly easily. A minor point: you seem to
    be specifying a two-bit value into a single (the ninth) bit.

    But my astonishment is that the HW, as you say, "SIMD"s eight byte
    operations into a single 64-bit operation. Great for efficiency, but it
    seems to violate the syntax of what was in the source code, which was
    single-byte operations. For Min, that part isn't an issue (but see
    below), but the syntax specified ending up with a single byte. Does the
    hardware "automagically" do some hidden min operations among the eight
    bytes in the register to end up with a single byte?

    Note also that this works well for min (and max), but for add, etc.: if
    the data is unsigned, only if you don't care where in the sequence an
    overflow occurred, and if the data is signed, only if you don't even
    care whether overflow occurred.


    Since the min instruction would already have specified bytes,

    It is the memory instruction that specifies data width.

    I thought with your latest modifications to the ISA that instructions
    like min specified a data width. But using the width specified in the
    memory reference instruction seems fine. I can't think of a useful case
    where the two would be different.

    Whereas the current ISA does have size/calculation, I have not had the time to go back and examine VVM to the extent necessary to make any statements
    on how VVM would work with these.

    Fair enough!
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Jan 1 22:21:36 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Note also that this works well for min (and max), but for add, etc.: if
    the data is unsigned, only if you don't care where in the sequence an
    overflow occurred, and if the data is signed, only if you don't even
    care whether overflow occurred.

    Addition with trapping on overflow or with overflow detection is not associative. Addition modulo 2^n is.

    If you have a sticky overflow bit (Power has something in that
    direction), or if you trap on overflow (MIPS and Alpha have such
    instructions for signed addition), you will certainly notice if an
    overflow occurred for signed additions for the particular evaluation
    used on the machine. If there are evaluations with and without
    overflow, I think that those without are preferable (although probably
    not enough to pay the performance cost for ensuring that).

    Which programming language are you thinking of where reduction uses
    some addition with overflow detection?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Jan 1 22:15:56 2026
    From Newsgroup: comp.arch

    On 1/1/2026 2:21 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Note also that this works well for min (and max), but for add, etc.: if
    the data is unsigned, only if you don't care where in the sequence an
    overflow occurred, and if the data is signed, only if you don't even
    care whether overflow occurred.

    Addition with trapping on overflow or with overflow detection is not associative. Addition modulo 2^n is.

    Yes. You have given a more precise mathematical statement of what I
    said above.


    If you have a sticky overflow bit (Power has something in that
    direction), or if you trap on overflow (MIPS and Alpha have such
    instructions for signed addition), you will certainly notice if an
    overflow occurred for signed additions for the particular evaluation
    used on the machine. If there are evaluations with and without
    overflow, I think that those without are preferable (although probably
    not enough to pay the performance cost for ensuring that).

    Which programming language are you thinking of where reduction uses
    some addition with overflow detection?

    I wasn't thinking of any programming language in particular. I was
    trying to clarify my "astonishment" and work through the details of what
    Mitch said earlier about SIMDifying the calculations in the context of
    VVM. I was trying to understand exactly what the hardware was doing in
    the various cases.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jan 2 09:19:44 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    I was
    trying to clarify my "astonishment" and work through the details of what
    Mitch said earlier about SIMDifying the calculations in the context of
    VVM. I was trying to understand exactly what the hardware was doing in
    the various cases.
    VVM. I was trying to understand exactly what the hardware was doing in
    the various cases.

    I don't know how MY66000 deals with overflow detection. I assume that
    MY66000 has instructions that add modulo 2^n. If the code uses these instructions (as code compiled from typical C, C++, or Java code
    does), this is no obstacle to reassociation and to vectorizing the
    reduction.

    A bigger problem may be that in C (and a number of other languages)
    all shorter integers are implicitly converted to int before any
    operation, and the user is likely to use at least int, but maybe even
    long for the accumulator in the reduction:

    /* int8_t a[n]; */
    long r = 0;
    for (i = 0; i < n; i++)
        r += a[i];

    Here the compiler will compile the a[i] into a sign-extending load
    and the += into a 64-bit addition. This means that any kind of
    auto-vectorization (whether the existing one using SIMD instructions or
    the hardware implementation in VVM) will have to do the sign-extending
    expansion to 64-bit values, reducing the speedup. In the unlikely
    case that the programmers are interested in a result modulo 2^8, they
    can write

    int8_t r = 0;

    instead (and compile with -fwrapv, which I recommend in any case); I
    think that MY66000 has width on computation operations, which I
    usually consider a waste of encoding bits, but in connection with VVM
    it works out: the compiler can optimize the "r += ..." into an
    addition modulo 2^8, and VVM can then use parallel 8-bit additions to
    speed up the loop part of the reduction.

    If somebody writes

    /* int8_t a[n]; */
    int r = INT_MAX;
    for (i = 0; i < n; i++)
        r = min(r, a[i]);

    Then a sufficiently advanced compiler could conclude that, as soon as
    i > 0, the range of r is the range of int8_t, and also use 8-bit min in
    that case.

    One would then need such an advanced compiler to produce an 8-bit min instruction for MY66000, and the VVM could achieve the same speedup.
    But the compiler does not really get simpler.

    Alternatively, the programmer could write

    int8_t r = 0;

    and the compiler would need to be less advanced to vectorize it using
    SIMD instructions, and even less advanced for VVM.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 2 18:52:29 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 1/1/2026 2:21 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Note also that this works well for min (and max), but for add, etc.: if
    the data is unsigned, only if you don't care where in the sequence an
    overflow occurred, and if the data is signed, only if you don't even
    care whether overflow occurred.

    Addition with trapping on overflow or with overflow detection is not associative. Addition modulo 2^n is.

    Yes. You have given a more precise mathematical statement of what I
    said above.

    Just ANOTHER reason the default integer should be UNSIGNED.


    If you have a sticky overflow bit (Power has something in that
    direction), or if you trap on overflow (MIPS and Alpha have such instructions for signed addition), you will certainly notice if an
    overflow occurred for signed additions for the particular evaluation
    used on the machine. If there are evaluations with and without
    overflow, I think that those without are preferable (although probably
    not enough to pay the performance cost for ensuring that).

    Which programming language are you thinking of where reduction uses
    some addition with overflow detection?

    I wasn't thinking of any programming language in particular. I was
    trying to clarify my "astonishment" and work through the details of what Mitch said earlier about SIMDifying the calculations in the context of
    VVM. I was trying to understand exactly what the hardware was doing in
    the various cases.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 2 19:01:29 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    I was
    trying to clarify my "astonishment" and work through the details of what
    Mitch said earlier about SIMDifying the calculations in the context of
    VVM. I was trying to understand exactly what the hardware was doing in
    the various cases.

    I don't know how MY66000 deals with overflow detection.

    Overflow is any significance beyond the signed result container size.

    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when
    2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    I assume that MY66000 has instructions that add modulo 2^n.

    Called, surprisingly, Unsigned.

    If the code uses these instructions (as code compiled from typical C, C++, or Java code
    does), this is no obstacle to reassociation and to vectorizing the
    reduction.

    Unsigned has a lot fewer surprises than Signed.

    A bigger problem may be that in C (and a number of other languages)
    all shorter integers are implicitly converted to int before any
    operation,

    No, unsigned char is converted to unsigned int, not int.
    Up-conversions maintain signedness.

    and the user is likely to use at least int, but maybe even
    long for the accumulator in the reduction:

    ----------------

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jan 2 19:29:41 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when
    2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    I doubt it. If you really care for overflow, just produce a 66-bit
    result, and check if it is in the range [-2^63,2^63).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Jan 2 14:52:02 2026
    From Newsgroup: comp.arch

    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when 2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    How 'bout sign-extending your 64 bits to 65 bits, then doing your ADD3 and
    then making sure the top 2 bits of the result are equal?


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Jan 2 18:56:32 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when 2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    res1 = a + b
    ovr1 = (res1<63> ^ a<63>) & (res1<63> ^ b<63>)
    res2 = res1 + c
    ovr2 = ((res2<63> ^ res1<63>) & (res2<63> ^ c<63>)) | ovr1

    This detects if either (a+b) or (res1+c) overflows but also
    signals overflow if (a+b) overflows but then +c wraps it back in range.
    That is, a = INT_MAX, b = 1, c = -1, then a+b+c == INT_MAX
    but the above would set ovr1 = 1 because a+b overflowed.

    This unnecessary transitory overflow (one interior to an expression)
    can be avoided by sign extending a,b,c to 66 bits, doing the add,
    and checking result<65:63> are all 1 or 0.
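
    In C, that 66-bit variant can be phrased with __int128 as the wide
    type (my illustration of the same check, not EricP's code):

        #include <stdint.h>

        /* sign-extend a, b, c well past 64 bits, add exactly, then
           test whether the sum fits a signed 64-bit container */
        static inline int add3_overflows(int64_t a, int64_t b, int64_t c)
        {
            __int128 wide = (__int128)a + b + c;  /* exact: fits in 66 bits  */
            return wide != (int64_t)wide;         /* outside [-2^63, 2^63)?  */
            /* the narrowing conversion wraps on GCC/Clang */
        }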


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jan 3 01:49:10 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when
    2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    I doubt it. If you really care for overflow, just produce a 66-bit
    result, and check if it is in the range [-2^63,2^63).

    Which is why the spec is written "significance above container size".
    That is: the bits above the container are not all 0s or all 1s (-1).

    Works for + with CARRY, *, 128-bit->64-bit DIV, shifts, ...


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jan 3 01:50:37 2026
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when 2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    How 'bout sign-extending your 64 bits to 65 bits, then doing your ADD3 and
    then making sure the top 2 bits of the result are equal?

    In HW one can simply set up a data-path and check <127..64> = 0 or -1.
    Remember I have CARRY on this data-path.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2