• A typical non-loop use case for SIMD

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Dec 26 21:57:01 2025
    From Newsgroup: comp.arch

    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    You load the values into two 512-bit = 64-byte registers A and B, into positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.

    You do a SIMD 8-bit "min" operation. The target register C contains
    the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>), ..., min(A<504:511>,B<504:511>). C need not be distinct from A or B.
    You then have 64 values.

    You then move the upper values of C into register D (which need not
    be distinct from A or B), giving you D<0:7> = min(A<256:263>,B<256:263>),
    etc.

    You then do the min operation with half the length of the registers,
    giving you 32 values.

    And so on, until you have the single value, which is reached in
    seven steps.
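
    In C with x86 intrinsics, that ladder looks roughly like the sketch
    below (my illustration, not part of the original post; it assumes
    unsigned bytes and AVX-512BW plus AVX2/SSE2):

        #include <immintrin.h>
        #include <stdint.h>

        /* minimum of 128 consecutive unsigned bytes, in seven min steps */
        uint8_t min128(const uint8_t *p)
        {
            __m512i a = _mm512_loadu_si512(p);        /* bytes 0..63   */
            __m512i b = _mm512_loadu_si512(p + 64);   /* bytes 64..127 */
            __m512i c = _mm512_min_epu8(a, b);        /* 64 candidates */

            __m256i m256 = _mm256_min_epu8(           /* 64 -> 32      */
                _mm512_extracti64x4_epi64(c, 0),
                _mm512_extracti64x4_epi64(c, 1));
            __m128i m = _mm_min_epu8(                 /* 32 -> 16      */
                _mm256_extracti128_si256(m256, 0),
                _mm256_extracti128_si256(m256, 1));

            m = _mm_min_epu8(m, _mm_srli_si128(m, 8));  /* 16 -> 8     */
            m = _mm_min_epu8(m, _mm_srli_si128(m, 4));  /*  8 -> 4     */
            m = _mm_min_epu8(m, _mm_srli_si128(m, 2));  /*  4 -> 2     */
            m = _mm_min_epu8(m, _mm_srli_si128(m, 1));  /*  2 -> 1     */
            return (uint8_t)_mm_cvtsi128_si32(m);
        }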

    This kind of thing is, AFAIK, used in high-performance code such
    as JSON parsers or regexp matchers. An example (in principle)
    can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .

    Just wondering... is this something that VVM or similar could
    also do? Or does this actually require SIMD and the necessary
    shift, rotate or permute instructions?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 26 22:11:24 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    I use the term "treeification" for this. It is useful in both SW and
    HW applications.

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    You load the values into two 512-bit = 64-byte registers A and B, into positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.

    Since B is A+512 and 512 = ½ of the size, it is akin to how Cooley
    and Tukey turned DFT into FFT--AND THIS is the key to many large-scale reduction processes. It also happens to be the way to vectorize FFTs
    for even faster performance on machines with vector capabilities.

    You do a SIMD 8-bit "min" operation. The target register C contains
    the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>), ..., min(A<504:511>,B<504:511>). C need not be distinct from A or B.
    You then have 64 values.

    You then move the upper values of C into register D (which need not
    be distinct from A or B), giving you D<0:7> = min(A<256:263>,B<256:263>),
    etc.

    You then do the min operation with half the length of the registers,
    giving you 32 values.

    And so on, until you have the single value, which is reached in
    seven steps.

    This kind of thing is, AFAIK, used in high-performance code such
    as JSON parsers or regexp matchers. An example (in principle)
    can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .

    Just wondering... is this something that VVM or similar could
    also do? Or does this actually require SIMD and the necessary
    shift, rotate or permute instructions?

    I will get back to you on this.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Dec 26 17:59:21 2025
    From Newsgroup: comp.arch

    Just wondering... is this something that VVM or similar could
    also do?

    Assuming we start from code like

        while (beg < end) {
            char *a = beg;
            char *b = beg + (end - beg) / 2;
            end = b;
            while (a < b) {
                *a = min (*a, *b);
                a++; b++;
            }
        }

    the VVM loop would be the inner `while`. It seems not fundamentally
    more difficult than recognizing the usual independent "per byte"
    operations and having VVM turn them into "as wide as you can"
    processing, and IIUC Mitch has claimed that it can do that at least for
    some cases, so I guess here as well.

    This said, AFAIU, the first iteration of VVM is "no faster" than without
    VVM, so once we get to the point where beg...end fits into just two
    registers (i.e. the whole inner `while` loop is performed by a single
    SIMD instruction), I think VVM will find it difficult to be competitive.
    And when beg...end is larger than two registers, we wouldn't use a tree
    but a simple linear loop applying a "SIMD min" operation on SIMD-sized
    entities (and then finish with the tree-shaped in-register reduction).

    So, I guess for VVM to be competitive it would need a bit more magic
    than what I remember seeing mentioned in this newsgroup. E.g.
    starting with simpler code like:

    char minchar = 255;
    while (beg < end)
        minchar = min (minchar, *beg++);

    The CPU (i.e. VVM) could in theory recognize the pattern as a reduction
    and `min` as associative and do the "linear large steps followed by
    a tree in the end" for you (while still being able to recover the
    precise state in case of an exception in the middle). This seems wildly
    fancy to me, but then again the kind of "OoO processing with 0-cycle
    jumps, hundreds of instructions in flight, instruction fusion, etc" we
    see in everyday CPUs used to be in the "batshit complexity" category
    some years ago.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Dec 26 18:35:21 2025
    From Newsgroup: comp.arch

    On 2025-12-26 4:57 p.m., Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    You load the values into two 512-bit = 64-byte registers A and B, into positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.

    You do a SIMD 8-bit "min" operation. The target register C contains
    the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>), ..., min(A<504:511>,B<504:511>). C need not be distinct from A or B.
    You then have 64 values.

    You then move the upper values of C into register D (which need not
    be distinct from A or B), giving you D<0:7> = min(A<256:263>,B<256:263>),
    etc.

    You then do the min operation with half the length of the registers,
    giving you 32 values.

    And so on, until you have the single value, which is reached in
    seven steps.

    This kind of thing is, AFAIK, used in high-performance code such
    as JSON parsers or regexp matchers. An example (in principle)
    can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .

    Just wondering... is this something that VVM or similar could
    also do? Or does this actually require SIMD and the necessary
    shift, rotate or permute instructions?

    RISC-V IIRC has several reduction operations, including a min that finds
    the minimum of all the values in a vector register, so I think it does not
    need a tree. It should only take two steps. I wonder if other
    architectures have the same thing.
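
    For illustration, a sketch with the RISC-V vector C intrinsics
    (intrinsic names as I recall them from the riscv-c-intrinsics spec,
    so verify against a current toolchain; it also assumes VLEN >= 128
    so that all 128 bytes fit in one e8/m8 register group):

        #include <riscv_vector.h>
        #include <stdint.h>

        uint8_t min128(const uint8_t *p)
        {
            size_t vl = __riscv_vsetvl_e8m8(128);              /* vl = 128     */
            vuint8m8_t v = __riscv_vle8_v_u8m8(p, vl);         /* step 1: load */
            vuint8m1_t s = __riscv_vmv_s_x_u8m1(UINT8_MAX, 1); /* scalar seed  */
            vuint8m1_t m = __riscv_vredminu_vs_u8m8_u8m1(v, s, vl); /* step 2  */
            return __riscv_vmv_x_s_u8m1_u8(m);                 /* element 0    */
        }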



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Dec 26 17:45:18 2025
    From Newsgroup: comp.arch

    On 12/26/2025 3:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).


    For me, the dominant use cases for SIMD are often one of:
    3D or 4D XYZ or XYZW vectors;
    RGB or RGBA vectors;
    Quaternions;
    As a row or column for a 4x4 matrix;
    ...

    Which are also:
    Not vectorizing a loop;
    Very neatly suited to 4-element vectors.


    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    You load the values into two 512-bit = 64-byte registers A and B, into positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.

    You do a SIMD 8-bit "min" operation. The target register C contains
    the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>), ..., min(A<504:511>,B<504:511>). C need not be distinct from A or B.
    You then have 64 values.

    You then move the upper values of C into register D (which need not
    be distinct from A or B), giving you D<0:7> = min(A<256:263>,B<256:263>),
    etc.

    You then do the min operation with half the length of the registers,
    giving you 32 values.

    And so on, until you have the single value, which is reached in
    seven steps.

    This kind of thing is, AFAIK, used in high-performance code such
    as JSON parsers or regexp matchers. An example (in principle)
    can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .

    Just wondering... is this something that VVM or similar could
    also do? Or does this actually require SIMD and the necessary
    shift, rotate or permute instructions?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Dec 27 07:46:33 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    RISC-V IIRC has several reduction operations, including a min that finds the minimum of all the values in a vector register

    That's very appropriate for a vector extension that allows different
    SIMD widths, such as the RISC-V vector extension and ARM SVE. It also
    reduces the amount of code at the end of reduction loops and may
    reduce the latency (depending on the implementation).

    so I think it does not
    need a tree.

    A reduction loop may have to perform several independent reduction
    recurrences in parallel to have a good utilization of the SIMD units.
    E.g., if the microarchitecture has an FP addition latency of 3 cycles
    and can perform 2 SIMD-width FP additions per cycle (and has enough
    other resources, e.g., 2 SIMD-width loads per cycle), 3*2=6
    parallel strands of reduction operations are necessary. And then
    in the end you combine these 6 strands with a tree to minimize the
    latency, and then you can combine the SIMD result with a reduction
    instruction.
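
    A minimal sketch of such a strand arrangement (GNU C vector
    extensions; the v8df type and the fragment's variables are my
    assumptions, not measured code):

        typedef double v8df __attribute__ ((vector_size (64)));

        /* six independent recurrences hide a 3-cycle FP-add latency
           on a machine with two 8-wide FP adders */
        v8df r0={0}, r1={0}, r2={0}, r3={0}, r4={0}, r5={0};
        for (i = 0; i + 6 <= n; i += 6) {
            r0 += a[i+0]; r1 += a[i+1]; r2 += a[i+2];
            r3 += a[i+3]; r4 += a[i+4]; r5 += a[i+5];
        }
        r0 += r1; r2 += r3; r4 += r5;   /* combine strands with a tree */
        r0 += r2; r0 += r4;
        /* then reduce the SIMD register r0 to a scalar */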

    The approach outlined above makes the code somewhat microarchitecture-dependent; catering for more parallel strands than
    necessary does not hurt much, mainly in needing more registers, but if
    you have too few strands, the result will be slower than the CPU is
    capable of.

    An alternative is to use a tree reduction (to SIMD width) inside the
    loop at every step. E.g., for an FP addition reduction of 8 SIMD
    widths per loop iteration:

    /* all r variables and array elements are SIMD width,
       see GNU C vector extensions */
    r0 = 0;
    for (...) {
        r1 = a[i+0];
        r1 += a[i+1];
        r2 = a[i+2];
        r2 += a[i+3];
        r1 += r2;
        r2 = a[i+4];
        r2 += a[i+5];
        r3 = a[i+6];
        r3 += a[i+7];
        r2 += r3;
        r1 += r2;
        r0 += r1; /* the only recurrence in this loop */
    }
    ... /* now reduce r0 to a scalar */

    Note that fewer register names are needed than for an 8-strand
    reduction loop that offers the same amount of instruction-level
    parallelism (ILP).

    You can easily increase this to 32 SIMD widths per iteration, which
    should be good enough for CPUs in the next decade or two.
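
    (To complete the fragment: with a v8df vector type as in the earlier
    sketch, the final reduction of r0 to a scalar could itself be a small
    tree, using GNU C's vector subscripting -- my illustration:)

        double s = ((r0[0] + r0[1]) + (r0[2] + r0[3]))
                 + ((r0[4] + r0[5]) + (r0[6] + r0[7]));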

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 27 21:07:10 2025
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    Thomas Koenig <tkoenig@netcologne.de> posted:

    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    I use the term "treeification" for this. It is useful in both SW and
    HW applications.

    Without the logarithmic attempts, the general word used to describe
    these things is "reduction".

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    You load the values into two 512-bit = 64-byte registers A and B, into positions A<0:7>, A<8:15>, ... A<504:511>, and B correspondingly.

    Since B is A+512 and 512 = ½ of the size, it is akin to how Cooley
    and Tukey turned DFT into FFT--AND THIS is the key to many large-scale reduction processes. It also happens to be the way to vectorize FFTs
    for even faster performance on machines with vector capabilities.

    You do a SIMD 8-bit "min" operation. The target register C contains
    the minimum values: min(A<0:7>,B<0:7>), min(A<8:15>,B<8:15>), ..., min(A<504:511>,B<504:511>). C need not be distinct from A or B.
    You then have 64 values.

    You then move the upper values of C into register D (which need not
    be distinct from A or B), giving you D<0:7> = min(A<256:263>,B<256:263>),
    etc.

    You then do the min operation with half the length of the registers,
    giving you 32 values.

    And so on, until you have the single value, which is reached in
    seven steps.

    This kind of thing is, AFAIK, used in high-performance code such
    as JSON parsers or regexp matchers. An example (in principle)
    can be found at https://gitlab.ethz.ch/extra_projects/fastjson/ .

    Just wondering... is this something that VVM or similar could
    also do? Or does this actually require SIMD and the necessary
    shift, rotate or permute instructions?

    I will get back to you on this.

    After considerable thought::

    VVM has the ability to choose execution width (based on HW resources
    and based on data recurrence). In the past I have given examples
    where VVM is executing at "width" and then because of a memory
    "alias" has to drop back to 1-wide until the pointers cross before
    reverting back to full width.

    This algorithm (reduction) is a modification to dynamic width control,
    where width is constant until the final K iterations and then decreases
    by ½ each iteration thereafter. So, fundamentally, VVM does not have
    a problem with reductions "expressed right".

    However, the given problem of 512 bits (64 bytes) might not find much,
    if any, speedup, due to initialization and a potential stutter step
    on each DIV-2 iteration.

    It might be better to allow the HW to recognize that some instructions have
    certain properties and integrate those into VVM recognition so
    that VVM performs a wide calculation; roughly akin to the
    following:

    for(...)
        local_minimum[ k>>3 ] = MIN( a[k,k+7] ); k+=8;
    for(...)
        global_minimum = MIN( local_minimum[i,i+7] ); i+=8;
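
    (In plain scalar C, I read that pseudocode roughly as the following,
    with N and the element type being my assumptions, and the second
    level collapsed into a linear pass for brevity:)

        uint8_t local_minimum[N / 8];
        for (int k = 0; k < N; k += 8) {         /* per-8 local minima */
            uint8_t m = a[k];
            for (int j = 1; j < 8; j++)
                if (a[k + j] < m) m = a[k + j];
            local_minimum[k >> 3] = m;
        }
        uint8_t global_minimum = local_minimum[0];
        for (int i = 0; i < N / 8; i++)          /* global pass */
            if (local_minimum[i] < global_minimum)
                global_minimum = local_minimum[i];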

    For sizes as small as 512-bits, VVM might not have an advantage. On the
    other hand, if HW knew certain things about some instructions, the top
    loop might be performed simultaneously with the bottom loop--more or
    less like having an adder that performs {8×u64, 16×u32, 32×u16, 64×u8} calculations simultaneously in reduction form {Exact for integer,
    Single rounding for 2^n FPs} and this wide adder feeds the second
    calculation 1 K×u reduction per cycle in a single merged loop.

    Needs more thought. {Known problem}
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Dec 28 10:24:43 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Single rounding for 2^n FPs

    Given that usual FP arithmetic does not satisfy the associative law,
    one cannot transform easily-written reductions like

        double r = 0.0;
        for (...)
            r += a[i];

    into any more efficient form (e.g., one with several computation
    strands or one that uses a complete tree evaluation). Compilers use
    the allowance given by -ffast-math to reassociate the computations and
    to vectorize such code. Of course, if you implement FP operations in
    general without inexact results, your FP operations are associative,
    but I fail to see how that can be achieved in general.

    In the above, the programmer could use "accumulator" instead of
    "double" for the type of r, indicating to the compiler that all the
    additions are exact, and only when r is assigned to a double does the
    rounding happen. The compiler would compile the + to an
    add-to-accumulator instruction, and the VVM hardware, upon seeing such instructions, would know that they follow the associative law and can
    be reassociated.

    However, my impression has been that you only intend to have one
    physical accumulator, which results in the questions of how several
    array elements per cycle could be added to this accumulator. Also, if
    several strands of recurrences are appropriate, how is that done?

    Also, what happens if the loop performs two reductions that would need
    the accumulator?

    Finally, most code will not be my66000-specific and will use the type
    "double" in code as above; ok, this code will be compiled with
    -ffast-math if the programmer intends it to be vectorized, but how is
    that compiler flag encoded in the machine code such that VVM knows
    that it can reassociate these additions?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 28 14:19:53 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    [...]

    Without the logarithmic attempts, the general word used to describe
    these things is "reduction".

    Ah, yes, of course.

    [...]


    VVM has the ability to choose execution width (based on HW resources
    and based on data recurrence). In the past I have given examples
    where VVM is executing at "width" and then because of a memory
    "alias" has to drop back to 1-wide until the pointers cross before
    reverting back to full width.

    This algorithm (reduction) is a modification to dynamic width control,
    where width is constant until the final K iterations and then decreases
    by ½ each iteration thereafter. So, fundamentally, VVM does not have
    a problem with reductions "expressed right".

    How would this be expressed? As straightforward serialized code in
    a VVM loop? This would suffer from needing memory accesses to
    store and reload intermediate results.

    [...]

    However, the given problem of 512 bits (64 bytes) might not find much,
    if any, speedup, due to initialization and a potential stutter step
    on each DIV-2 iteration.


    It might be better to allow the HW to recognize that some instructions have
    certain properties and integrate those into VVM recognition so
    that VVM performs a wide calculation; roughly akin to the
    following:

    for(...)
        local_minimum[ k>>3 ] = MIN( a[k,k+7] ); k+=8;
    for(...)
        global_minimum = MIN( local_minimum[i,i+7] ); i+=8;

    An issue here could be the "as if" serial nature of VVM, I think -
    if an interrupt occurs, you would have to restore the state.

    For sizes as small as 512-bits, VVM might not have an advantage.

    Do you mean over SIMD or straightforward serial code? It would be
    good if VVM came within shouting distance of SIMD.

    On the
    other hand, if HW knew certain things about some instructions, the top
    loop might be performed simultaneously with the bottom loop--more or
    less like having an adder that performs {8×u64, 16×u32, 32×u16, 64×u8} calculations simultaneously in reduction form {Exact for integer,
    Single rounding for 2^n FPs} and this wide adder feeds the second
    calculation 1 K×u reduction per cycle in a single merged loop.

    Needs more thought. {Known problem}

    I would guess so...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 17:48:02 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Single rounding for 2^n FPs

    Given that usual FP arithmetics does not satisfy the associative law,
    one cannot transform easily-written reductions like

    double r = 0.0;
    for (...)
        r += a[i];

    Anton, I am surprised you have not heard of Quires!!! where the
    accumulator has the full exponent and fraction width. Posits have
    it, and I have it on authority that IEEE 754 is doing something along
    these lines for next time.

    Given a Quire, one can accumulate any number of terms and get the
    right answer from a single rounding of all terms. Accumulations
    into a quire DO satisfy associativity.

    But you are right, given IEEE 754 AS IS you cannot. And this is the
    problem.
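
    To make the idea concrete, here is a toy software quire for doubles
    (a sketch of my own, not Mitch's hardware; it uses the GCC/Clang
    __int128 extension, and NaN/Inf handling plus the final rounding
    back to double are omitted):

        #include <math.h>
        #include <stdint.h>

        /* one wide two's-complement fixed-point accumulator; LSB weight
           2^-1126 makes every finite double representable, and
           36*64 = 2304 bits leaves ample carry headroom */
        #define LIMBS 36
        typedef struct { uint64_t l[LIMBS]; } quire;

        /* add a signed 128-bit chunk at limb index 'limb',
           sign-extended upward, with carry propagation */
        static void add_chunk(quire *q, int limb, __int128 v)
        {
            uint64_t part[2] = { (uint64_t)v,
                                 (uint64_t)((unsigned __int128)v >> 64) };
            uint64_t ext = (v < 0) ? ~(uint64_t)0 : 0;
            unsigned carry = 0;
            for (int i = limb; i < LIMBS; i++) {
                uint64_t add = (i - limb < 2) ? part[i - limb] : ext;
                uint64_t t = q->l[i] + add;
                uint64_t r = t + carry;
                carry = (t < add) | (r < t);
                q->l[i] = r;
            }
        }

        void quire_add(quire *q, double x)
        {
            int e;
            if (x == 0.0 || !isfinite(x)) return;    /* sketch: skip specials */
            double m = frexp(x, &e);                 /* x = m * 2^e           */
            __int128 frac = (long long)ldexp(m, 53); /* signed 53-bit integer */
            int shift = (e - 53) + 1126;             /* bit position of LSB   */
            /* shift done on the unsigned representation to avoid UB */
            add_chunk(q, shift / 64,
                      (__int128)((unsigned __int128)frac << (shift % 64)));
        }

    Every quire_add is exact, so the accumulation really is associative
    and commutative: any summation order gives bit-identical contents.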

    into any more efficient form (e.g., one with several computation
    strands or one that uses a complete tree evaluation). Compilers use
    the allowance given by -ffast-math to reassociate the computations and
    to vectorize such code. Of course, if you implement FP operations in
    general without inexact results, your FP operations are associative,
    but I fail to see how that can be achieved in general.

    Also, given a Quire, the compiler becomes FREE to reorder arithmetic
    terms in a reduction.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Dec 28 18:38:48 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Anton, I am surprised you have not heard of Quires !!! where the
    accumulator is the full exponent and fraction width.
    ...
    Also, given a Quire, the compiler becomes FREE to reorder arithmetic
    terms in a reduction.

    Am I surprised that you did not read all of what I wrote?
    Unfortunately not.

    Anyway, maybe this time you will read it, and I will spell it out more
    clearly:

    With two 8-wide SIMD DP FP adders and three cycles of latency, you can
    use six strands of SIMD addition (48 strands of scalar addition) to
    make full use of the SIMD units.

    How many accumulators or (if you really need to use that term) quires
    do you have in your architecture and in your microarchitecture; how
    many DP FP additions to one accumulator/quire can be started per
    cycle? And how long does the next bunch of additions then have to
    wait before being started?

    If the end result cannot compete with SIMD units on the machines where
    the programs run, I fear that the accumulator/quire will stay a
    feature for a certain niche of users, while another group of users
    will ignore it for performance reasons.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 19:01:58 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Anton, I am surprised you have not heard of Quires!!! where the
    accumulator has the full exponent and fraction width.
    ...
    Also, given a Quire, the compiler becomes FREE to reorder arithmetic
    terms in a reduction.

    Am I surprised that you did not read all of what I wrote?
    Unfortunately not.

    Anyway, maybe this time you will read it, and I will spell it out more clearly:

    With two 8-wide SIMD DP FP adders and three cycles of latency, you can
    use six strands of SIMD addition (48 strands of scalar addition) to
    make full use of the SIMD units.

    How many accumulators or (if you really need to use that term) quires
    do you have in your architecture and in your microarchitecture; how
    many DP FP additions to one accumulator/quire can be started per
    cycle?

    In a non-CPU design I am working on (AI) we start 32 FMAs (F32) per cycle
    that get reduced into a quire every cycle. 32 FMAs per cycle is the
    BW of the HBMs we are going to use. So, every 24 cycles we compute
    768 FMAs into a single result. 1024 bits from DRAM per cycle at 4 GHz (0.5 TB/s).

    Note: this is not a "programmable unit"; it is more like a fixed GPU FU
    dedicated to an AI HNSW workload.

    And how long does the next bunch of additions then have to
    wait before being started?

    1-cycle after the final reduction begins, you can start a new FADD or FACC.

    If the end result cannot compete with SIMD units on the machines where
    the programs run, I fear that the accumulator/quire will stay a
    feature for a certain niche of users, while another group of users
    will ignore it for performance reasons.

    I will let you reason that one out for yourself.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Dec 28 22:12:30 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Anton, I am surprised you have not heard of Quires !!! where the
    accumulator is the full exponent and fraction width.
    ...
    Also, given a Quire, the compiler becomes FREE to reorder arithmetic
    terms in a reduction.

    Am I surprised that you did not read all of what I wrote?
    Unfortunately not.

    Anyway, maybe this time you will read it, and I will spell it out more clearly:

    With two 8-wide SIMD DP FP adders and three cycles of latency, you can
    use six strands of SIMD addition (48 strands of scalar addition) to
    make full use of the SIMD units.

    How many accumulators or (if you really need to use that term) quires
    do you have in your architecture and in your microarchitecture; how
    many DP FP additions to one accumulator/quire can be started per
    cycle? And how long does the next bunch of additions then have to
    wait before being started?

    If the end result cannot compete with SIMD units on the machines where
    the programs run, I fear that the accumulator/quire will stay a
    feature for a certain niche of users, while another group of users
    will ignore it for performance reasons.

    Mitch has already returned with a unit which can do 32 float aggregations/cycle, so obviously quite fast.

    If I had to design such a beast, I would be very tempted to use a
    carry-save format, i.e., each of the ~1080 bits in the accumulator is
    stored as a bit value plus a possible carry from the bit below.

    This way each double or float value to be aggregated just needs to be
    aligned (per exponent), then a per-bit full adder would require about 5
    gates and just 2 gate delays?

    This should be sufficient to handle at least 6-8 new values per cycle,
    right?
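
    In C, that 3:2 compressor on 64-bit slices of the accumulator would
    look something like this (my sketch; a real quire must forward the
    bit shifted out at each slice boundary into the next slice):

        #include <stdint.h>

        /* carry-save accumulate: acc_s/acc_c are the sum and carry
           bit-planes of one 64-bit slice; invariant: the slice's
           value equals *acc_s + *acc_c */
        static inline void csa_accumulate(uint64_t *acc_s, uint64_t *acc_c,
                                          uint64_t x)
        {
            uint64_t s = *acc_s ^ *acc_c ^ x;    /* per-bit sum */
            uint64_t c = (*acc_s & *acc_c) | (*acc_s & x) | (*acc_c & x);
            *acc_s = s;
            *acc_c = c << 1;                     /* carries move up one bit */
        }

    No carry ever travels more than one bit position per step, which is
    what makes several accumulations per cycle plausible.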

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Dec 28 18:51:31 2025
    From Newsgroup: comp.arch

    On 12/26/2025 1:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    Thomas, this is an excellent "test case" as it brings out at least two
    issues. There has been discussion in this thread about the "reduction"
    problem. Let me start on the other problem, which I call ALU
    underutilization. It is caused by requiring lots of simple operations
    on small data elements. For this example, I assume a four-wide My 66000.

    Let's look at just the first pass. I think the simplest coding would
    have the VVM loop consisting of two load instructions, two add
    instructions to increment the addresses, and a min instruction. Letting
    VVM do its magic, this would generate 4 byte-min operations at a time
    (one per ALU), and thus the loop would be executed 64/4 = 16 times. I
    don't know how your hypothetical SIMD machine would do this, but it
    might do all 64 min operations in a single operation, or perhaps 2.
    This puts VVM at a substantial performance disadvantage.

    I have a possible suggestion to help this. I don't claim it is the best solution.

    The problem stems from using only 8 bits of the 64-bit integer ALU for
    each operation, leading to more operations. So one possible solution
    would be to add a new instruction modifier that tells the system that
    any relevant operations under its mask will do a whole register's worth
    of operations using the size already specified in the operation.
    Since the min instruction would already have specified bytes, with the
    modification the instruction would do 8 byte-min operations at once,
    thus reducing the loop count by a factor of 8. Of course, this
    generalizes to half words and words as well, and to similar "simple"
    instructions such as add/subtract, etc. Note that this already "fits"
    in the existing 64-bit ALUs, with the addition of a little logic to
    suppress carries, etc., to allow the simultaneous use of all the ALU bits.
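
    For reference, carry clipping is also the classic SWAR trick in
    software; e.g. a bytewise subtract with borrows suppressed at byte
    boundaries can be done in one 64-bit operation (a standard identity,
    shown here only as an illustration of the principle):

        #include <stdint.h>

        /* z = x - y in each byte, no borrow crossing byte boundaries:
           the top bit of every byte is forced to 1 so each byte's
           borrow is absorbed locally, then the true top bit is
           patched back in with XOR */
        static inline uint64_t sub_bytes(uint64_t x, uint64_t y)
        {
            const uint64_t H = 0x8080808080808080ULL;
            return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H);
        }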

    Comments?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 29 19:59:38 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/26/2025 1:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    Thomas, this is an excellent "test case" as it brings out at least two
    issues. There has been discussion in this thread about the "reduction"
    problem. Let me start on the other problem, which I call ALU
    underutilization. It is caused by requiring lots of simple operations
    on small data elements. For this example, I assume a four-wide My 66000.

    Let's look at just the first pass. I think the simplest coding would
    have the VVM loop consisting of two load instructions, two add
    instructions to increment the addresses, and a min instruction. Letting
    VVM do its magic, this would generate 4 byte-min operations at a time
    (one per ALU), and thus the loop would be executed 64/4 = 16 times. I
    don't know how your hypothetical SIMD machine would do this, but it
    might do all 64 min operations in a single operation, or perhaps 2.
    This puts VVM at a substantial performance disadvantage.

    I have a possible suggestion to help this. I don't claim it is the best solution.

    The problem stems from using only 8 bits of the 64-bit integer ALU for
    each operation, leading to more operations. So one possible solution
    would be to add a new instruction modifier that tells the system that
    any relevant operations under its mask will do a whole register's worth
    of operations using the size already specified in the operation.

    This is exactly what VVM does, BTW. Smaller-than-register widths are
    SIMDed into single "units of work" up to register width and performed
    with the carry-chains clipped.

    Since the min instruction would already have specified bytes,

    It is the memory instruction that specifies data width.

    with the modification the instruction would do 8 byte-min operations at once,
    thus reducing the loop count by a factor of 8.

    Yes, exactly.

    Of course, this
    generalizes to half words and words as well, and to similar "simple" instructions such as add/subtract, etc.

    All that is modified is where the carry-chains get clipped.

    Note that this already "fits"
    in the existing 64 bit ALUs, with the addition of a little logic to
    suppress carries, etc. to allow the simultaneous use of all the ALU bits.

    Comments?



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Dec 29 12:14:26 2025
    From Newsgroup: comp.arch

    On 12/29/2025 11:59 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/26/2025 1:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    Thomas, this is an excellent "test case" as it brings out at least two
    issues. There has been discussion in this thread about the "reduction"
    problem. Let me start on the other problem, which I call ALU
    underutilization. It is caused by requiring lots of simple operations
    on small data elements. For this example, I assume a four-wide My 66000.

    Let's look at just the first pass. I think the simplest coding would
    have the VVM loop consisting of two load instructions, two add
    instructions to increment the addresses, and a min instruction. Letting
    VVM do its magic, this would generate 4 byte-min operations at a time
    (one per ALU), and thus the loop would be executed 64/4 = 16 times. I
    don't know how your hypothetical SIMD machine would do this, but it
    might do all 64 min operations in a single operation, or perhaps 2.
    This puts VVM at a substantial performance disadvantage.

    I have a possible suggestion to help this. I don't claim it is the best
    solution.

    The problem stems from using only 8 bits of the 64-bit integer ALU for
    each operation, leading to more operations. So one possible solution
    would be to add a new instruction modifier that tells the system that
    any relevant operations under its mask will do a whole register's worth
    of operations using the size already specified in the operation.

    This is exactly what VVM does, BTW. Smaller-than-register widths are
    SIMDed into single "units of work" up to register width and performed
    with the carry-chains clipped.

    Oh, I didn't realize that VVM already did that. Bravo!



    Since the min instruction would already have specified bytes,

    It is the memory instruction that specifies data width.

    I thought with your latest modifications to the ISA that instructions
    like min specified a data width. But using the width specified in the
    memory reference instruction seems fine. I can't think of a useful case
    where the two would be different.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 29 21:17:53 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/29/2025 11:59 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/26/2025 1:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    Thomas, this is an excellent "test case" as it brings out at least two
    issues. There has been discussion in this thread about the "reduction"
    problem. Let me start on the other problem, which I call ALU
    underutilization. It is caused by requiring lots of simple operations
    on small data elements. For this example, I assume a four-wide My 66000.
    Let's look at just the first pass. I think the simplest coding would
    have the VVM loop consisting of two load instructions, two add
    instructions to increment the addresses, and a min instruction. Letting
    VVM do its magic, this would generate 4 byte-min operations at a time
    (one per ALU), and thus the loop would be executed 64/4 = 16 times. I
    don't know how your hypothetical SIMD machine would do this, but it
    might do all 64 min operations in a single operation, or perhaps 2.
    This puts VVM at a substantial performance disadvantage.

    I have a possible suggestion to help this. I don't claim it is the best solution.

    The problem stems from using only 8 bits of the 64-bit integer ALU for
    each operation, leading to more operations. So one possible solution
    would be to add a new instruction modifier that tells the system that
    any relevant operations under its mask will do a whole register's worth
    of operations using the size already specified in the operation.

    This is exactly what VVM does, BTW. Smaller-than-register widths are
    SIMDed into single "units of work" up to register width and performed
    with the carry-chains clipped.

    Oh, I didn't realize that VVM already did that. Bravo!

    It is how the HW works::

    An integer adder is comprised of 8×9-bit sections. You feed 8 data
    bits into each section, and you feed into the 9th bit::
        00 if you want the carry clipped,
        01 if you want the carry propagated,
        11 if you want a carry generated for the next section;

    An adder comprised of 9-bit sections is no more gates of delay than one comprised of 8-bit sections.

    Since the min instruction would already have specified bytes,

    It is the memory instruction that specifies data width.

    I thought with your latest modifications to the ISA that instructions
    like min specified a data width. But using the width specified in the memory reference instruction seems fine. I can't think of a useful case where the two would be different.

    Whereas the current ISA does have size/calculation, I have not had the time
    to go back and examine VVM to the extent necessary to make any statements
    on how VVM would work with these.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Jan 1 09:51:17 2026
    From Newsgroup: comp.arch

    On 12/29/2025 1:17 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/29/2025 11:59 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/26/2025 1:57 PM, Thomas Koenig wrote:
    (This might be blindingly obvious to most regulars, but I thought
    I'd post this, just in case for some discussion)

    SIMD is not always about vectorizing loops; it can also be used
    for tree-shaped reductions (not sure what the canonical name is).

    Consider the following problem: You have 128 consecutive bytes and
    want to find the minimum value, and you have 512-bit SIMD registers.

    Thomas, this is an excellent "test case" as it brings out at least two
    issues. There has been discussion in this thread about the "reduction"
    problem. Let me start on the other problem, which I call ALU
    underutilization. It is caused by requiring lots of simple operations
    on small data elements. For this example, I assume a four-wide My 66000.

    Let's look at just the first pass. I think the simplest coding would
    have the VVM loop consisting of two load instructions, two add
    instructions to increment the addresses, and a min instruction. Letting
    VVM do its magic, this would generate 4 byte-min operations at a time
    (one per ALU), and thus the loop would be executed 64/4 = 16 times. I
    don't know how your hypothetical SIMD machine would do this, but it
    might do all 64 min operations in a single operation, or perhaps 2.
    This puts VVM at a substantial performance disadvantage.

    I have a possible suggestion to help this. I don't claim it is the best
    solution.

    The problem stems from using only 8 bits of the 64-bit integer ALU for
    each operation, leading to more operations. So one possible solution
    would be to add a new instruction modifier that tells the system that
    any relevant operations under its mask will do a whole register's worth
    of operations using the size already specified in the operation.

    This is exactly what VVM does, BTW. Smaller-than-register widths are
    SIMDed into single "units of work" up to register width and performed
    with the carry-chains clipped.

    Oh, I didn't realize that VVM already did that. Bravo!

    It is how the HW works::

    An integer adder is comprised of 8×9-bit sections. You feed 8 data
    bits into each section, and you feed into the 9th bit::
        00 if you want the carry clipped,
        01 if you want the carry propagated,
        11 if you want a carry generated for the next section;

    An adder comprised of 9-bit sections is no more gates of delay than one comprised of 8-bit sections.

    I understand that can be done fairly easily. A minor point: you seem to
    be specifying a two-bit value into a single (the ninth) bit.

    But my astonishment is that the HW, as you say, "SIMD"s eight byte
    operations into a single 64-bit operation. Great for efficiency, but it
    seems to violate the syntax of what was in the source code, which was
    single-byte operations. For Min, that part isn't an issue (but see
    below), but the syntax specified ending up with a single byte. Does the
    hardware "automagically" do some hidden min operations among the eight
    bytes in the register to end up with a single byte?

    Note also that this works well for min (and max), but for add, etc.: if
    the data is unsigned, only if you don't care where in the sequence an
    overflow occurred, and if the data is signed, only if you don't even
    care whether overflow occurred.


    Since the min instruction would already have specified bytes,

    It is the memory instruction that specifies data width.

    I thought with your latest modifications to the ISA that instructions
    like min specified a data width. But using the width specified in the
    memory reference instruction seems fine. I can't think of a useful case
    where the two would be different.

    Whereas the current ISA does have size/calculation, I have not had the time to go back and examine VVM to the extent necessary to make any statements
    on how VVM would work with these.

    Fair enough!
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Jan 1 22:21:36 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Note also that this works well for min (and max), but for add, etc.: if
    the data is unsigned, only if you don't care where in the sequence an
    overflow occurred, and if the data is signed, only if you don't even
    care whether overflow occurred.

    Addition with trapping on overflow or with overflow detection is not associative. Addition modulo 2^n is.

    If you have a sticky overflow bit (Power has something in that
    direction), or if you trap on overflow (MIPS and Alpha have such
    instructions for signed addition), you will certainly notice if an
    overflow occurred for signed additions for the particular evaluation
    used on the machine. If there are evaluations with and without
    overflow, I think that those without are preferable (although probably
    not enough to pay the performance cost for ensuring that).

    Which programming language are you thinking of where reduction uses
    some addition with overflow detection?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Jan 1 22:15:56 2026
    From Newsgroup: comp.arch

    On 1/1/2026 2:21 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Note also that this works well for min (and max), but for add, etc.: if
    the data is unsigned, only if you don't care where in the sequence an
    overflow occurred, and if the data is signed, only if you don't even
    care whether overflow occurred.

    Addition with trapping on overflow or with overflow detection is not associative. Addition modulo 2^n is.

    Yes. You have given a more precise mathematical statement of what I
    said above.


    If you have a sticky overflow bit (Power has something in that
    direction), or if you trap on overflow (MIPS and Alpha have such
    instructions for signed addition), you will certainly notice if an
    overflow occurred for signed additions for the particular evaluation
    used on the machine. If there are evaluations with and without
    overflow, I think that those without are preferable (although probably
    not enough to pay the performance cost for ensuring that).

    Which programming language are you thinking of where reduction uses
    some addition with overflow detection?

    I wasn't thinking of any programming language in particular. I was
    trying to clarify my "astonishment" and work through the details of what
    Mitch said earlier about SIMDifying the calculations in the context of
    VVM. I was trying to understand exactly what the hardware was doing in
    the various cases.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jan 2 09:19:44 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    I was
    trying to clarify my "astonishment" and work through the details of what
    Mitch said earlier about SIMDifying the calculations in the context of
    VVM. I was trying to understand exactly what the hardware was doing in
    the various cases.
    VVM. I was trying to understand exactly what the hardware was doing in
    the various cases.

    I don't know how MY66000 deals with overflow detection. I assume that
    MY66000 has instructions that add modulo 2^n. If the code uses these instructions (as code compiled from typical C, C++, or Java code
    does), this is no obstacle to reassociation and to vectorizing the
    reduction.

    A bigger problem may be that in C (and a number of other languages)
    all shorter integers are implicitly converted to int before any
    operation, and the user is likely to use at least int, but maybe even
    long for the accumulator in the reduction:

    /* int8_t a[n]; */
    long r = 0;
    for (i = 0; i < n; i++)
        r += a[i];

    Here the compiler will compile the a[i] into a sign-extending load
    and the += into a 64-bit addition. This means that any kind of
    auto-vectorization (whether the existing one using SIMD instructions or
    the hardware implementation in VVM) will have to do the sign-extending
    expansion to 64-bit values, reducing the speedup. In the unlikely
    case that the programmers are interested in a result modulo 2^8, they
    can write

    int8_t r = 0;

    instead (and compile with -fwrapv, which I recommend in any case); I
    think that MY66000 has width on computation operations, which I
    usually consider a waste of encoding bits, but in connection with VVM
    it works out: the compiler can optimize the "r += ..." into an
    addition modulo 2^8, and VVM can then use parallel 8-bit additions to
    speed up the loop part of the reduction.

    If somebody writes

    /* int8_t a[n]; */
    int r = INT_MAX;
    for (i = 0; i < n; i++)
        r = min(r, a[i]);

    Then a sufficiently advanced compiler could conclude that, as soon as
    i > 0, the range of r is the range of int8_t, and also use 8-bit min in
    that case.

    One would then need such an advanced compiler to produce an 8-bit min instruction for MY66000, and the VVM could achieve the same speedup.
    But the compiler does not really get simpler.

    Alternatively, the programmer could write

    int8_t r = 0;

    and the compiler would need to be less advanced to vectorize it using
    SIMD instructions, and even less advanced for VVM.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 2 18:52:29 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 1/1/2026 2:21 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Note also that this works well for min (and max), but for add, etc.: if
    the data is unsigned, only if you don't care where in the sequence an
    overflow occurred, and if the data is signed, only if you don't even
    care whether overflow occurred.

    Addition with trapping on overflow or with overflow detection is not associative. Addition modulo 2^n is.

    Yes. You have given a more precise mathematical statement of what I
    said above.

    Just ANOTHER reason the default integer should be UNSIGNED.


    If you have a sticky overflow bit (Power has something in that
    direction), or if you trap on overflow (MIPS and Alpha have such instructions for signed addition), you will certainly notice if an
    overflow occurred for signed additions for the particular evaluation
    used on the machine. If there are evaluations with and without
    overflow, I think that those without are preferable (although probably
    not enough to pay the performance cost for ensuring that).

    Which programming language are you thinking of where reduction uses
    some addition with overflow detection?

    I wasn't thinking of any programming language in particular. I was
    trying to clarify my "astonishment" and work through the details of what Mitch said earlier about SIMDifying the calculations in the context of
    VVM. I was trying to understand exactly what the hardware was doing in
    the various cases.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 2 19:01:29 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    I was
    trying to clarify my "astonishment" and work through the details of what
    Mitch said earlier about SIMDifying the calculations in the context of
    VVM. I was trying to understand exactly what the hardware was doing in
    the various cases.

    I don't know how MY66000 deals with overflow detection.

    Overflow is any significance beyond the signed result container size.

    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when
    2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    I assume that MY66000 has instructions that add modulo 2^n.

    Called, surprisingly, Unsigned.

    If the code uses these instructions (as code compiled from typical C, C++, or Java code
    does), this is no obstacle to reassociation and to vectorizing the
    reduction.

    Unsigned has a lot fewer surprises than Signed.

    A bigger problem may be that in C (and a number of other languages)
    all shorter integers are implicitly converted to int before any
    operation,

    No, unsigned char is converted to unsigned int, not int.
    Up-conversions maintain signedness.

    and the user is likely to use at least int, but maybe even
    long for the accumulator in the reduction:

    ----------------

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jan 2 19:29:41 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when
    2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    I doubt it. If you really care for overflow, just produce a 66-bit
    result, and check if it is in the range [-2^63,2^63).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Jan 2 14:52:02 2026
    From Newsgroup: comp.arch

    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when 2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    How 'bout sign-extending your 64 bits to 65 bits, then doing your ADD3 and
    then making sure the top 2 bits of the result are equal?


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Jan 2 18:56:32 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when 2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    res1 = a + b
    ovr1 = (res1<63> ^ a<63>) & (res1<63> ^ b<63>)
    res2 = res1 + c
    ovr2 = ((res2<63> ^ res1<63>) & (res2<63> ^ c<63>)) | ovr1

    This detects if either (a+b) or (res1+c) overflows but also
    signals overflow if (a+b) overflows but then +c wraps it back in range.
    That is, a = INT_MAX, b = 1, c = -1, then a+b+c == INT_MAX
    but the above would set ovr1 = 1 because a+b overflowed.

    This unnecessary transitory overflow (one interior to an expression)
    can be avoided by sign extending a,b,c to 66 bits, doing the add,
    and checking result<65:63> are all 1 or 0.
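
    In C, that 66-bit variant can be phrased with __int128 as the wide
    type (my illustration of the same check, not EricP's code):

        #include <stdint.h>

        /* sign-extend a, b, c well past 64 bits, add exactly, then
           test whether the sum fits a signed 64-bit container */
        static inline int add3_overflows(int64_t a, int64_t b, int64_t c)
        {
            __int128 wide = (__int128)a + b + c;  /* exact: fits in 66 bits  */
            return wide != (int64_t)wide;         /* outside [-2^63, 2^63)?  */
            /* the narrowing conversion wraps on GCC/Clang */
        }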


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jan 3 01:49:10 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when
    2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    I doubt it. If you really care for overflow, just produce a 66-bit
    result, and check if it is in the range [-2^63,2^63).

    Which is why the spec is written "significance above container size".
    That is: the bits above the container are not all 0s or all 1s (-1).

    Works for + with CARRY, *, 128-bit->64-bit DIV, shifts, ...


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jan 3 01:50:37 2026
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    However, last week someone suggested a 3-input addition, which I gave
    some thought--Exactly what is the overflow condition of an ADD3 when 2 operands have one sign bit and the other has the opposite sign bit??

    In the 2-input case:: result<63> != Operand1<63> ^ Operand2<63>
    Is there a similar 3-input equation ??

    How 'bout sign-extending your 64 bits to 65 bits, then doing your ADD3 and
    then making sure the top 2 bits of the result are equal?

    In HW one can simply set up a data-path and check <127..64> = 0 or -1.
    Remember I have CARRY on this data-path.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2