Forum: Too Lazy BBS

Re: Matmul in VVM

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 25 12:00:52 2026

From Newsgroup: comp.arch

On 5/3/2026 3:28 PM, MitchAlsup wrote:

Thomas Koenig <tkoenig@netcologne.de> posted:

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
for (int k=0; k<N; k++) {
for (int i=0; i<N; i++) {
c[i + j*N] += a[i + k*N] * b[k + j*N];
}
}
}
}

C version loop invariant, and cursoring

#define N 8
void mm8(double *a, double *b, double *c)
{
int i,j,jN,k,kN;
double *AcijN,*AbkjN,*AaijN;

for( jN=0; jN<N*N; jN+=N ) {
AcijN = &c[jN];
AbkjN = &b[jN];
for( kN=k=0; k<N; k++,kN+=N ) {
AaikN = &a[kN];
bN = AbkjN[k];
for( i=0; i<N; i++ ) {
AcijN[i] += AaikN[i] * bN;
}
}
}
}

I did get this into:

mm8:
; R1 = &a[0];
; R2 = &b[0];
; R3 = &c[0]; -------------------------------------------------------------
MOV RjN,#0 ; R4
loop1:
LA RcijN,[Rc,RjN<<3] ; R5
LA RbkjN,[Rb,RjN<<3] ; R6 -------------------------------------------------------------
MOV RkN,#0 ; R7
MOV Rk,#0 ; R8
loop2:
LA RaikN,[Ra,RkN<<3] ; R9
LDD RbN,[RbkjN,Rk<<3] ; R10 -------------------------------------------------------------
MOV Ri,#0 ; R11
VEC 8,{}
loop3:
LDD Ra,[RaikN,Ri<<3] ; R12
LDD Rc,[RcijN,Ri<<3] ; R13
FMAC Rc,Ra,Rb,Rc ; R14
STD Rc,[RcijN,Ri<<3] ;

LOOP1 LE,Ri,#1,#8 ; R11 -------------------------------------------------------------
ADD Rk,Rk,#1 ; R8
ADD RkN,RkN,#8 ; R7
CMP Rt,Rk,#8 ; R11
BLE Rt,loop2 -------------------------------------------------------------
ADD RjN,RjN,#8 ; R4
CMP Rt,RjN,#64 ; R7
BLE Rt,Loop1 -------------------------------------------------------------
RET

without needing any preserved registers.

b[k + j*N] is invariant for the innermost loop. So, for N=8, there are
64 double reads for b. For a and c are 512 reads of doubles each,
512 doubles are written for c. Total, 1600 memory access for doubles.

By comparison, the SIMD code reads 192 doubles and writes 64, the
minimum, for a total of 256. This is a factor of 6.25.

It occurs to me that c[*] should be set to zero for a "real" matrix multiply...as is c[*] is both input and output.

----------------------------------

#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
double c0 = c[0 + j*N];
double c1 = c[1 + j*N];
double c2 = c[2 + j*N];
double c3 = c[3 + j*N];
double c4 = c[4 + j*N];
double c5 = c[5 + j*N];
double c6 = c[6 + j*N];
double c7 = c[7 + j*N];
for (int k=0; k<N; k++) {
double bk = b[k + j*N];
c0 += a[0 + k*N] * bk;
c1 += a[1 + k*N] * bk;
c2 += a[2 + k*N] * bk;
c3 += a[3 + k*N] * bk;
c4 += a[4 + k*N] * bk;
c5 += a[5 + k*N] * bk;
c6 += a[6 + k*N] * bk;
c7 += a[7 + k*N] * bk;
}
/* write back c0 to c7 */
}
}
}

where the loop over k could be vectorized, but that would still
leave eccessive memory traffic for a.

ENTER Rc1,Rc8,#0 ; preserve c[1..8]
MOV RjN,#0 ; R4
loop1:
LA Rca,[Rc,RjN<<3] ; &c[1..8]
LDD Rc1,[Rca,#0] ; R23
LDD Rc2,[Rca,#8]
LDD Rc3,[Rca,#16]
LDD Rc4,[Rca,#24]
LDD Rc5,[Rca,#32]
LDD Rc6,[Rca,#40]
LDD Rc7,[Rca,#48]
LDD Rc8,[Rca,#56] ; R30

MOV RkN,#0 ; R5
---------------begin vectorize-------------------
VEC 8,{Rc1..Rc8}
loop2:
LDD Rbk,[R2,RjN<<3] ; R6

LA RakN,[Ra,RkN<<3] ; R7
LDD Ra1,[RakN,#0] ; R8
FMAC Rc1,Ra1,Rbk,Rc1 ; R23
LDD Ra2,[RakN,#8] ; R7
FMAC Rc2,Ra2,Rbk,Rc2 ; R24
LDD Ra3,[RakN,#16] ; R7
FMAC Rc3,Ra3,Rbk,Rc3 ; R25
LDD Ra4,[RakN,#24] ; R7
FMAC Rc4,Ra4,Rbk,Rc4 ; R26
LDD Ra5,[RakN,#32] ; R7
FMAC Rc5,Ra2,Rbk,Rc6 ; R27
LDD Ra6,[RakN,#40] ; R7
FMAC Rc6,Ra2,Rbk,Rc6 ; R28
LDD Ra7,[RakN,#48] ; R7
FMAC Rc7,Ra2,Rbk,Rc7 ; R29
LDD Ra8,[RakN,#56] ; R7
FMAC Rc8,Ra8,Rbk,Rc8 ; R30

LOOP1 LE,RkN,#8,#64 ; R4
---------------end vectorize-------------------

ADD RkN,RkN,#8 ; R5
CMP Rt,RkN,$64 ; R6
BLE Rt,loop1

STD Rc1,[Rca,#0]
STD Rc2,[Rca,#8]
STD Rc3,[Rca,#16]
STD Rc4,[Rca,#24]
STD Rc5,[Rca,#32]
STD Rc6,[Rca,#40]
STD Rc7,[Rca,#48]
STD Rc8,[Rca,#56]

EXIT Rc1,Rc8,#0
RET

46 instructions 19 instructions in vectorized (unrolled) loop.

c[k] is read once and written once
b[k] is read 8|u
a[k] is read 8|u

If you are willing to have 64 FMACs in a row; a[k] can be read 2|u
{with very tr1cky register allocation}.

Using this many registers causes 64 bytes to be written to stack
and read back later. Solving the a[k] traffic increases the stack
footprint to 104 bytes.

The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this ability--although a few GPU ISAs do.

I have been thinking about this and have come up with another potential solution. The usual caveats - I am not a hardware designer, and don't
know the guts of the My 6600, nor am I a numerical analyst, so this may
have some problems or even be totally unworkable. But if it works, by allowing a sort of register indexing equivalent, I think it could make a substantial reduction in the memory traffic. There are two parts to
this proposal.

First, is to enhance the TT instruction, using the currently unused bit combination of the BOB field to indicate, execute the instruction
pointed to by the displacement plus the contents of SRC1 (similarly to
how the values 00 and 11 do now. After execution of that instruction,
which must not be a control transfer instruction, control is returned to
the instruction after the TT instruction. This makes the TT instruction behave similarly to an "Execute" instruction in some other
architectures. For our current example, the instructions at the target address would be eight FMAC instructions, each using a different
register(s). Of course, I realize that this adds an "extra" instruction execution, and that substituting an I-cache read (the executed
instruction) in place of a load doesn't seem like a savings. I can't do anything about extra instruction execution, but see below.

So second, I think there is an enhancement that would eliminate most of
the instruction fetches of the executed instructions. As I understand
it, in VVM, when they are first encountered, the instructions between
the VEC and the Loop instructions are fetched and stored in a special
memory within the CPU, thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is,once the Loop
instruction has been encountered, you know how many of the executed instructions can fir in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in
this case it eliminates 7/8, i.e. 87.5% of them.

So overall, I think this idea reduces the memory traffic cost of keeping
the A matrix in registers by a huge amount. It also eliminates any
"mucking around" with the OoO mechanism to handle not knowing which
registers are involved at instruction decode time that my previous idea had.

As I said, I am sure there are issues with this. I welcome your comments.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 19:36:14 2026

From Newsgroup: comp.arch

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

On 5/3/2026 3:28 PM, MitchAlsup wrote:

Thomas Koenig <tkoenig@netcologne.de> posted:

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

----------------------------

I have been thinking about this and have come up with another potential solution. The usual caveats - I am not a hardware designer, and don't
know the guts of the My 6600, nor am I a numerical analyst, so this may
have some problems or even be totally unworkable. But if it works, by allowing a sort of register indexing equivalent, I think it could make a substantial reduction in the memory traffic. There are two parts to
this proposal.

First, is to enhance the JTT instruction, using the currently unused bit combination of the BOB field to indicate, execute the instruction
pointed to by the displacement plus the contents of SRC1 (similarly to
how the values 00 and 11 do now.

A minor issue is the out-of-sequence Fetch problem this introduces.
Not un-workable, but an annoyance. Perhaps Call-through-Table would
work better--let me think on it.

After execution of that instruction,
which must not be a control transfer instruction, control is returned to
the instruction after the TT instruction. This makes the TT instruction behave similarly to an "Execute" instruction in some other
architectures. For our current example, the instructions at the target address would be eight FMAC instructions, each using a different register(s).

This is why a CTT is better, you can perform multiple instructions before returning.

Of course, I realize that this adds an "extra" instruction execution, and that substituting an I-cache read (the executed
instruction) in place of a load doesn't seem like a savings. I can't do anything about extra instruction execution, but see below.

So second, I think there is an enhancement that would eliminate most of
the instruction fetches of the executed instructions. As I understand
it, in VVM, when they are first encountered, the instructions between
the VEC and the Loop instructions are fetched and stored in a special
memory within the CPU,

A very minor change to Reservation Station logic, where static operands
can be used multiple times, and where the instruction remains present
after being fired until Loop is satisfied. Each RS operand contains an
index field that matches on the Loop iteration index.

thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop instruction has been encountered, you know how many of the executed instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in this case it eliminates 7/8, i.e. 87.5% of them.

Eliminates = 1.0-1.0/(loop_count)

So overall, I think this idea reduces the memory traffic cost of keeping
the A matrix in registers by a huge amount.

The issue is that there can be no-transfers-of-control* out of a vVM
loop while remaining IN a vVM loop. {This simplified HW by enormous
amounts. vVM only vectorizes the innermost loop.

(*) Predicated flow control is allowed, but branches/calls/SVC are not.

It also eliminates any
"mucking around" with the OoO mechanism to handle not knowing which registers are involved at instruction decode time that my previous
idea had.

So does register[indexing] and out-of-line instruction execution.

As I said, I am sure there are issues with this. I welcome your comments.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 25 17:52:51 2026

From Newsgroup: comp.arch

On 5/25/2026 12:36 PM, MitchAlsup wrote:

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

On 5/3/2026 3:28 PM, MitchAlsup wrote:

Thomas Koenig <tkoenig@netcologne.de> posted:

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

----------------------------

I have been thinking about this and have come up with another potential
solution. The usual caveats - I am not a hardware designer, and don't
know the guts of the My 6600, nor am I a numerical analyst, so this may
have some problems or even be totally unworkable. But if it works, by
allowing a sort of register indexing equivalent, I think it could make a
substantial reduction in the memory traffic. There are two parts to
this proposal.

First, is to enhance the JTT instruction, using the currently unused bit
combination of the BOB field to indicate, execute the instruction
pointed to by the displacement plus the contents of SRC1 (similarly to
how the values 00 and 11 do now.

A minor issue is the out-of-sequence Fetch problem this introduces.
Not un-workable, but an annoyance. Perhaps Call-through-Table would
work better--let me think on it.

I understand. My rationale for using a different ROB value was to allow
that functionality to be allowed within a VVM loop, whereas the other
values invoke a jump or call, which you don't want to allow in VVM. In
other words an indication that this is a small exception to the no
control transfer rule and should be allowed. But obviously you
understand the internals better than I do.

After execution of that instruction,
which must not be a control transfer instruction, control is returned to
the instruction after the TT instruction. This makes the TT instruction
behave similarly to an "Execute" instruction in some other
architectures. For our current example, the instructions at the target
address would be eight FMAC instructions, each using a different
register(s).

This is why a CTT is better, you can perform multiple instructions before returning.

I understand. If you want to go there within VVM fine. I was trying to
avoid that.

Of course, I realize that this adds an "extra" instruction
execution, and that substituting an I-cache read (the executed
instruction) in place of a load doesn't seem like a savings. I can't do
anything about extra instruction execution, but see below.

So second, I think there is an enhancement that would eliminate most of
the instruction fetches of the executed instructions. As I understand
it, in VVM, when they are first encountered, the instructions between
the VEC and the Loop instructions are fetched and stored in a special
memory within the CPU,

A very minor change to Reservation Station logic, where static operands
can be used multiple times, and where the instruction remains present
after being fired until Loop is satisfied. Each RS operand contains an
index field that matches on the Loop iteration index.

thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop
instruction has been encountered, you know how many of the executed
instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in
this case it eliminates 7/8, i.e. 87.5% of them.

Eliminates = 1.0-1.0/(loop_count)

??? I thought you would fetch the "executed" instruction once and have
it internally for the next seven loop iterations. But I may be wrong.

So overall, I think this idea reduces the memory traffic cost of keeping
the A matrix in registers by a huge amount.

The issue is that there can be no-transfers-of-control* out of a vVM
loop while remaining IN a vVM loop. {This simplified HW by enormous
amounts. vVM only vectorizes the innermost loop.

I agree. This would have to be an exception, which is why I thought a
unique value for ROP could indicate that.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 26 17:48:56 2026

From Newsgroup: comp.arch

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

On 5/25/2026 12:36 PM, MitchAlsup wrote:

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

On 5/3/2026 3:28 PM, MitchAlsup wrote:

Thomas Koenig <tkoenig@netcologne.de> posted:

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

----------------------------

I have been thinking about this and have come up with another potential
solution. The usual caveats - I am not a hardware designer, and don't
know the guts of the My 6600, nor am I a numerical analyst, so this may
have some problems or even be totally unworkable. But if it works, by
allowing a sort of register indexing equivalent, I think it could make a >> substantial reduction in the memory traffic. There are two parts to
this proposal.

First, is to enhance the JTT instruction, using the currently unused bit >> combination of the BOB field to indicate, execute the instruction
pointed to by the displacement plus the contents of SRC1 (similarly to
how the values 00 and 11 do now.

A minor issue is the out-of-sequence Fetch problem this introduces.
Not un-workable, but an annoyance. Perhaps Call-through-Table would
work better--let me think on it.

I understand. My rationale for using a different ROB value was to allow that functionality to be allowed within a VVM loop, whereas the other
values invoke a jump or call, which you don't want to allow in VVM. In other words an indication that this is a small exception to the no
control transfer rule and should be allowed. But obviously you
understand the internals better than I do.

After execution of that instruction,
which must not be a control transfer instruction, control is returned to >> the instruction after the TT instruction. This makes the TT instruction >> behave similarly to an "Execute" instruction in some other
architectures. For our current example, the instructions at the target
address would be eight FMAC instructions, each using a different
register(s).

This is why a CTT is better, you can perform multiple instructions before returning.

I understand. If you want to go there within VVM fine. I was trying to avoid that.

Of course, I realize that this adds an "extra" instruction >> execution, and that substituting an I-cache read (the executed
instruction) in place of a load doesn't seem like a savings. I can't do >> anything about extra instruction execution, but see below.

So second, I think there is an enhancement that would eliminate most of
the instruction fetches of the executed instructions. As I understand
it, in VVM, when they are first encountered, the instructions between
the VEC and the Loop instructions are fetched and stored in a special
memory within the CPU,

A very minor change to Reservation Station logic, where static operands
can be used multiple times, and where the instruction remains present
after being fired until Loop is satisfied. Each RS operand contains an index field that matches on the Loop iteration index.

thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop
instruction has been encountered, you know how many of the executed
instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in >> this case it eliminates 7/8, i.e. 87.5% of them.

Eliminates = 1.0-1.0/(loop_count)

??? I thought you would fetch the "executed" instruction once and have
it internally for the next seven loop iterations. But I may be wrong.

Once the loop is in the reservation stations, they stay there for the
entire execution of the loop! while FETCH remains quiescent.

So overall, I think this idea reduces the memory traffic cost of keeping >> the A matrix in registers by a huge amount.

The issue is that there can be no-transfers-of-control* out of a vVM
loop while remaining IN a vVM loop. {This simplified HW by enormous amounts. vVM only vectorizes the innermost loop.

I agree. This would have to be an exception, which is why I thought a unique value for ROP could indicate that.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue May 26 11:55:11 2026

From Newsgroup: comp.arch

On 5/26/2026 10:48 AM, MitchAlsup wrote:

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

On 5/25/2026 12:36 PM, MitchAlsup wrote:

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

On 5/3/2026 3:28 PM, MitchAlsup wrote:

snip

A very minor change to Reservation Station logic, where static operands
can be used multiple times, and where the instruction remains present
after being fired until Loop is satisfied. Each RS operand contains an
index field that matches on the Loop iteration index.

thus allowing multiple iterations of the loop >>>> without multiple I-cache accesses. So the idea is, once the Loop
instruction has been encountered, you know how many of the executed
instructions can fit in the remaining space (in this case, I think all >>>> of them), and where to start them within this memory, (right after the >>>> loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in >>>> this case it eliminates 7/8, i.e. 87.5% of them.

Eliminates = 1.0-1.0/(loop_count)

??? I thought you would fetch the "executed" instruction once and have
it internally for the next seven loop iterations. But I may be wrong.

Once the loop is in the reservation stations, they stay there for the
entire execution of the loop! while FETCH remains quiescent.

Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations, the
eight instructions pointed to be the TT instruction. It is those eight fetches out of the 64 executions of those instructions that led me to
the (64-8)/64 = 7/8 reduction in the extra fetches that would otherwise
be required.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed May 27 10:57:01 2026

From Newsgroup: comp.arch

Stephen Fuld [2026-05-26 11:55:11] wrote:

Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to be the TT instruction. It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.

Oh, I think I understand your proposal. You want a kind of predication
but instead of being a "predication-style `if`" it's a "predication-style `case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out
(i.e. is executed) depends on the value in register Rn?

=== Stefan
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed May 27 11:13:26 2026

From Newsgroup: comp.arch

On 5/27/2026 7:57 AM, Stefan Monnier wrote:

Stephen Fuld [2026-05-26 11:55:11] wrote:

Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to be the TT instruction. It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.

Oh, I think I understand your proposal. You want a kind of predication
but instead of being a "predication-style `if`" it's a "predication-style `case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out (i.e. is executed) depends on the value in register Rn?

Yes, with the exception that they don't have to immediately follow what
you call the NATPRED instruction (which is actually a modified TT instruction). But once the instructions are loaded into the
reservations stations, the result is exactly as you describe. And, of
course, you don't have to "skip over" the instructions not executed, you
just use the M value to choose which of the instructions to execute.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu May 28 07:43:47 2026

From Newsgroup: comp.arch

On 5/27/2026 11:13 AM, Stephen Fuld wrote:

On 5/27/2026 7:57 AM, Stefan Monnier wrote:

Stephen Fuld [2026-05-26 11:55:11] wrote:

Yes.-a That is what I thought.-a This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to be the TT instruction.-a It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.

Oh, I think I understand your proposal.-a You want a kind of predication
but instead of being a "predication-style `if`" it's a "predication-style
`case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out
(i.e. is executed) depends on the value in register Rn?

Yes, with the exception that they don't have to immediately follow what
you call the NATPRED instruction (which is actually a modified TT instruction).-a But once the instructions are loaded into the
reservations stations, the result is exactly as you describe.-a And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose which of the instructions to execute.

Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the
functionality to make use of the already existing mechanism to get
predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
a VVM loop rule.

And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception would be to allow a
third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.

Thank you Stefan for helping me see this.

So now the question is, does this save enough to be worth implementing?
I don't know enough to write prototype code for say MATMUL (8) to see
how much the savings would be, nor whether this mechanism could help in
other functions. Mitch? Thomas?
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu May 28 07:53:59 2026

From Newsgroup: comp.arch

On 5/28/2026 7:43 AM, Stephen Fuld wrote:

On 5/27/2026 11:13 AM, Stephen Fuld wrote:

On 5/27/2026 7:57 AM, Stefan Monnier wrote:

Stephen Fuld [2026-05-26 11:55:11] wrote:

Yes.-a That is what I thought.-a This would be an exception to that in >>>> that you would have to fetch and put into the reservation stations,
the eight instructions pointed to be the TT instruction.-a It is those >>>> eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.

Oh, I think I understand your proposal.-a You want a kind of predication >>> but instead of being a "predication-style `if`" it's a "predication-
style
`case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out >>> (i.e. is executed) depends on the value in register Rn?

Yes, with the exception that they don't have to immediately follow
what you call the NATPRED instruction (which is actually a modified TT
instruction).-a But once the instructions are loaded into the
reservations stations, the result is exactly as you describe.-a And, of
course, you don't have to "skip over" the instructions not executed,
you just use the M value to choose which of the instructions to execute.

Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code.-a This allows the functionality to make use of the already existing mechanism to get predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
a VVM loop rule.

And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception

Sorry, "extension", not "exception"

would be to allow a

third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc.-a I
don't know how useful this sort of thing would be.

Thank you Stefan for helping me see this.

So now the question is, does this save enough to be worth implementing?
I don't know enough to write prototype code for say MATMUL (8) to see
how much the savings would be, nor whether this mechanism could help in other functions.-a Mitch? Thomas?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 28 18:08:09 2026

From Newsgroup: comp.arch

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

On 5/27/2026 11:13 AM, Stephen Fuld wrote:

On 5/27/2026 7:57 AM, Stefan Monnier wrote:

Stephen Fuld [2026-05-26 11:55:11] wrote:

Yes.-a That is what I thought.-a This would be an exception to that in >>> that you would have to fetch and put into the reservation stations,
the eight instructions pointed to be the TT instruction.-a It is those >>> eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.

Oh, I think I understand your proposal.-a You want a kind of predication >> but instead of being a "predication-style `if`" it's a "predication-style >> `case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out >> (i.e. is executed) depends on the value in register Rn?

Yes, with the exception that they don't have to immediately follow what you call the NATPRED instruction (which is actually a modified TT instruction).-a But once the instructions are loaded into the
reservations stations, the result is exactly as you describe.-a And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose which of the instructions to execute.

Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the
functionality to make use of the already existing mechanism to get predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
a VVM loop rule.

The PRED instruction (when used in vVM loop} produces a lane mask so
that different iterations of the loop are executed from the same
starting time.

And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception would be to allow a
third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.

That is what the then-clause and else-clause 'numbers' do--they set
up the vertical lane-mask for that iteration. And each lane calculates
its own lane mask.

Thank you Stefan for helping me see this.

So now the question is, does this save enough to be worth implementing?
I don't know enough to write prototype code for say MATMUL (8) to see
how much the savings would be, nor whether this mechanism could help in other functions. Mitch? Thomas?

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu May 28 19:29:11 2026

From Newsgroup: comp.arch

Yes, with the exception that they don't have to immediately follow what you call the NATPRED instruction (which is actually a modified TT instruction). But once the instructions are loaded into the reservations stations, the result is exactly as you describe. And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose
which of the instructions to execute.

In vVM, all the instructions that make up the loop end up all placed in
the "dataflow" core once forall after which and they are just triggered
several times until the loop exit condition is satisfied.

So "skip over" makes no sense in this context (also because you want
a much shorter delay between "M is known" and "the corresponding
instruction is executed", so you have to decode the instruction(s)
before M is known).

But instead of skipping, You can "predicate away" the undesirable
instructions. So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

=== Stefan
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri May 29 08:50:17 2026

From Newsgroup: comp.arch

On 5/28/2026 11:08 AM, MitchAlsup wrote:

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

snip

Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your
understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the
functionality to make use of the already existing mechanism to get
predicated instructions into the reservation stations and eliminates the
"jump" characteristic which required an exception to the no jump within
a VVM loop rule.

The PRED instruction (when used in vVM loop} produces a lane mask so
that different iterations of the loop are executed from the same
starting time.

And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception would be to allow a
third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.

That is what the then-clause and else-clause 'numbers' do--they set
up the vertical lane-mask for that iteration. And each lane calculates
its own lane mask.

Yes. But the proposed enhancement to my original proposal gives you
another way to specify which instructions get executed. Your original
PRED has a fixed mask to choose which instructions are executed or not,
based on a binary condition test. This enhancement allows choosing
which instruction gets executed in each lane based on the value in a
register. Thus lane 1 could execute instruction 3, lane 2 could execute instruction 5, etc. This is what allows you to gain the effect of
"indexed register accesses" at low cost (I believe one extra cycle.)

My original proposal allows you to execute one instruction based on the
value in a register, i.e. if the register contains the value 3, then the
third instruction is the only one executed. The enhanced version allows
more flexibility. For example, you could allow to specify execute all instructions up to the number in the register. As I said, while the use
case for the basic instruction is clear, emulating register indexing, I
am not sure there are any use cases for the enhancement.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri May 29 09:03:25 2026

From Newsgroup: comp.arch

On 5/28/2026 4:29 PM, Stefan Monnier wrote:

Yes, with the exception that they don't have to immediately follow what you >> call the NATPRED instruction (which is actually a modified TT instruction). >> But once the instructions are loaded into the reservations stations, the
result is exactly as you describe. And, of course, you don't have to "skip >> over" the instructions not executed, you just use the M value to choose
which of the instructions to execute.

In vVM, all the instructions that make up the loop end up all placed in
the "dataflow" core once forall after which and they are just triggered several times until the loop exit condition is satisfied.

Yes. Note that in an earlier post, I changed my proposal to be more in
line with your ideas and the original pred - specifically, the
instructions to be conditionally executed are physically placed inline,
after the NATPRED.

So "skip over" makes no sense in this context (also because you want
a much shorter delay between "M is known" and "the corresponding
instruction is executed", so you have to decode the instruction(s)
before M is known).

My exposition was sloppy. :-(

But instead of skipping, You can "predicate away" the undesirable instructions.

Agree. And clearer exposition.

So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

Yes, size is clearly a limitation. I think it works for MATMUL (8), but probably not for MATMUL (16). I don't know enough to know how practical
it would be to allow more than the current 32 instructions within a VVM
loop. As for performance, I expect (hope) that instructions other than
the one whose position matches the register value will not be executed.
If that can be done, then the extra cost is presumably one cycle, and it
saves (in the MATUL (8) example), executing eight load instructions.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 29 16:55:38 2026

From Newsgroup: comp.arch

Stefan Monnier <monnier@iro.umontreal.ca> posted:

Yes, with the exception that they don't have to immediately follow what you call the NATPRED instruction (which is actually a modified TT instruction). But once the instructions are loaded into the reservations stations, the result is exactly as you describe. And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose which of the instructions to execute.

In vVM, all the instructions that make up the loop end up all placed in
the "dataflow" core once forall after which and they are just triggered several times until the loop exit condition is satisfied.

Correct.

So "skip over" makes no sense in this context (also because you want
a much shorter delay between "M is known" and "the corresponding
instruction is executed", so you have to decode the instruction(s)
before M is known).

Also Correct.

But instead of skipping, You can "predicate away" the undesirable instructions. So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

If the predicated instructions are used in at least 1 iteration, they
are not useless.

=== Stefan

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 29 16:58:42 2026

From Newsgroup: comp.arch

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

On 5/28/2026 11:08 AM, MitchAlsup wrote:

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

snip

Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your
understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the
functionality to make use of the already existing mechanism to get
predicated instructions into the reservation stations and eliminates the >> "jump" characteristic which required an exception to the no jump within
a VVM loop rule.

The PRED instruction (when used in vVM loop} produces a lane mask so
that different iterations of the loop are executed from the same
starting time.

And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception would be to allow a
third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.

That is what the then-clause and else-clause 'numbers' do--they set
up the vertical lane-mask for that iteration. And each lane calculates
its own lane mask.

Yes. But the proposed enhancement to my original proposal gives you
another way to specify which instructions get executed. Your original
PRED has a fixed mask to choose which instructions are executed or not, based on a binary condition test. This enhancement allows choosing
which instruction gets executed in each lane based on the value in a register. Thus lane 1 could execute instruction 3, lane 2 could execute instruction 5, etc. This is what allows you to gain the effect of
"indexed register accesses" at low cost (I believe one extra cycle.)

It is the latency of PARSDE+DECODE

My original proposal allows you to execute one instruction based on the value in a register, i.e. if the register contains the value 3, then the third instruction is the only one executed. The enhanced version allows more flexibility. For example, you could allow to specify execute all instructions up to the number in the register. As I said, while the use case for the basic instruction is clear, emulating register indexing, I
am not sure there are any use cases for the enhancement.

Not sure what the source code would look like in order for the compiler
to recognize this pattern and optimize to your solution.

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 29 17:04:03 2026

From Newsgroup: comp.arch

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

On 5/28/2026 4:29 PM, Stefan Monnier wrote:

----------------------

So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

Yes, size is clearly a limitation.

A 6-wide |u 16-deep execution window would allow between 90-and-96 instructions.

I think it works for MATMUL (8), but probably not for MATMUL (16).

You run out of registers anyway.

I don't know enough to know how practical

it would be to allow more than the current 32 instructions within a VVM loop.

It makes vVM harder for smaller machines.

Oh and BTW, forward FFT is 27 instructions/butterfly iteration.

As for performance, I expect (hope) that instructions other than
the one whose position matches the register value will not be executed.
If that can be done, then the extra cost is presumably one cycle, and it saves (in the MATUL (8) example), executing eight load instructions.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Jun 1 15:45:37 2026

From Newsgroup: comp.arch

On 5/29/2026 9:58 AM, MitchAlsup wrote:

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

snip

My original proposal allows you to execute one instruction based on the
value in a register, i.e. if the register contains the value 3, then the
third instruction is the only one executed. The enhanced version allows
more flexibility. For example, you could allow to specify execute all
instructions up to the number in the register. As I said, while the use
case for the basic instruction is clear, emulating register indexing, I
am not sure there are any use cases for the enhancement.

Not sure what the source code would look like in order for the compiler
to recognize this pattern and optimize to your solution.

Good question. I have thought about it for a while, and though I am far
from a compiler expert, I have come up with a potential solution at
least for the basic proposal.

The idea is if the compiler sees a SWITCH statement where the clauses
that are switched to will compile to one instruction (not counting any instructions need for array addressing (which would be handled outside
the SWITCH, e.g. a loop counter), then it could emit the PREDNAT (or
whatever name is better) followed by the single instructions for each
clause. I am sure this needs more specificity, but I hope you get the idea.

I am still not sure of the benefit of the "enhanced" instruction, and
haven't come up with any reasonable source code that would benefit from it.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Jun 1 14:58:35 2026

From Newsgroup: comp.arch

MitchAlsup [2026-05-29 16:55:38] wrote:

Stefan Monnier <monnier@iro.umontreal.ca> posted:

But instead of skipping, You can "predicate away" the undesirable
instructions. So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

If the predicated instructions are used in at least 1 iteration, they
are not useless.

They may not be useless overall, but they still waste resources at each iteration where they're not used. Traditional predication of an `if`
gives a "50% waste" (for equal size branches or when each branch is
taken as often as the other), whereas a predicated `switch` results in
a waste of `N-1/N`. As N grows larger this becomes discouraging.

=== Stefan
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Jun 1 23:40:38 2026

From Newsgroup: comp.arch

On 6/1/2026 11:58 AM, Stefan Monnier wrote:

MitchAlsup [2026-05-29 16:55:38] wrote:

Stefan Monnier <monnier@iro.umontreal.ca> posted:

But instead of skipping, You can "predicate away" the undesirable
instructions. So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large >>> for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

If the predicated instructions are used in at least 1 iteration, they
are not useless.

They may not be useless overall, but they still waste resources at each iteration where they're not used. Traditional predication of an `if`
gives a "50% waste" (for equal size branches or when each branch is
taken as often as the other), whereas a predicated `switch` results in
a waste of `N-1/N`. As N grows larger this becomes discouraging.

OK, but what exactly are we wasting? We are taking space in the
reservation stations, but we are saving executing actual load
instructions. So the "waste" results in faster execution.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

Who's Online
Recent Visitors
- Geek2
  Thu Jul 2 11:41:05 2026
  from Euclid, Oh via Telnet
- Hannibal
  Thu Jul 2 05:49:27 2026
  from Des Moines via SSH
- Geek2
  Wed Jul 1 16:31:20 2026
  from Euclid, Oh via Telnet
- Hannibal
  Tue Jun 30 16:45:42 2026
  from Des Moines via SSH

System Info

Sysop:	Amessyroom
Location:	Fayetteville, NC
Users:	70
Nodes:	6 (0 / 6)
Uptime:	39:23:01
Calls:	948
Calls today:	2
Files:	1,325
Messages:	280,644

Re: Matmul in VVM

Who's Online

Recent Visitors

System Info