Given the great popularity of the RISC architecture, I assumed that one of its characteristics, instructions that are all 32 bits in length, produced
a great increase in efficiency over variable-length instructions.
Therefore, I came up with the idea of using some opcode space for block headers which could contain information about the lengths of instructions, so as to make decoding variable-length instructions fully non-serialized, thus giving me the best of both worlds.
However, this involved overhead, and the headers would themselves take
time to decode. In any event, all the schemes I came up with were also elaborate and overly complicated.
But I have finally realized what I think is the decisive reason why I had been mistaken.
Before modern pipelined computers, which have multi-stage pipelines for instruction _execution_, a simple form of pipelining was very common - usually in the form of a three-stage fetch, decode, and execute pipeline.
Since the decoding of instructions can be so neatly separated from their execution, and thus performed well in advance of it, any overhead
associated with variable-length instructions becomes irrelevant because it essentially takes place very nearly completely in parallel to execution.
John Savard
John Savard <quadibloc@invalid.invalid> posted:
However, this involved overhead, and the headers would themselves take
time to decode. In any event, all the schemes I came up with were also
elaborate and overly complicated.
As I warned...
But I have finally realized what I think is the decisive reason why I
had been mistaken.
At Last ?!?
Before modern pipelined computers, which have multi-stage pipelines for
instruction _execution_, a simple form of pipelining was very common -
usually in the form of a three-stage fetch, decode, and execute
pipeline.
Since the decoding of instructions can be so neatly separated from
their execution, and thus performed well in advance of it, any overhead
associated with variable-length instructions becomes irrelevant because
it essentially takes place very nearly completely in parallel to
execution.
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
John Savard <quadibloc@invalid.invalid> writes:
Given the great popularity of the RISC architecture, I assumed that one of its characteristics, instructions that are all 32 bits in length, produced a great increase in efficiency over variable-length instructions.
Some RISCs have that, some RISCs have two instruction lengths: 16 bits
and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
but then eliminated in Power), one variant of Berkeley RISC, ARM T32,
RISC-V with the C extension, and probably others.
Before modern pipelined computers, which have multi-stage pipelines for instruction _execution_, a simple form of pipelining was very common - usually in the form of a three-stage fetch, decode, and execute pipeline. Since the decoding of instructions can be so neatly separated from their execution, and thus performed well in advance of it, any overhead associated with variable-length instructions becomes irrelevant because it essentially takes place very nearly completely in parallel to execution.
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually correspond to the end of the previous instruction, but with 16-bit and
32-bit instructions this potentially doubles the number of instruction decoders necessary, plus the circuitry for selecting the ones that are
at actual instruction starts.
I guess that this is the reason why ARM
uses an uop cache in cores that can execute ARM T32. The fact that
more recent ARM A64-only cores have often no uop cache while their
A64+T32 predecessors have had one reinforces this idea.
OTOH, on AMD64/IA-32 Intel's recent E-Cores do not use an uop cache
either, but instead the most recent instances have 3 decoders each of
which can decode 3 instructions per cycle (i.e., they attempt to
decode at many more positions and then select 3 per cycle out of
those); so apparently even byte-oriented variable-length encoding can
be decoded quickly enough.
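As a rough illustration of this decode-everywhere-then-select idea (a sketch only, not taken from any actual core; the one real detail used is the RISC-V convention that a 16-bit parcel whose low two bits are 11 starts a 32-bit instruction), in C:

    #include <stdint.h>
    #include <stdio.h>

    #define PARCELS 8   /* one 128-bit fetch block = 8 16-bit parcels */

    /* Speculative length of the instruction that WOULD start at a parcel:
       low two bits 11 => 32-bit instruction, anything else => 16-bit. */
    static int spec_len(uint16_t parcel)
    {
        return ((parcel & 0x3) == 0x3) ? 2 : 1;   /* length in parcels */
    }

    int main(void)
    {
        uint16_t fetch[PARCELS] = {0x4501, 0x0533, 0x8082, 0x4585,
                                   0x05b7, 0x0000, 0x9002, 0x0001};  /* arbitrary example bits */
        int len[PARCELS], start[PARCELS];

        /* Stage 1: every position decoded speculatively (in parallel in hardware). */
        for (int i = 0; i < PARCELS; i++)
            len[i] = spec_len(fetch[i]);

        /* Stage 2: the select chain keeps only real instruction starts. */
        for (int i = 0, next = 0; i < PARCELS; i++) {
            start[i] = (i == next);
            if (start[i])
                next = i + len[i];
        }

        for (int i = 0; i < PARCELS; i++)
            printf("parcel %d: %2d bits, start=%d\n", i, 16 * len[i], start[i]);
        return 0;
    }

In hardware the second loop is the select chain; the point is simply that every parcel can be handed to a decoder before it is known which parcels are real instruction starts.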
- anton
According to John Savard <quadibloc@invalid.invalid>:
Therefore, I came up with the idea of using some opcode space for
block headers which could contain information about the lengths of instructions, so as to make decoding variable-length instructions
fully non-serialized, thus giving me the best of both worlds.
Sounds like the first two bits of the opcode in S/360, which tell you
the instruction format and therefore how long the instruction is.
They've added lots of new instructions since then with somewhat
different formats, but those bits still tell you how long the
instruction is. The first byte tells you what the format is so you
know what address calculations to do.
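For concreteness, that rule fits in a few lines of C (a sketch; the 2/4/4/6-byte lengths for length codes 00/01/10/11 are the only real S/360 detail used):

    #include <stdint.h>

    /* Length of a System/360 instruction from the two high-order bits
       of the first (opcode) byte: 00 -> 2 bytes (RR), 01/10 -> 4 bytes,
       11 -> 6 bytes. */
    static int s360_length(uint8_t opcode)
    {
        static const int len[4] = { 2, 4, 4, 6 };
        return len[opcode >> 6];
    }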
On Fri, 19 Dec 2025 03:30:26 -0000 (UTC)
John Levine <johnl@taugh.com> wrote:
According to John Savard <quadibloc@invalid.invalid>:
Therefore, I came up with the idea of using some opcode space for
block headers which could contain information about the lengths of instructions, so as to make decoding variable-length instructions
fully non-serialized, thus giving me the best of both worlds.
Sounds like the first two bits of the opcode in S/360 which tells you
the instruction format which also tells you how long the instruction
is.
They've added lots of new instructions since then with somewhat
different formats, but those bits still tell you how long the
instruction is. The first byte tells you what the fornat is so you
know what address calculations to do.
With the very long pipelines that IBM has been using starting from the z10 (17 years
ago), it probably makes no difference.
The fact that there are only 3 options for instruction length is
important and simplifies things relative to the more than a dozen
options in x86, but how many bits one has to examine in order to
determine the length of an instruction is irrelevant, or close to
irrelevant, as long as they all reside near the beginning rather than
anywhere in the instruction, as in the VAX.
On 12/18/2025 4:25 PM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
Given the great popularity of the RISC architecture, I assumed that one of its characteristics, instructions that are all 32 bits in length, produced a great increase in efficiency over variable-length instructions.
Some RISCs have that, some RISCs have two instruction lengths: 16 bits
and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
but then eliminated in Power), one variant of Berkeley RISC, ARM T32, RISC-V with the C extension, and probably others.
I have come to realize that 32/64 is probably better than 16/32 here, primarily in terms of performance, but also helps with code-density (a
pure 32/64 encoding scheme can beat 16/32 in terms of code-density
despite only having larger instructions available).
One could argue "But MOV is less space efficient", can note that it also makes sense to try to design the compiler to minimize the number of unnecessary MOV instructions and similar (and when using the minimal
number of register moves, the lack of a small MOV encoding has less
effect on code density).
16/32/64 is also sensible, but the existence of 16-bit ops negatively
affects encoding space (it is more of a strain to have both 16-bit ops
and 6-bit register fields; but at least some code can benefit from
having 64 GPRs).
So, say:
16/32: RV64GC (OK code density)
16/32/64: RV64GC+JX: Better code density than RV64GC.
32/64: RV64G+JX (seemingly slightly beats RV64GC)
But, not as much as GC+JX.
16/32/64/96: XG1 (still best code density).
32/64/96: XG2 and XG3;
Also good for code density;
Somehow XG3 loses to XG2 despite being nearly 1:1;
Though, XG3 has mostly claimed the performance crown.
Or, descending, code-density:
XG1, RV64GC+JX, XG2, RV64G+JX, XG3, RV64GC, RV64G
And, performance:
XG3, XG2, RV64G+JX, XG1, RV64GC+JX, RV64G, RV64GC
Where, both the 16-bit ops and some lacking features (in RV64G and
RV64GC) negatively affect things.
Where, the main things that benefit JX here being:
  Jumbo prefixes, extending Imm12/Disp12 to 33 bits;
  Indexed Load/Store;
  Load/Store Pair;
  Re-adding ADDWU/SUBWU and similar.
The Zba instructions also help,
  but Load/Store pair greatly reduces effect of Zba.
It would be possible to get better code density than 'C' with some tweaks:
Reducing many of the imm/disp fields by 1 bit;
Would free up a lot of encoding space.
Imm6/Disp6 eats too much encoding space here.
Making most of the register fields 4 bits (X8..X23)
Can improve hit-rate notably over Reg3.
But:
Main merit of 'C' is compatibility with binaries that use 'C';
This merit would be lost by modifying or replacing 'C'.
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually correspond to the end of the previous instruction, but with 16-bit and 32-bit instructions this potentially doubles the amount of instruction decoders necessary, plus the circuit for selecting the ones that are
at actual instruction starts. I guess that this is the reason why ARM
uses an uop cache in cores that can execute ARM T32. The fact that
more recent ARM A64-only cores have often no uop cache while their
A64+T32 predecessors have had one reinforces this idea.
I took the option of not bothering with parallel execution for 16-bit ops.
Even if 16-bit ops could be superscalar though, the benefits would be
small: Code patterns that favor 16-bit ops also tend to be lower in
terms of available ILP.
Decoding at 2 or 3 wide seems to make the most sense:
Gets a nice speedup over 1;
Works with in-order.
Here, 3 is slightly better than 2.
But, getting that much benefit from going any wider than this, is likely
to require some amount of "heavy lifting".
So, while a 4 or 5 wide in-order design could be possible, pretty much
no normal code is going to have enough ILP to make it worthwhile over 2
or 3.
Also 2 or 3 works reasonably well with a 96-bit fetch:
One trick here could be to precompute a lot of this when fetching cache
lines, though a full instruction length could not be determined at fetch
time if the instruction crosses a cache line unless we have also fetched
the next cache line. Full instruction length could be determined in
advance (at fetch time) if it always fetches both cache lines and then
determines the lengths for one of them before writing to the cache
(possibly, if the next line is fetched, its contents are not written to
the cache, as lengths can't be fully determined yet).
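A minimal C sketch of that predecode-at-fill idea (illustrative only; the 16-bit-parcel length rule and the side array holding the predecoded lengths are assumptions, not the actual design). With a rule that needs only the first parcel nothing spills across lines; an encoding whose length-determining bits can extend into the next line would need the extra handling described above:

    #include <stdint.h>

    #define LINE_PARCELS 16   /* e.g. a 32-byte I-cache line = 16 parcels */

    /* Computed once when the line is filled and stored alongside it, so
       the fetch/decode stages only read these bits instead of re-deriving
       instruction lengths on every access. */
    void predecode_line(const uint16_t parcel[LINE_PARCELS],
                        uint8_t len_bits[LINE_PARCELS])
    {
        for (int i = 0; i < LINE_PARCELS; i++)
            len_bits[i] = ((parcel[i] & 0x3) == 0x3) ? 32 : 16;
    }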
BGB <cr88192@gmail.com> posted:
On 12/18/2025 4:25 PM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
Given the great popularity of the RISC architecture, I assumed that one of its characteristics, instructions that are all 32 bits in length, produced a great increase in efficiency over variable-length instructions.
Some RISCs have that, some RISCs have two instruction lengths: 16 bits
and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
but then eliminated in Power), one variant of Berkeley RISC, ARM T32,
RISC-V with the C extension, and probably others.
I have come to realize that 32/64 is probably better than 16/32 here,
primarily in terms of performance, but also helps with code-density (a
pure 32/64 encoding scheme can beat 16/32 in terms of code-density
despite only having larger instructions available).
My 66000 does not even bother with 16-bit instructions--and still ends
up requiring a lower instruction count than RISC-V. {32, 64, 96, 128, 160}
bits are the instruction sizes; with no instruction ever requiring constants
to be assembled.
One could argue "But MOV is less space efficient", can note that it also
makes sense to try to design the compiler to minimize the number of
unnecessary MOV instructions and similar (and when using the minimal
number of register moves, the lack of a small MOV encoding has less
effect on code density).
Most of the MOV instructions in My 66000 are found::
a) before a call--moving values to argument positions,
b) after a call--moving results to post-call positions,
c) around loops --moving values for next loop iteration.
16/32/64 is also sensible, but the existence of 16-bit ops negatively
effects encoding space (it is more of a strain to have both 16-bit ops
and 6-bit register fields; but at least some code can benefit from
having 64 GPRs).
I agree that RISC-V has too many 16-bit instructions, and that it gains
too little in the code density department by having them.
So, say:
16/32: RV64GC (OK code density)
16/32/64: RV64GC+JX: Better code density than RV64GC.
32/64: RV64G+JX (seemingly slightly beats RV64GC)
But, not as much as GC+JX.
16/32/64/96: XG1 (still best code for density).
32/64/96: XG2 and XG3;
Also good for code density;
Somehow XG3 loses to XG2 despite being nearly 1:1;
Though, XG3 has mostly claimed the performance crown.
Or, descending, code-density:
XG1, RV64GC+JX, XG2, RV64G+JX, XG3, RV64GC, RV64G
And, performance:
XG3, XG2, RV64G+JX, XG1, RV64GC+JX, RV64G, RV64GC
Rather than tracking code density--which measures cache performance--
I have come to think that counting instructions themselves is the key.
If the instruction is present then it has to be executed; if not, then
it was free !! in all real senses.
Where, both the 16-bit ops, and some lacking features (in RV64G and
RV64GC) negatively effecting things.
Like a reasonable OpCode layout.
Where, the main things that benefit JX here being:
        I have no prefixes {well CARRY}
  Jumbo prefixes, extending Imm12/Disp12 to 33 bits;
        -#Imm5, Imm16, Imm32, Imm64, Disp16, Disp32, Disp64
  Indexed Load/Store;
        check
  Load/Store Pair;
        LDM, STM, ENTER, EXIT, MM, MS
  Re-adding ADDWU/SUBWU and similar.
        {int,float} × {OpCode} × {Byte, Half, Word, DBLE}
The Zba instructions also help,
        Which is why -#imm5 works better.
  but Load/Store pair greatly reduces effect of Zba.
It would be possible to get better code density than 'C' with some tweaks:
  Reducing many of the imm/disp fields by 1 bit;
Would free up a lot of encoding space.
Imm6/Disp6 eats too much encoding space here.
Making most of the register fields 4 bits (X8..X23)
Can improve hit-rate notably over Reg3.
But:
Main merit of 'C' is compatibility with binaries that use 'C';
This merit would be lost by modifying or replacing 'C'.
I can still fit my entire ISA into the space vacated by C.
----------------------
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction, but with 16-bit and
32-bit instructions this potentially doubles the amount of instruction
decoders necessary, plus the circuit for selecting the ones that are
at actual instruction starts. I guess that this is the reason why ARM
uses an uop cache in cores that can execute ARM T32. The fact that
more recent ARM A64-only cores have often no uop cache while their
A64+T32 predecessors have had one reinforces this idea.
I took the option of not bothering with parallel execution for 16-bit ops.
I took the option of not bothering with 16-bit Ops.
-----------------------
Even if 16-bit ops could be superscalar though, the benefits would be
small: Code patterns that favor 16-bit ops also tend to be lower in
terms of available ILP.
I suspect that argument setup before and result take-down after call
would have quite a bit of parallelism.
I suspect that moving fields around for the next loop iteration would
have significant parallelism.
------------------------------
Decoding at 2 or 3 wide seems to make the most sense:
Gets a nice speedup over 1;
Works with in-order.
Here, 3 is slightly better than 2.
But, getting that much benefit from going any wider than this, is likely
to require some amount of "heavy lifting".
Probably not conducive to FPGA implementations due to LUT count and
special memories {predictors, ..., TLBs, staging buffers, ...}
So, while a 4 or 5 wide in-order design could be possible, pretty much
no normal code is going to have enough ILP to make it worthwhile over 2
or 3.
1-wide 0.7 IPC
2-wide 1.0 IPC gain of 50%
3-wide 1.4 IPC gain of 40%
6-wide 2.2 IPC gain of 50% from doubling the width
10-wide 3.2 IPC gain of 50% from almost doubling width
Also 2 or 3 works reasonably well with a 96-bit fetch:
But Fetches are 128-bits wide !!! and the average instruction is 35-bits wide.
------------------------
One trick here could be to precompute a lot of this when fetching cache
lines, though a full instruction length could not be determined at fetch
time if the instruction crosses a cache line unless we have also fetched
the next cache line. Full instruction length could be determine in
advance (at fetch time) if it always fetches both cache-lines and then
determines the lengths for one of them before writing to the cache
(possibly if the next line is fetched, it contents are not written to
the cache as lengths can't be fully determined yet).
All of the above was solved in Athlon, and then made 3× smaller in Opteron at the cost of 1 pipe stage in DECODE.
On 2025-12-19 6:36 p.m., MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
So, while a 4 or 5 wide in-order design could be possible, pretty much
no normal code is going to have enough ILP to make it worthwhile over 2
or 3.
1-wide 0.7 IPC
2-wide 1.0 IPC gain of 50%
3-wide 1.4 IPC gain of 40%
6-wide 2.2 IPC gain of 50% from doubling the width
10wide 3.2 IPC gain of 50% from almost doubling width
Also 2 or 3 works reasonably well with a 96-bit fetch:
But Fetches ae 128-bits wide !!! and the average instruction is 35-
bits wide.
Could the average instruction size be an argument for the use of wider
(40-bit) instructions? One would think that the instruction should be a
bit wider than average. Been trying to shrink Qupls4 instructions down to
40 bits for Qupls5. The odd size is not that great an issue if variable
lengths are supported.
Robert Finch <robfi680@gmail.com> schrieb:
Could the average instruction size be an argument for the use of wider
(40-bits) instructions? One would think that the instruction should be a
bit wider than average. Bin trying to shrink Qupls4 instructions down to
40-bits for Qupls5. The odd size is not that great an issue if variable
lengths are supported.
Can you show a few examples of what these wide instructions are used
for? Seems like a lot of bits to me...
On 2025-12-20 5:47 a.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Could the average instruction size be an argument for the use of wider
(40-bits) instructions? One would think that the instruction should be a
bit wider than average. Bin trying to shrink Qupls4 instructions down to
40-bits for Qupls5. The odd size is not that great an issue if variable
lengths are supported.
Can you show a few examples of what these wide instructions are used
for? Seems like a lot of bits to me...
The average instruction size being about 35 bits most likely comes
from including immediate constant bits. 32-bits works for most things,
if one is willing to increase the dynamic instruction count.
But extra bits can be used for selecting small immediates, vector register selection.
   7 opcode
   6 dest reg + sign control
   6 src1 reg + sign control
   6 src2 reg + sign control
   7 func code
  -------
  32
Qupls4 has the additional bits for
   6 src3 reg + sign control
   4 vector register select
   3 small immediate select
   3 second ALU op
  ---
  16
For immediates Qupls4 has
   7 opcode
   6 dst reg
   6 src1 reg
   2 precision control
  27 bit immediate constant
Having three source registers allows: fused multiply add, bit field ops,
three input adds, multiplex, and a few others. There are some
instructions with four source registers (fused dot product) or two
destinations (carry outputs / fp status).
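Just to make the field budget concrete, the two 48-bit layouts listed above can be written as C bitfields (purely illustrative; the field order and packing are not claimed to match the real Qupls4 encoding):

    struct qupls4_rr {          /* 32-bit base ...                 */
        unsigned opcode  : 7;
        unsigned dst     : 6;   /* dest reg + sign control         */
        unsigned src1    : 6;   /* src1 reg + sign control         */
        unsigned src2    : 6;   /* src2 reg + sign control         */
        unsigned func    : 7;
                                /* ... plus the 16-bit extension   */
        unsigned src3    : 6;   /* src3 reg + sign control         */
        unsigned vecsel  : 4;   /* vector register select          */
        unsigned immsel  : 3;   /* small immediate select          */
        unsigned alu2    : 3;   /* second ALU op                   */
    };

    struct qupls4_imm {         /* 48-bit immediate form           */
        unsigned opcode  : 7;
        unsigned dst     : 6;
        unsigned src1    : 6;
        unsigned prec    : 2;   /* precision control               */
        unsigned imm     : 27;  /* immediate constant              */
    };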
I have come to realize that 32/64 is probably better than 16/32 here,
primarily in terms of performance, but also helps with code-density (a
pure 32/64 encoding scheme can beat 16/32 in terms of code-density
despite only having larger instructions available).
My 66000 does not even bother with 16-bit instructions--and still ends
up requiring fewer instruction count than RISC-V. {32, 64, 96, 128, 160} are the instruction sizes; with no instructions ever requiring constants
to be assembled.
Indeed, My 66000 aims for "fat" instructions so as to try and reduce
instruction counts. That should hopefully result in an efficient ISA:
fewer instructions should cost less runtime resources (as long as they
don't get split into more uops).
Most of the MOV instructions in My 66000 are found::[...]
a) before a call--moving values to argument positions,
b) after a call--moving results to post-call positions,
c) around loops --moving values for next loop iteration.
I suspect that argument setup before and result take-down after call
would have quite a bit of parallelism. I suspect that moving fields
around for the next loop iteration would have significant parallelism.
Are you saying that you expect the efficiency of My 66000 could be
improved by adding some way to express those moves in a better way?
A key element of the Mill is/was its ability to "permute" its belt
elements in a single cycle. I still don't fully understand how this is encoded in the ISA and implemented in hardware, but it sounds like
you're hinting in the same direction: some kind of "parallel move" instruction with many inputs and many outputs.
Stefan
For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
In terms of encoding, these are fairly easy and could each fit within
a 32bit instruction.
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
IIUC these could have any number of registers and the destination and
source regs can be "anything", so the encoding would take up more space. Arguably it might be possible in many/most cases to arrange for
{Rm,Rn,Rj} to be {R1..Rn}, so it might be able to use the same
instruction as the call-setup.
I just can't see how to make these run reasonably fast within the constraints of the GBOoO Data Path.
Hmm... One would hope this can be handled entirely in the renamer
without touching the actual data path, but ... sorry: if you don't know
how to do it, I sure don't either.
Stefan
Stefan Monnier <monnier@iro.umontreal.ca> posted:
For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
In terms of encoding, these are fairly easy and could each fit within
a 32bit instruction.
You are going to put 6×5-bit fields in a single 32-bit instruction with
a 6-bit Major OpCode ?!?! I would like to see it done. Remember: all
specifiers are in the first 32-bits of the "instruction"; only constants
are used as Variable Length.
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
IIUC these could have any number of registers and the destination and
source regs can be "anything", so the encoding would take up more space.
Arguably it might be possible in many/most cases to arrange for
{Rm,Rn,Rj} to be {R1..Rn}, so it might be able to use the same
instruction as the call-setup.
In principle I buy this argument:: in practice I can't see it happening.
I can see an encoding that would provide a "bunch of MOVs/Renames"
but only if I disobey a principal tenet of ISA encoding {One that RISC-V
threw away on day 1} and that is: the register specification fields are
at fixed locations. It is this tenet that removed some <arguably thin>
logic before multiplexing the specifiers into the RF decoder. The fixed
position argument has neither the logic nor the multiplexer; RF specifiers
are wired directly to the RF/Renamer decoder ports.
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Hmm... One would hope this can be handled entirely in the renamer
without touching the actual data path, but ... sorry: if you don't know
how to do it, I sure don't either.
Once one goes beyond the 3-operand 1-result property, all sorts of little
things start to break--like multiplexing the RF specifiers. The Data-Path
and the Register/Renamer ports are all designed to this FMAC requirement,
giving us CMOV and INSert instructions with reasonable encodings.
Right now, there are no register specifiers in the variable length part
of the ISA--just constants.
It is also not exactly clear how one "makes" an instruction with {2,3,4,5,6,7}
writes traverse the pipeline smoothly. I gave serious consideration to finding
a smooth solution to even {2} results, and for this I built an accumulator
attached to the 3-operand+1-result function units where the added operand is
read once (if needed) and written once (if needed), often not requiring ANY
RF activity in support of the CARRY variable itself.
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
I made a 36-bit ISA a while ago on the notion of average instruction
size (18-bit compressed instructions). Writing the assembler for it was something special. Took about 10x as much effort as a byte-oriented one.
So, I am not keen on non-byte sizes. But maybe another 36-bitter. Takes
a bit to get used to addressing for the instruction pointer.
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken here...
when I read this, I thought that there was a standard technique for doing stuff like that in a GBOoO machine.
Just break down all the fancy instructions into RISC-style pseudo-ops. But apparently, since you would know all about that, there must be a reason why it doesn't apply in these cases.
John Savard
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no different than any other calculation, except that no mangling of the
bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you would
know all about that, there must be a reason why it doesn't apply in these
cases.
x86 has short/small MOV instructions. Not so with RISCs.
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
However, this involved overhead, and the headers would themselves take
time to decode. In any event, all the schemes I came up with were also
elaborate and overly complicated.
As I warned...
I wasn't blind - of course I knew that all along. But what I still
failed to see was any good alternative.
But I have finally realized what I think is the decisive reason why I
had been mistaken.
At Last ?!?
On further reflection, I think I had already realized that decoding is done
ahead of execution, and thus can be thought of as mostly done in parallel
with it. But decoding still has to be done *first*, before execution can
start. So I felt that in a design where super-aggressive pipelining or
vectorization allows many instructions to be done in parallel, if decoding
is necessarily serial, it could still become a bottleneck.
Before modern pipelined computers, which have multi-stage pipelines for
instruction _execution_, a simple form of pipelining was very common -
usually in the form of a three-stage fetch, decode, and execute
pipeline.
Since the decoding of instructions can be so neatly separated from
their execution, and thus performed well in advance of it, any overhead
associated with variable-length instructions becomes irrelevant because
it essentially takes place very nearly completely in parallel to
execution.
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
No.
If you flipped "decode" and "execute" in that sentence above, I would 100% agree. And maybe this _is_ just a typo.
But if you actually did mean that sentence exactly as written, I would disagree. This is why: I regard executing instructions as 'doing the
actual work' and decoding instructions as... some unfortunate trivial overhead that can't be avoided.
Hence, if I can decode instructions much faster than I can execute them... _possibly_ the decoder is overdesigned, but it's also perfectly possible that there isn't really a slower decoder design that would make sense.
And maybe this perspective _explains_ why I dabbled in elaborate schemes
to allow decoding in parallel. I absolutely refused to allow decoding to become a bottleneck, no matter how aggressively OoO the execution part is designed for breakneck speed at all costs.
John Savard
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for doing stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no different than any other calculation, except that no mangling of the
bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you would know all about that, there must be a reason why it doesn't apply in these cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so-called LOCK MOV? For some damn reason I remember something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc.
For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
In terms of encoding, these are fairly easy and could each fit within
a 32bit instruction.
You are going to put 6×5-bit fields in a single 32-bit instruction with
a 6-bit Major OpCode ?!?!
I would like to see it done.
I can see an encoding that would provide a "bunch of MOVs/Renames"
but only if I disobey a principle tenet of ISA encoding {One that RISC-V threw away on day 1} and that is; the register specification fields are
at fixed locations. It is this tenet that removed some <arguably thin>
logic before multiplexing the specifiers into the RF decoder. The fixed position argument has neither the logic nor the multiplexer, RF specifiers are wired directly to the RF/Renamer decoder ports directly.
It is also not exactly clear how one "makes" an instruction with {2,3,4,5,6,7} writes traverse the pipeline smoothly. I took serious consideration to find an smooth solution to even {2} results, and for
this I built an accumulator attached to the 3-operand+1-result
function units where the added operand is read once (if needed) and
written once (if needed) often not requiring ANY RF activity in
support of the CARRY variable itself.
The way I see it, the problem is that after
MULTIMOVE {R1,R2} <= {R6,R8}
the preceding instruction which generated a result into R6 now needs to
put the result into both R6 and R1.
Maybe a way to avoid that problem
is to make the renaming architectural. I.e. add a "register renaming
table" (RRT), and introduce the instruction RENAME which changes
that RRT. Whenever an instruction wants to read register Rn, the actual
architectural register we'll read is obtained by passing `n` through RRT.
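A toy C model of that architectural rename-table idea (entirely a sketch of the proposal, not an existing ISA feature):

    #include <stdint.h>

    #define NREGS 32

    typedef struct {
        uint8_t  map[NREGS];    /* architectural name -> storage cell */
        uint64_t cell[NREGS];   /* the values themselves              */
    } rrt_state;

    static uint64_t read_reg(const rrt_state *s, int n)
    {
        return s->cell[s->map[n]];
    }

    static void write_reg(rrt_state *s, int n, uint64_t v)
    {
        s->cell[s->map[n]] = v;
    }

    /* RENAME {R1,R2} <= {R6,R8}: two map updates, no data movement.
       Note that R1 and R6 now name the same cell, which is exactly the
       "who later writes R6?" complication raised above. */
    static void rename2(rrt_state *s, int d1, int s1, int d2, int s2)
    {
        s->map[d1] = s->map[s1];
        s->map[d2] = s->map[s2];
    }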
On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,
Oh, yes, I had always realized that, but dismissed it as far too wasteful.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no
different than any other calculation, except that no mangling of the
bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you would know all about that, there must be a reason why it doesn't apply in these cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect..
The 2-operand+displacement LD/STs have a lock bit in the instruction--that
is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and it's ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory, I will need an Instruction-Modifier {A.K.A. a prefix}.
John Savard <quadibloc@invalid.invalid> wrote:
On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,
Oh, yes, I had always realized that, but dismissed it as far too wasteful.
Well, Mitch claims an average of 35 bits per instruction, which means
about 90% utilization of the decoders, so not bad. Probably more
waste is due to the muxes needed to shift instructions into the right
positions, but since you allow variant encodings you need
muxes too.
Also, consider that the alternative to variable length instructions
is to use longer instructions or more of them. In the case of constants,
a classic RISC needs more instructions to assemble a constant than the
variable length encoding needs extra words. So a classic RISC is
going to need more "decode events" than a machine using a variable
length encoding with 32-bit units, even though all "decode events"
are "useful" on the RISC and some "decode events" on the variable length
machine are discarded.
Maybe a way to avoid that problem
is to make the renaming architectural. I.e. add a "register renaming
table" (RRT), and introduce the instruction RENAME which changes
that RRT. Whenever an instruction wants to read register Rn, the actual
architectural register we'll read is obtained by passing `n` through RRT.
All of that happens with microarchitectural renaming (your RRT is
called RAT (register alias table), however). Your "RENAME"
instruction is called "MOV". Why make the RAT architectural?
Maybe a way to avoid that problem
is to make the renaming architectural. I.e. add a "register renaming
table" (RRT), and introduce the instruction RENAME which changes
that RRT. Whenever an instruction wants to read register Rn, the actual
architectural register we'll read is obtained by passing `n` through RRT.
All of that happens with microarchitectural renaming (your RRT is
called RAT (register alias table), however). Your "RENAME"
instruction is called "MOV". Why make the RAT architectural?
Good question. I was just reacting to Mitch who seemed to say that one
of the main problems with a multi-move instruction is that it has too
many outputs and that doesn't fit into the general design, so by making
the RRT/RAT architectural it makes the instruction single-output.
I don't know if in practice it would make any difference.
Stefan
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to instruction complexity: if you support 16bit instructions, it means you support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
Instead, the My 66000 ISA tries to make instructions fatter, so as to
reduce the number of instructions rather than the size of each instruction. And the idea is that this applies both to static and to dynamic counts.
That's why Mitch includes negation and sign-extension directly inside
every arithmetic instruction. The hope is that they don't increase the critical path (in the combinatory logic of a single cycle), or they
increase it less than the corresponding decrease in the other critical
path (the one in the dataflow graph of instructions).
Another way to look at it: For the execution of any specific
instruction, we spend N1 gate-delays on useful work, N2 gate-delays
waiting for the end of the cycle (because the duration of cycle is based
on the maximum of all possible N1s), and N3 gate-delays on latching.
Fatter instructions are a way to try and reduce N2 and the number of
times we pay N3.
I wish I knew how to make an ISA where the single cycle instructions
can perform even more work like two or more dependent additions.
[ I mean, I know of ways to do it, but they all tend to increase N2
much too much on average. ]
Stefan
Stefan Monnier <monnier@iro.umontreal.ca> posted:
I wish I knew how to make an ISA where the single cycle instructions
can perform even more work like two or more dependent additions.
Data-General Nova. However, with a modern RISC-like ISA, there are not
enough small-shifts to amortize--except in the memory addressing arena
where scaled indexing saves instructions and cycles.
MitchAlsup [2025-12-28 17:59:02] wrote:
Stefan Monnier <monnier@iro.umontreal.ca> posted:
I wish I knew how to make an ISA where the single cycle instructions
can perform even more work like two or more dependent additions.
Data-General Nova. However, with a modern RISC-like ISA, there are not
enough small-shifts to amortize--except in the memory addressing arena
where scaled indexing saves instructions and cycles.
My thoughts were something along the lines of having fat instructions
like a 3-in 2-out 2-op instruction that does:
Rd1 <= Rs1 OP1 Rs2;
Rd2 <= Rs3 OP2 Rd1;
so your datapath has two ALUs back to back in a single cycle.
And the problem is that it's often hard to find something useful to do in that
OP2.
To increase the use of OP2 you need to allow as many combinations
of OP1 and OP2 as possible, and that quickly bumps into the constraint that
OP1+OP2 are done in a single cycle, so neither OP1 nor OP2 can usefully be
memory accesses or control flow operations.
Those 2 ALUs would likely lengthen the cycle by significantly more than
your single gate of delay, so it's important for OP2 to do useful work
most of the time, otherwise we just increased the average N2.
[ And then there's the impact of 3-in 2-out on the pipeline, and the fact
that such a multi-op instruction doesn't fit in 32bit, of course. ]
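A toy C model of the 3-in/2-out semantics being discussed (opcode names and the register-file representation are invented purely for illustration):

    #include <stdint.h>

    typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR } op_t;

    static uint64_t alu(op_t op, uint64_t a, uint64_t b)
    {
        switch (op) {
        case OP_ADD: return a + b;
        case OP_SUB: return a - b;
        case OP_AND: return a & b;
        case OP_OR:  return a | b;
        default:     return a ^ b;
        }
    }

    /* One architectural step of  Rd1 <= Rs1 OP1 Rs2;  Rd2 <= Rs3 OP2 Rd1.
       In hardware the two ALUs sit back to back in the same cycle. */
    static void fused2(uint64_t r[32], op_t op1, op_t op2,
                       int rd1, int rd2, int rs1, int rs2, int rs3)
    {
        uint64_t t = alu(op1, r[rs1], r[rs2]);   /* first ALU             */
        r[rd1] = t;
        r[rd2] = alu(op2, r[rs3], t);            /* second, dependent ALU */
    }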
Stefan
On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
John Savard <quadibloc@invalid.invalid> wrote:
On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually correspond to the end of the previous instruction,
Oh, yes, I had always realized that, but dismissed it as far too
wasteful.
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
Also, consider that alternative to variable length instructions is to
use longer instructions or more of them.
What I did instead was use variable-length instructions, but add a prefix
at the beginning of any 256-bit block of instructions that contained them,
which directly showed where each instruction began.
My intent was to avoid the disadvantages you identify for fixed-length instructions, but avoid the disadvantage of variable-length instructions
too.
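For concreteness, a sketch of how such a prefix could be consumed (my
guess at a format, not necessarily Savard's: one bit per 16-bit parcel
of the 256-bit block, set where an instruction starts). Once the mask is
read, every start offset is known before any instruction is examined,
so the decoders can all go to work in parallel:

    #include <stdint.h>

    /* start_mask: bit i set => parcel i (16 bits) begins an instruction.
       Returns the number of instructions and their bit offsets within
       the 256-bit block. */
    static int find_starts(uint16_t start_mask, int starts[16])
    {
        int n = 0;
        for (int i = 0; i < 16; i++)
            if (start_mask & (1u << i))
                starts[n++] = i * 16;
        return n;
    }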
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here... when I read this, I thought that there was a standard technique
for doing stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no
different than any other calculation, except that no mangling of the
bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you would
know all about that, there must be a reason why it doesn't apply in these
cases.
x86 has short/small MOV instructions. Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, etc.
The 2-operand+displacement LD/STs have a lock bit in the instruction--that
is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and it's ESM, not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On 12/28/2025 8:22 AM, John Savard wrote:
On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
John Savard <quadibloc@invalid.invalid> wrote:
On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,
Oh, yes, I had always realized that, but dismissed it as far too
wasteful.
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
Also, consider that alternative to variable length instructions is to
use longer instructions or more of them.
What I did instead was use variable-length instructions, but add a prefix at the beginning of any 256-bit block of instructions that contained them which directly showed where each instruction began.
My intent was to avoid the disadvantages you identify for fixed-length instructions, but avoid the disadvantage of variable-length instructions too.
I understand your goal, however . . .
How many bits do you "waste" on the prefix?
Since I think any branch must target the beginning of a block, and in general, a routine will not end on a block boundary, there will be
"wasted" bits at the end of the last block before a "label". Have you determined for a "typical" program, how many bits are wasted due to this?
The point I am making is that you will "cancel" at least some of the
savings of 16 bit instructions. You should take this into account
before committing to your plan.
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the >>>>>> constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken >>>>> here...
when I read this, I thought that there was a standard technique for >>>>> doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no >>>> different than any other calculation, except that no mangling of the >>>> bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you >>>>> would
know all about that, there must be a reason why it doesn't apply in >>>>> these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember >>> something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect..
The 2-operand+displacement LD/STs have a lock bit in the instruction--
that
is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier >> {A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
impl. If we are on another system and that LOCK XADD is some sort of LL/SC
"style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the >>>>>>>> constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken >>>>>>> here...
when I read this, I thought that there was a standard technique for >>>>>>> doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no >>>>>> different than any other calculation, except that no mangling of the >>>>>> bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you >>>>>>> would
know all about that, there must be a reason why it doesn't apply in >>>>>>> these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, etc.
The 2-operand+displacement LD/STs have a lock bit in the instruction--that
is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier >>>> {A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
impl. If we on another system and that LOCK XADD is some sort of LL/SC
"style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size [address]
ADD Rd,--,#1
Will attempt an ATOMIC add to the L1 cache. If the line is writeable, the ADD is
performed and the line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state, the add is performed and
the previous value returned. If [address] is not writeable, the cache
line is invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified}, which is typical.}
When [address] reaches the Memory-Controller it is scheduled in arrival
order, other caches system-wide will receive CI, and modified lines
will be pushed back to the DRAM-Controller. When the CI is "performed" the MC/
DRC will perform the add of #1 to [address] and the previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
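For reference, the software-level shape of the operation being discussed,
in C11 (an editor's example, not My 66000 code): on x86 this can lower to
a single LOCK XADD, and on an ISA with an add-to-memory like the one above
it can likewise stay a single, loop-free instruction; on a pure LL/SC
machine it generally becomes a retry loop instead.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Returns the previous value, like XADD / the ATOMIC ADD above. */
    static uint64_t fetch_add_prev(_Atomic uint64_t *ctr, uint64_t v)
    {
        return atomic_fetch_add_explicit(ctr, v, memory_order_seq_cst);
    }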
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:The 2-operand+displacement LD/STs have a lock bit in the instruction-- >>>>> that
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the >>>>>>>>> constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm >>>>>>>> mistaken
here...
when I read this, I thought that there was a standard technique for >>>>>>>> doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no >>>>>>> different than any other calculation, except that no mangling of the >>>>>>> bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you >>>>>>>> would
know all about that, there must be a reason why it doesn't apply in >>>>>>>> these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I
remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect.. >>>>>
is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-
Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless >>> impl. If we on another system and that LOCK XADD is some sort of LL/SC
"style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size  [address]
ADD                    Rd,--,#1
Will attempt a ATOMIC add to L1 cache. If line is writeable, ADD is
performed and line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state the add is performed and
the previous value returned. If [address] is not writeable the cache
line in invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is typical.}
When [address] reached Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 12/28/2025 8:22 AM, John Savard wrote:
On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
John Savard <quadibloc@invalid.invalid> wrote:
On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,
Oh, yes, I had always realized that, but dismissed it as far too
wasteful.
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
Also, consider that alternative to variable length instructions is to
use longer instructions or more of them.
What I did instead was use variable-length instructions, but add a prefix >>> at the beginning of any 256-bit block of instructions that contained them >>> which directly showed where each instruction began.
My intent was to avoid the disadvantages you identify for fixed-length
instructions, but avoid the disadvantage of variable-length instructions >>> too.
I understand your goal, however . . .
How many bits do you "waste" on the prefix?
For code like:: (undoctored compiler output)
.LBB1_34:
mov r1,#0
br .LBB1_35
.LBB1_39:
mov r1,#28
exit r16,r0,0,32
.LBB1_36:
mov r1,#0
exit r16,r0,0,32
.LBB1_37:
call processRestart
beq0 r1,.LBB1_38
.LBB1_35:
exit r16,r0,0,32
.LBB1_38:
lduh r1,[ip,gRestartsLeft]
br .LBB1_2
.Lfunc_end1:
You are going to eat a lot of headers for 2 instruction BBs.
Also:: Can you return to the middle of a block ??
Can you make multiple CALLs from a single block ??
Can you fit an entire loop in a single block ??
Can you fit a loop with calls in a single block ??
Does each unique label of a switch(i) require its own block ??
Since I think any branch must target the beginning of a block, and in
general, a routine will not end on a block boundary, there will be
"wasted" bits at the end of the last block before a "label". Have you
determined for a "typical" program, how many bits are wasted due to this?
The point I am making is that you will "cancel" at least some of the
savings of 16 bit instructions. You should take this into account
before committing to your plan.
I suspect the block-boundaries will consume more space than the 16-bit instructions can possibly save.
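A back-of-envelope way to quantify that concern (my arithmetic, using
Stephen's assumption that every branch target starts a fresh 256-bit block,
and an assumed header size H bits, which the scheme as posted does not pin
down): a basic block of n instruction bits occupies ceil((n + H) / 256) * 256
bits, so the two-instruction blocks above pay for a header plus the padding
out to the next 256-bit boundary.

    #include <stdint.h>

    /* Footprint of one basic block under the block-header scheme,
       assuming each branch target must begin a new 256-bit block. */
    static uint64_t block_footprint_bits(uint64_t insn_bits, uint64_t header_bits)
    {
        uint64_t total = insn_bits + header_bits;
        return ((total + 255) / 256) * 256;
    }
    /* e.g. two 16-bit instructions with a 16-bit header:
       block_footprint_bits(32, 16) == 256, i.e. 208 bits of padding. */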
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:The 2-operand+displacement LD/STs have a lock bit in the instruction-- >>>>>> that
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the >>>>>>>>>> constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for >>>>>>>>> doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no >>>>>>>> different than any other calculation, except that no mangling of the >>>>>>>> bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you >>>>>>>>> would
know all about that, there must be a reason why it doesn't apply in >>>>>>>>> these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember >>>>>>> something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect.. >>>>>>
is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant. >>>>>>
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless >>>> impl. If we on another system and that LOCK XADD is some sort of LL/SC >>>> "style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size [address]
ADD Rd,--,#1
Will attempt a ATOMIC add to L1 cache. If line is writeable, ADD is
performed and line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state the add is performed and
the previous value returned. If [address] is not writeable the cache
line in invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is typical.} >>>
When [address] reached Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "bus" traffic under
contention. So you DON'T DO IT LIKE THAT.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
A bit of statistics on that.
Using a primitive Perl script to catch occurrences, on a recent
My 66000 compiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
My thoughts were something along the lines of having fat instructions
like a 3-in 2-out 2-op instruction that does:
Rd1 <= Rs1 OP1 Rs2;
Rd2 <= Rs3 OP2 Rd1;
so your datapath has two ALUs back to back in a single cycle.
SuperSPARC tried this, it does not work "all that well".
One might notice that none of the SPARC generations were anywhere close to the frequency of the more typical RISCs.
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
A bit of statistics on that.
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The % associations you measured above might just be coincidence.
I have assumed for a compiler to choose between two instruction formats,
a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
that the register allocator would check if either operand was alive after
the OP, and if not then that source register can be reused as the dest.
For some ISA that may allow a shorter instruction format to be used.
Your stats above assume the compiler is performing this optimization
but since My 66000 does not have short format instructions the compiler
would have no reason to do so. Or the compiler might be doing this
optimization anyway for other ISAs such as x86/x64 which do have
shorter formats.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3- register formats like RV where there
is an incentive to do this optimization might provide stats confirmation.
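A sketch of the choice EricP describes, under the assumption that
per-operand liveness is already known (hypothetical types and names, not
taken from any real compiler): if a source register dies at the op, the
allocator can give the destination that same register, which is what
enables the shorter 2-register Rsd = Rsd OP Rs encoding.

    /* 3-register op rd = rs1 OP rs2, with liveness of each source
       immediately after this op. */
    typedef struct {
        int rs1, rs2;
        int rs1_dead, rs2_dead;
        int commutative;
    } op3;

    /* Pick a destination register: reuse a dying source if possible
       (short encoding), otherwise fall back to a fresh register and
       the 3-register form. */
    static int choose_dest(const op3 *op, int fresh_reg)
    {
        if (op->rs1_dead)                    return op->rs1;
        if (op->rs2_dead && op->commutative) return op->rs2;
        return fresh_reg;
    }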
Thomas Koenig wrote:
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3- register formats like RV where there
is an incentive to do this optimization might provide stats confirmation.
I wonder if there have been other studies to explore other impacts
such as run time, or cache miss rate.
On 12/28/2025 5:53 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:The 2-operand+displacement LD/STs have a lock bit in the instruction-- >>>>>> that
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the >>>>>>>>>> constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for >>>>>>>>> doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no >>>>>>>> different than any other calculation, except that no mangling of the >>>>>>>> bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you >>>>>>>>> would
know all about that, there must be a reason why it doesn't apply in >>>>>>>>> these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect.. >>>>>>
is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant. >>>>>>
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless >>>> impl. If we on another system and that LOCK XADD is some sort of LL/SC >>>> "style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size [address]
ADD Rd,--,#1
Will attempt a ATOMIC add to L1 cache. If line is writeable, ADD is
performed and line updated. Otherwise, the Add-to-memory #1 is shipped >>> out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state the add is performed and
the previous value returned. If [address] is not writeable the cache
line in invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is typical.} >>>
When [address] reached Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under contention. So you DON"T DO IT LIKE THAT.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
IMHO:
No-Cache + CAS is probably a better bet than LL/SC;
LL/SC: Depends on the existence of explicit memory-coherency features.
No-Cache + CAS: Can be made to work independent of the underlying memory model.
Granted, No-Cache is its own feature:
Need some way to indicate to the L1 cache that special handling is
needed for this memory access and cache line (that it should not use a previously cached value and should be flushed immediately once the
operation completes).
But, No-Cache behavior is much easier to fake on a TSO capable memory subsystem, than it is to accurately fake LL/SC on top of weak-model write-back caches.
If the memory system implements TSO or similar, then one can simply
ignore the No-Cache behavior and achieve the same effect.
..
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
A bit of statistics on that.
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The % associations you measured above might just be coincidence.
I have assumed for a compiler to choose between two instruction formats,
a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
that the register allocator would check if either operand was alive after
the OP, and if not then that source register can be reused as the dest.
For some ISA that may allow a shorter instruction format to be used.
Your stats above assume the compiler is performing this optimization
but since My 66000 does not have short format instructions the compiler
would have no reason to do so. Or the compiler might be doing this optimization anyways for other ISA such as x86/x64 which do have
shorter formats.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3- register formats like RV where there
is an incentive to do this optimization might provide stats confirmation.
I wonder if there have been other studies to explore other impacts
such as run time, or cache miss rate.
The difficulty there is standardising the input data, and normalising
processor performance, memory bandwidth and latency, etc.
Code segment size is much easier to measure.
BTW, when discussing ISA compactness, I usually see it measured by
comparing the size of the code segment in typical executables.
I understand that it's as good a measure as any and it's one that's
fairly easily available, but at the same time it's not necessarily one
that actually matters since I expect that it affects a usually fairly
small proportion of the total ROM/Flash/disk space.
I wonder if there have been other studies to explore other impacts such
as run time, or cache miss rate.
Stefan
EricP [2025-12-29 09:54:30] wrote:
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
A bit of statistics on that.
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The % associations you measured above might just be coincidence.
I have assumed for a compiler to choose between two instruction formats,
a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
that the register allocator would check if either operand was alive after the OP, and if not then that source register can be reused as the dest.
For some ISA that may allow a shorter instruction format to be used.
Your stats above assume the compiler is performing this optimization
but since My 66000 does not have short format instructions the compiler would have no reason to do so. Or the compiler might be doing this optimization anyways for other ISA such as x86/x64 which do have
shorter formats.
So the % numbers you measured might just be coincidence and could be low. An ISA with both short 2- and long 3- register formats like RV where there is an incentive to do this optimization might provide stats confirmation.
My thoughts were something along the lines of having fat instructions
like a 3-in 2-out 2-op instruction that does:
Rd1 <= Rs1 OP1 Rs2;
Rd2 <= Rs3 OP2 Rd1;
so your datapath has two ALUs back to back in a single cycle.
SuperSPARC tried this, it does not work "all that well".
Do you have a reference to that? I can't see any trace of that in the
SPARC ISA, so I assume it was done via instruction fusion instead?
One might notice that None of the SPARC generations were anywhere close to the frequency of the more typical RISCs.
Hmm... I remember Sun being slower to move to OoO, but in terms of
frequency I thought they were mostly on par with other RISCs of the time
(and back then, SPARC was one of the top two "typical RISCs", AFAIK).
Stefan
EricP <ThatWouldBeTelling@thevillage.com> writes:
Thomas Koenig wrote:
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The RISC-V people brag about how little their compressed encoding
costs to decode; IIRC it's in the hundreds of something (not sure if transistors or gates). Of course, with superscalar decoding the
compressed instruction set costs additional decoders plus logic to
select which decodings do not belong to actual instructions, but
that's true for any 16+32-bit encoding, however simple.
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
A bit of statistics on that.
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The % associations you measured above might just be coincidence.
I have assumed for a compiler to choose between two instruction formats,
a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
that the register allocator would check if either operand was alive after
the OP, and if not then that source register can be reused as the dest.
For some ISA that may allow a shorter instruction format to be used.
Your stats above assume the compiler is performing this optimization
but since My 66000 does not have short format instructions the compiler
would have no reason to do so. Or the compiler might be doing this optimization anyways for other ISA such as x86/x64 which do have
shorter formats.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3- register formats like RV where there
is an incentive to do this optimization might provide stats confirmation.
Do you have a reference to that? I can't see any trace of that in the
SPARC ISA, so I assume it was done via instruction fusion instead?
It is not in ISA, and it is not "like" instruction Fusion, either.
When a first instruction had a property*, and a second instruction also
had a certain property*, they would be issued together into the execution pipeline. The first instruction executes in the first cycle, the second instruction in the second cycle with forwarding of the result of the first
to the second.
On 12/28/2025 5:53 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within >>>>>>>>>>> the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm >>>>>>>>>> mistaken
here...
when I read this, I thought that there was a standard
technique for
doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That >>>>>>>>> is no
different than any other calculation, except that no mangling >>>>>>>>> of the
bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since >>>>>>>>>> you
would
know all about that, there must be a reason why it doesn't >>>>>>>>>> apply in
these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I >>>>>>>> remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, >>>>>>>> ect..
The 2-operand+displacement LD/STs have a lock bit in the
instruction--
that
is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant. >>>>>>>
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction- >>>>>>> Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a
loopless
impl. If we on another system and that LOCK XADD is some sort of LL/SC >>>>> "style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size  [address]
ADD                    Rd,--,#1
Will attempt a ATOMIC add to L1 cache. If line is writeable, ADD is
performed and line updated. Otherwise, the Add-to-memory #1 is shipped >>>> out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state the add is performed and
the previous value returned. If [address] is not writeable the cache
line in invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is
typical.}
When [address] reached Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under
contention. So you DON"T DO IT LIKE THAT.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not
Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
IMHO:
No-Cache + CAS is probably a better bet than LL/SC;
LL/sC: Depends on the existence of explicit memory-coherency features. No-Cache + CAS: Can be made to work independent of the underlying memory model.
Granted, No-Cache is its own feature:
Need some way to indicate to the L1 cache that special handling is
needed for this memory access and cache line (that it should not use a previously cached value and should be flushed immediately once the
operation completes).
But, No-Cache behavior is much easier to fake on a TSO capable memory subsystem, than it is to accurately fake LL/SC on top of weak-model write-back caches.
If the memory system implements TSO or similar, then one can simply
ignore the No-Cache behavior and achieve the same effect.
...
Fwiw, I noticed that a certain compiler was implementing LOCK XADD with
a LOCK CMPXCHG loop and got a little pissed. Had to tell them about it:
read all when you get some free time to burn:
https://forum.pellesc.de/index.php?topic=7167.msg27217#msg27217
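For anyone wondering what the objectionable lowering looks like, here is a
C11 sketch of both shapes (an editor's example, not the compiler's actual
output): atomic_fetch_add can map to a single LOCK XADD, while the
emulation Chris found spins on compare-exchange and is no longer loop-free
under contention.

    #include <stdatomic.h>

    /* Preferred: maps to LOCK XADD (or an add-to-memory) on ISAs
       that have one; returns the previous value. */
    static long xadd_direct(_Atomic long *p, long v)
    {
        return atomic_fetch_add(p, v);
    }

    /* The emulation: correct, but a retry loop rather than a single
       wait-free instruction. */
    static long xadd_via_cas(_Atomic long *p, long v)
    {
        long old = atomic_load_explicit(p, memory_order_relaxed);
        while (!atomic_compare_exchange_weak(p, &old, old + v))
            ;  /* 'old' is reloaded by the failed exchange */
        return old;
    }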
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
Not a typo--the part of the pipeline which is <dynamically> narrowest is
the part that limits performance. I suggest strongly that you should not make/allow the decoder to play that part.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Thomas Koenig wrote:
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The RISC-V people brag about how little their compressed encoding
costs to decode; IIRC it's in the hundreds of something (not sure if transistors or gates). Of course, with superscalar decoding the
compressed instruction set costs additional decoders plus logic to
select which decodings do not belong to actual instructions, but
that's true for any 16+32-bit encoding, however simple.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3- register formats like RV where there >> is an incentive to do this optimization might provide stats confirmation.
I have done the following on a RV64GC system with Fedora 33:
objdump -d /lib64/lp64d/libperl.so.5.32|grep '^ *[0-9a-f]*:'|awk '{print length($2)}'|sort|uniq -c
215782 4
179493 8
16-bit instructions are reported as 4 (4 hex digits), 32-bit
instructions are reported as 8.
If the actual binary /usr/bin/perl is meant, here's the stats for that:
objdump -d /usr//bin/perl|grep '^ *[0-9a-f]*:'|awk '{print length($2)}'|sort|uniq -c
105 4
167 8
gnuplot is not installed, and GSL is not installed, either, whatever
it may be.
Just to widen the basis, here are a few more:
zstd:
129569 4
134985 8
git:
305090 4
274053 8
/usr/lib64/libc-2.32.so:
142208 4
113455 8
So the percentage of 16-bit instructions is a lot higher than for the
schemes that Thomas Koenig has looked at.
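(Concretely--my arithmetic from the counts above--that is roughly 55%
16-bit instructions for libperl, 49% for zstd, 53% for git, and 56% for
libc, versus the 15-24% measured for the shapes counted earlier.)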
Another way to approach this question is to look at the current
champion of fixed instruction width, ARM A64, consider those
instructions (and addressing modes) that ARM A64 has and RISC-V does
not have, and look at how often they are used, and how many RISC-V instructions are needed to replace them.
In any case, code density measurements show that both result in
compact code, with RV64GC having more compact code, and actually
having the most compact code among the architectures present in all
rounds of my measurements where RV64GC was present.
But code size is not everything. For ARM A64, you pay for it by the increased complexity of implementing these instructions (in particular
the many register ports) and addressing modes. For bigger
implementations, instruction combining means additional front-end
effort for RISC-V, and then maybe similar implementation effort for
the combined instructions as for ARM A64 (but more flexibility in
selecting which instructions to combine). And, as mentioned above,
the additional decoding effort.
When we look at actual implementations, RISC-V has not reached the
widths that ARM A64 has reached, but I guess that this is more due to
the current potential markets for these two architectures than due to technical issues. RISC-V seems to be pushing into server space
lately, so we may see wider implementations in the not-too-far future.
- anton
I wonder if there have been other studies to explore other impacts
such as run time, or cache miss rate.
The difficulty there is standardising the input data, and normalising
processor performance, memory bandwidth and latency, etc.
I was thinking of those "compressed" variants of ISAs, such as Thumb,
Thumb2, MIPS16e, microMIPS, or the "C" option of RISC-V, where you can compare with/without on the very same machine since all the half-size instructions are also available in full-size.
Code segment size is much easier to measure.
Yes, but!
Stefan
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
Things like ALU status flags aren't free either.
Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.
Major limitations here being more:
Things like register forwarding cost have non-linear scaling;
For an in-order machine, usable ILP drops off very rapidly;
...
There seems to be a local optimum between 2 and 3.
Say, for example, if one had an in-order machine with 5 ALUs, one would
be hard pressed to find much code that could actually make use of the 5
ALUs. One can sorta make use of 3 ALUs, but even then, the 3rd lane is
more often useful for spare register ports and similar (with 3-wide ALU
being a minority case).
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is done
it flips back to causal consistency.
TSO is cycle-wasteful.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which
then reduces ILP due to register conflicts. So, smaller code at the
expense of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a compressed encoding.
Write-after-read and write-after-write do not reduce the IPC of OoO implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And
designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.
Things like ALU status flags aren't free either.
Yes, they cost their own renaming resources.
Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.
Major limitations here being more:
Things like register forwarding cost have non-linear scaling;
For an in-order machine, usable ILP drops off very rapidly;
...
ILP is a property of a program. I assume that what you mean is that
the IPC benefits of more width have quickly diminishing returns on
in-order machines.
There seems to be a local optimum between 2 and 3.
Say, for example, if one had an in-order machine with 5 ALUs, one
would be hard pressed to find much code that could actually make use
of the 5 ALUs. One can sorta make use of 3 ALUs, but even then, the
3rd lane is more often useful for spare register ports and similar
(with 3-wide ALU being a minority case)
We have some interesting case studies: The Alpha 21164(a) and the ARM Cortex-A53 and A55. They all are in-order designs, their number of functional units are pretty similar, and, in particular, they all have
2 integer ALUs. But the 21164 can decode and execute 4 instructions
per cycle, while the Cortex-A53 and A55 are only two-wide. My guess
is that this is due to the decoding cost of ARM A32/T32 and A64
(decoders for two instruction sets, one of which has 16-bit and 32-bit instructions).
The Cortex-A55 was succeeded by the A510, which is three-wide, and
that was succeeded by the A520, which is three-wide with two ALUs and supports only ARM A64.
Widening the A510, which still supports both instruction sets is
(weak) counterevidence for my theory about why A53/A55 are only
two-wide at decoding. The fact that the A520 returns to two integer
ALUs indicates that the third integer ALU provides little IPC benefit
in an in-order design.
- anton
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is done
it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
Thomas Koenig <tkoenig@netcologne.de> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
"All swans are white" has been "experimentally verified" by finding
one white swan. The existence of black swans shows that such
"experimental verifications" are fallacies.
Actually the existence of a weak-memory-model mode on specific
hardware makes it very likely that TSO is slower than the weak model
on that hardware. If TSO was implemented to provide the same speed,
there would be no need for also providing a weaker memory ordering
mode on that hardware.
Similarly, the introduction of the TRAPB instruction on the Alpha, and
the fact that using it through -mieee-fp slows down execution on the 21064-21164A could be construed as "experimental verification" of the
claims "IEEE FP (in particular denormal numbers) is cycle-wasteful"
and "precise FP exceptions are cycle-wasteful". Then the black swan
(21264) appeared, where TRAPB is a noop, and it outperformed all
earlier Alphas.
Note that "some Nvidia and Fujitsu [ARM architecture] implementations
run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
that the Fujitsu implementations of the ARM architecture are used in supercomputers, it is unlikely that their TSO implementation is cycle-wasteful.
- anton
Do you happen to have benchmarks that compare performance of Alpha EV5
vs in-order Cortex-A ?
Thomas Koenig <tkoenig@netcologne.de> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
"All swans are white" has been "experimentally verified" by finding one
white swan. The existence of black swans shows that such
"experimental verifications" are fallacies.
Actually the existence of a weak-memory-model mode on specific
hardware makes it very likely that TSO is slower than the weak model
on that hardware. If TSO was implemented to provide the same speed,
there would be no need for also providing a weaker memory ordering
mode on that hardware.
Similarly, the introduction of the TRAPB instruction on the Alpha, and
the fact that using it through -mieee-fp slows down execution on the 21064-21164A could be construed as "experimental verification" of the
claims "IEEE FP (in particular denormal numbers) is cycle-wasteful"
and "precise FP exceptions are cycle-wasteful". Then the black swan
(21264) appeared, where TRAPB is a noop, and it outperformed all
earlier Alphas.
Note that "some Nvidia and Fujitsu [ARM architecture] implementations
run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
that the Fujitsu implementations of the ARM architecture are used in supercomputers, it is unlikely that their TSO implementation is cycle-wasteful.
Michael S <already5chosen@yahoo.com> writes:
Do you happen to have benchmarks that compare performance of Alpha
EV5 vs in-order Cortex-A ?
LaTeX benchmark results (lower is better):
Alpha:
- 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5)               8.1
ARM A32/T32:
- Raspberry Pi 3, Cortex A53 1.2GHz, Raspbian 8                   5.46
ARM A64:
- Rockpro64 (1416MHz Cortex A53), Debian 9 (Stretch)              3.24
- Odroid N2 (1896MHz Cortex A53), Ubuntu 18.04                    2.488
- Odroid C2 (1536MHz Cortex A53), Ubuntu 16.04                    2.32
- Rock 5B (1805MHz A55), Debian 11 (texlive-latex-recommended)    2.105
A problem with the LaTeX benchmark is that its performance is
significantly influenced by the LaTeX installation (newer versions
need more instructions, and having more packages needs more
instructions). But these are the only benchmark results I have.
- anton
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is done
it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning
avoidance of RFO (read for ownership) which is an elephant in the room
of discussion of advantages of Arm MOM/MCM over TSO.
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning avoidance of RFO (read for ownership) which is an elephant in the
room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to >overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
Michael S <already5chosen@yahoo.com> writes:
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to
overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
Any sequence of stores without intervening loads can be turned into
one store under sequential consistency, and therefore also under the
weaker TSO. Doing that for a sequence that stores into one cache line
does not appear particularly heroic to me. The question is how much
benefit one gets from this optimization.
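For what it's worth, a minimal C sketch of the kind of store sequence in question (the type and function names are mine, and a 64-byte cache line is assumed): a run of back-to-back 64-bit stores with no intervening loads, which a core or compiler could in principle merge into a single full-line write, making the RFO's data transfer unnecessary.

    #include <stdint.h>

    /* Hypothetical 64-byte object assumed to map onto one cache line
       when suitably aligned. */
    typedef struct {
        uint64_t w[8];                  /* 8 x 8 bytes = 64 bytes */
    } line64_t;

    /* Overwrites the whole (assumed) line with stores only, no loads in
       between; this is the kind of sequence that could be collapsed into
       one full-line write under TSO or sequential consistency. */
    void fill_line(line64_t *p, uint64_t v)
    {
        for (int i = 0; i < 8; i++)
            p->w[i] = v;
    }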
On 12/28/2025 4:41 PM, BGB wrote:
[...]
Also, if using something like LOCK CMPXCHG you MUST make sure to align
and pad your relevant data structures to an L2 cache line.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
"All swans are white" has been "experimentally verified" by finding one
white swan. The existence of black swans shows that such
"experimental verifications" are fallacies.
If you have any counter-examples, please feel free to cite them.
Actually the existence of a weak-memory-model mode on specific
hardware makes it very likely that TSO is slower than the weak model
on that hardware. If TSO was implemented to provide the same speed,
there would be no need for also providing a weaker memory ordering
mode on that hardware.
That makes little sense. TSO fulfills all the requirements of the
ARM memory model, and adds some on top. It is possible to create
an ARM CPU which uses TSO, as you wrote below. If that had
the same performance as a CPU running on the pure ARM memory model,
there would be no reason to implement the ARM memory model at all.
So, two alternatives:
a) Apple engineers did not have the first clue what they were doing
b) Apple engineers knew what they were doing
Given that Apple silicon seems to be competently done, I personally
think that option b) is the better one. You obviously prefer
option a).
Note that "some Nvidia and Fujitsu [ARM architecture] implementations
run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
that the Fujitsu implementations of the ARM architecture are used in
supercomputers, it is unlikely that their TSO implementation is
cycle-wasteful.
So, you're saying that Apple is clueless, and that Nvidia and Fujitsu
make that decision solely on the basis of speed?
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning
avoidance of RFO (read for ownership) which is an elephant in the
room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
With weaker MOM the core has option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like Neoverse
N1, are able to do it.
I can imagine heroic microarchitecture that achieves the same effect
with TSO, but it seems that so far nobody did it.
On Tue, 30 Dec 2025 11:13:37 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
Do you happen to have benchmarks that compare performance of Alpha
EV5 vs in-order Cortex-A ?
LaTeX benchmark results (lower is better):
Alpha:
- 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5)               8.1
ARM A32/T32:
- Raspberry Pi 3, Cortex A53 1.2GHz, Raspbian 8                   5.46
ARM A64:
- Rockpro64 (1416MHz Cortex A53), Debian 9 (Stretch)              3.24
- Odroid N2 (1896MHz Cortex A53), Ubuntu 18.04                    2.488
- Odroid C2 (1536MHz Cortex A53), Ubuntu 16.04                    2.32
- Rock 5B (1805MHz A55), Debian 11 (texlive-latex-recommended)    2.105
A problem with the LaTeX benchmark is that its performance is
significantly influenced by the LaTeX installation (newer versions
need more instructions, and having more packages needs more
instructions). But these are the only benchmark results I have.
- anton
Thank you.
Two 64-bit A53 results are about the same as EV5 clock for clock, and one
result is significantly better.
So, either wide in-order is indeed not a bright idea, or the 21164 suffers
because of the inferiority of the Alpha ISA relative to ARM64.
BTW, the Odroid C2 score appears suspiciously good. Could it be that the
actual turbo clock frequency was much higher than reported?
BGB <cr88192@gmail.com> posted:
On 12/28/2025 5:53 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) one needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations one needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here... when I read this, I thought that there was a standard technique
for doing stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no
different than any other calculation, except that no mangling of the
bits is going on.
Just break down all the fancy instructions into RISC-style pseudo-ops.
But apparently, since you would know all about that, there must be a
reason why it doesn't apply in these cases.
x86 has short/small MOV instructions. Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc.
The 2-operand+displacement LD/STs have a lock bit in the instruction--that
is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and it's ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
impl. If we are on another system and that LOCK XADD is some sort of LL/SC
"style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size [address]
ADD Rd,--,#1
Will attempt an ATOMIC add to the L1 cache. If the line is writeable, the ADD is
performed and the line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable state the add is performed and
the previous value returned. If [address] is not writeable the cache
line is invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is typical.}
When [address] reached Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under
contention. So you DON'T DO IT LIKE THAT.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not
Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
IMHO:
No-Cache + CAS is probably a better bet than LL/SC;
LL/SC: Depends on the existence of explicit memory-coherency features.
No-Cache + CAS: Can be made to work independent of the underlying memory
model.
Granted, No-Cache is its own feature:
Need some way to indicate to the L1 cache that special handling is
needed for this memory access and cache line (that it should not use a
previously cached value and should be flushed immediately once the
operation completes).
But, No-Cache behavior is much easier to fake on a TSO capable memory
subsystem, than it is to accurately fake LL/SC on top of weak-model
write-back caches.
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is done
it flips back to causal consistency.
TSO is cycle-wasteful.
If the memory system implements TSO or similar, then one can simply
ignore the No-Cache behavior and achieve the same effect.
..
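As a side note, here is a minimal C11 sketch (the function names are mine, not from the thread) of the two shapes discussed above: the direct atomic fetch-and-add, which x86 can compile to a loopless LOCK XADD, versus a CAS-loop emulation, which is lock-free but not wait-free since a thread may retry under contention.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Loopless on x86: typically lowered to LOCK XADD. */
    uint64_t fetch_add_direct(_Atomic uint64_t *ctr)
    {
        return atomic_fetch_add_explicit(ctr, 1, memory_order_seq_cst);
    }

    /* CAS-loop emulation: lock-free, but a thread can in principle retry
       indefinitely under contention, so it is not wait-free. */
    uint64_t fetch_add_cas(_Atomic uint64_t *ctr)
    {
        uint64_t old = atomic_load_explicit(ctr, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(ctr, &old, old + 1,
                   memory_order_seq_cst, memory_order_relaxed))
            ;   /* 'old' is refreshed with the current value on failure */
        return old;
    }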
On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
Not a typo--the part of the pipeline which is <dynamically> narrowest is
the part that limits performance. I suggest strongly that you should not
make/allow the decoder to play that part.
I agree - and strongly, too - that the decoder ought not to be the part
that limits performance.
But what I quoted says that the execution unit ought not to be the part
that limits performance, with the implication that it's OK if the decoder does instead. That's why I said it must be a typo.
So I think you need to look a second time at what you wrote; it's natural for people to see what they expect to see, and so I think you looked at
it, and didn't see the typo that was there.
John Savard
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning avoidance of RFO (read for ownership) which is an elephant in the
room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
Michael S wrote:
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning
avoidance of RFO (read for ownership) which is an elephant in the
room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
With weaker MOM the core has option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like Neoverse N1, are able to do it.
I can imagine heroic microarchitecture that achieves the same effect
with TSO, but it seems that so far nobody did it.
I don't see how a ReadForOwnership message can be avoided as it
transfers two things: the ownership state, and the current line data.
Even if the core knows the whole cache line is being overwritten and
doesn't need the line data, it still needs the Owned state transfer.
There would still be a request message, say TakeOwner TKO which
has a smaller reply GiveOwner GVO message and just moves the state.
So the reply is a few less flits.
As I understand it...
Independent of the ReadForOwnership message, the ARM weak coherence model should allow stores to other cache lines to proceed, whereas TSO would require younger stores to (appear to) wait until the older store completes.
The weak coherence model allows the cache to use hit-under-miss for
stores because it doesn't require the store order to different locations
be seen in program order. This allows it to overlap younger store cache
hits with the older ReadForOwnership message, not eliminate it.
On 12/29/2025 1:55 PM, MitchAlsup wrote:
----merciful snip--------------
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is done
it flips back to causal consistency.
TSO is cycle-wasteful.
But, yeah, was not arguing for using TSO here, rather noting that if one
has it, then No-Cache can be ignored for CAS.
But, then again, weak model is cheaper to implement and generally
faster, although explicit synchronization is annoying and such a model
is incompatible with "lock free data structures" (which tend to
implicitly assume that memory accesses occur in the same order as
written and that any memory stores are immediately visible across threads).
But, then again, one is left with one of several options:
Ask that people use a mutex whenever accessing any resource that may be modified between threads and where such modifications are functionally important;
Or, alternatively, use a message passing scheme, where message passing
can potentially be done more cheaply than a mutex (can be done in
premise using No-Cache memory rather than needing an L1 flush).
Well, or use write-through caching, which would mostly have the merit of allowing for cheaper cache flushes (at the expense of being slower in general than write-back caching).
In the case of RISC-V, there are the FENCE and FENCE.I instructions.
In my implementation, they are user-land only, and need to be
implemented as, say:
FENCE traps, and then performs an L1 flush.
FENCE.I traps, and then performs an L1 flush,
then also flushes the I$.
There is CBO, which allows:
CBO.FLUSH Reg
Where Reg gives the address of a cache line, which is then flushed from
the L1 D$. This at least maps over.
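For what it's worth, a minimal C11 sketch of where such FENCE instructions usually come from at the source level (the variable names are hypothetical); compilers for RISC-V typically lower these language-level fences to FENCE instructions, which in the implementation described above would trap and flush the L1.

    #include <stdatomic.h>

    extern _Atomic int ready;   /* hypothetical flag */
    extern int payload;         /* hypothetical data published via the flag */

    void producer(void)
    {
        payload = 42;
        atomic_thread_fence(memory_order_release);   /* typically a FENCE */
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }

    int consumer(void)
    {
        while (!atomic_load_explicit(&ready, memory_order_relaxed))
            ;                                        /* spin until published */
        atomic_thread_fence(memory_order_acquire);   /* typically a FENCE */
        return payload;
    }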
John Savard wrote:
On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
Not a typo--the part of the pipeline which is <dynamically> narrowest is
the part that limits performance. I suggest strongly that you should not
make/allow the decoder to play that part.
I agree - and strongly, too - that the decoder ought not to be the part that limits performance.
But what I quoted says that the execution unit ought not to be the part that limits performance, with the implication that it's OK if the decoder does instead. That's why I said it must be a typo.
So I think you need to look a second time at what you wrote; it's natural for people to see what they expect to see, and so I think you looked at it, and didn't see the typo that was there.
John Savard
There are two kinds of stalls:
stalls in the serial front end I-cache, Fetch or Decode stages because
of *too little work* (starvation due to input latency),
and stalls in the back end Execute or Writeback stages because
of *too much work* (resource exhaustion).
The front end stalls inject bubbles into the pipeline,
whereas back end stalls can allow younger bubbles to be compressed out.
If I have to stall, I want it in the back end.
It has to do with catching up after a stall.
If a core stalls for 3 clocks, then in order to average 1 IPC
it must retire 2 instructions per clock for the next 3 clocks.
And it can only do that if it has a backlog of work ready to execute.
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And
designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.
Things like ALU status flags aren't free either.
Yes, they cost their own renaming resources.
Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.
Major limitations here being more:
Things like register forwarding cost have non-linear scaling;
For an in-order machine, usable ILP drops off very rapidly;
...
ILP is a property of a program. I assume that what you mean is that
the IPC benefits of more width have quickly diminishing returns on
in-order machines.
There seems to be a local optimum between 2 and 3.
Say, for example, if one had an in-order machine with 5 ALUs, one would
be hard pressed to find much code that could actually make use of the 5
ALUs. One can sorta make use of 3 ALUs, but even then, the 3rd lane is
more often useful for spare register ports and similar (with 3-wide ALU
being a minority case)
We have some interesting case studies: The Alpha 21164(a) and the ARM Cortex-A53 and A55. They all are in-order designs, their number of functional units are pretty similar, and, in particular, they all have
2 integer ALUs. But the 21164 can decode and execute 4 instructions
per cycle, while the Cortex-A53 and A55 are only two-wide. My guess
is that this is due to the decoding cost of ARM A32/T32 and A64
(decoders for two instruction sets, one of which has 16-bit and 32-bit instructions).
The Cortex-A55 was succeeded by the A510, which is three-wide, and
that was succeeded by the A520, which is three-wide with two ALUs and supports only ARM A64.
Widening the A510, which still supports both instruction sets is
(weak) counterevidence for my theory about why A53/A55 are only
two-wide at decoding. The fact that the A520 returns to two integer
ALUs indicates that the third integer ALU provides little IPC benefit
in an in-order design.
- anton
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 12/28/2025 4:41 PM, BGB wrote:
[...]
Also, if using something like LOCK CMPXCHG you MUST make sure to align
and pad your relevant data structures to a l2 cache line.
That may not be necessary if there is otherwise no false sharing in
the same cache line. Yes, the operand should be naturally aligned,
(which ensures it is entirely contained within a single cache line),
but there's no reason that other data cannot be stored in the same
cache line, so long as it is unlikely to be accessed by a competing
thread.
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Any OoO machine is also likely to have a lot of RAM and a decent sized
I$, so much of any benefit is likely to go away in this case.
ILP is a property of a program. I assume that what you mean is that
the IPC benefits of more width have quickly diminishing returns on
in-order machines.
The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is affected by the processor implementation.
I did not write anything about the clue of Apple. I don't know much
about the CPUs by Nvidia and Fujitsu. But if there was significant performance to be had by adding a weakly-ordered mode, wouldn't
especially Fujitsu with its supercomputer target have done it?
BGB <cr88192@gmail.com> posted:
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
AUIPC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit
minimum.
Any OoO machine is also likely to have a lot of RAM and a decent
sized I$, so much of any benefit is likely to go away in this case.
On 12/30/2025 12:00 PM, Scott Lurndal wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 12/28/2025 4:41 PM, BGB wrote:
[...]
Also, if using something like LOCK CMPXCHG you MUST make sure to align
and pad your relevant data structures to a l2 cache line.
That may not be necessary if there is otherwise no false sharing in
the same cache line. Yes, the operand should be naturally aligned,
(which ensures it is entirely contained within a single cache line),
but there's no reason that other data cannot be stored in the same
cache line, so long as it is unlikely to be accessed by a competing
thread.
Yes, or the "small brain" option of just making the mutex larger than
the size of the cache line and putting the relevant part in the middle...
struct PaddedMutex_s {
    u64 pad1, pad2, pad3;
    u64 real_part;
    u64 pad4, pad5, pad6;
};
Then say (assuming a 32 byte cache line), no non-pad values can be in
the same cache line as real_part.
Little bigger for a 64 byte cache line, but same general idea.
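A common alternative to over-padding, sketched below in C11 (64 bytes is an assumed line size, and the names are mine): align the structure to the cache-line size and pad it out to exactly one line, so nothing else can share the line with the mutex word.

    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed cache line size */

    struct AlignedMutex {
        _Alignas(CACHE_LINE) uint64_t real_part;     /* starts on a line boundary */
        uint8_t pad[CACHE_LINE - sizeof(uint64_t)];  /* fill out the rest of the line */
    };

    /* One mutex occupies exactly one line, so no other data can share it. */
    _Static_assert(sizeof(struct AlignedMutex) == CACHE_LINE,
                   "mutex should occupy exactly one cache line");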
On Tue, 30 Dec 2025 21:36:29 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
BGB <cr88192@gmail.com> posted:
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
AUIPC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit
minimum.
Any OoO machine is also likely to have a lot of RAM and a decent
sized I$, so much of any benefit is likely to go away in this case.
And where do we have 95% of those small in-order machines? We have them
in flash-based micro-controllers, more often than not without I$, more
often than not running at higher clock than sustainable without wait
states by their program flash with 32-bit data bus. In other words, bottlenecked by instruction fetch before anything else, including
decode.
BGB trained his intuition on soft cores in FPGA, where trade-offs are
completely different. I am a heavy user of soft cores too. But I realize
that 32-bit MCU cores outsell soft cores by more than an order of
magnitude and quite likely by more than 2 orders of magnitude.
Michael S <already5chosen@yahoo.com> writes:
On Tue, 30 Dec 2025 11:13:37 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
Do you happen to have benchmarks that compare performance of Alpha
EV5 vs in-order Cortex-A ?
LaTeX benchmark results (lower is better):
Alpha:
- 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5)               8.1
ARM A32/T32:
- Raspberry Pi 3, Cortex A53 1.2GHz, Raspbian 8                   5.46
ARM A64:
- Rockpro64 (1416MHz Cortex A53), Debian 9 (Stretch)              3.24
- Odroid N2 (1896MHz Cortex A53), Ubuntu 18.04                    2.488
- Odroid C2 (1536MHz Cortex A53), Ubuntu 16.04                    2.32
- Rock 5B (1805MHz A55), Debian 11 (texlive-latex-recommended)    2.105
A problem with the LaTeX benchmark is that its performance is
significantly influenced by the LaTeX installation (newer versions
need more instructions, and having more packages needs more
instructions). But these are the only benchmark results I have.
- anton
Thank you.
Two 64-bit A53 results are about the same as EV5 clock for clock, and one
result is significantly better.
So, either wide in-order is indeed not a bright idea, or the 21164 suffers
because of the inferiority of the Alpha ISA relative to ARM64.
Hard to tell from these results. In addition to the problems
mentioned above there are also differences in cache configuration to
consider. And the A55 does quite a bit better in IPC than the A53,
although it superficially has the same resources.
On 12/30/2025 11:10 AM, BGB wrote:
[...]
But, then again, weak model is cheaper to implement and generally
faster, although explicit synchronization is annoying and such a model
is incompatible with "lock free data structures" (which tend to
implicitly assume that memory accesses occur in the same order as
written and that any memory stores are immediately visible across
threads).
Fwiw, a weak memory model is totally compatible with lock-free data structures. A weak model tends to have the necessary memory barriers to
make them work. Have you ever used a SPARC in RMO mode? Acquire membar
ala std::memory_order_acquire is basically a MEMBAR #LoadStore |
#LoadLoad. A release is MEMBAR #LoadStore | #StoreStore. Those can be
used for the implementation of a mutex. Notice how acquire and release
never need #StoreLoad ordering?
The point is that once we have this flexibility, a lock/wait free algo
can use the right membars for the job. Ideally, the weakest membars they
can use to ensure they are correct in their logic.
[...]
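To illustrate the point with portable C11 atomics rather than SPARC MEMBARs, here is a minimal spinlock sketch (the type and function names are mine): lock uses only acquire ordering and unlock only release ordering, so no #StoreLoad-style barrier is needed anywhere.

    #include <stdatomic.h>

    typedef struct { atomic_flag f; } spinlock_t;   /* init with { ATOMIC_FLAG_INIT } */

    static inline void spin_lock(spinlock_t *l)
    {
        /* acquire ~ MEMBAR #LoadLoad | #LoadStore after the winning test-and-set */
        while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
            ;   /* spin */
    }

    static inline void spin_unlock(spinlock_t *l)
    {
        /* release ~ MEMBAR #LoadStore | #StoreStore before the clearing store */
        atomic_flag_clear_explicit(&l->f, memory_order_release);
    }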
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
"All swans are white" has been "experimentally verified" by finding one
white swan. The existence of black swans shows that such
"experimental verifications" are fallacies.
If you have any counter-examples, please feel free to cite them.
The fact that nobody had counterexamples for the theory "All swans are
white" for a long while did not make that theory true.
But I have actually seen black swans in Australia and elsewhere.
Here's a picture
<https://en.wikipedia.org/wiki/File:Black_Swan_in_Flight_Crop.jpg>
Actually the existence of a weak-memory-model mode on specific
hardware makes it very likely that TSO is slower than the weak model
on that hardware. If TSO was implemented to provide the same speed,
there would be no need for also providing a weaker memory ordering
mode on that hardware.
That makes little sense. TSO fulfills all the requirements of the
ARM memory model, and adds some on top. It is possible to create
an ARM CPU which uses TSO, as you wrote below. If that had
the same performance as a CPU running on the pure ARM memory model,
there would be no reason to implement the ARM memory model at all.
Exactly. So you will only have a TSO mode and a weak mode, if, for
your particular implementation, the weak mode provides a performance advantage over the TSO mode. So it is unlikely that you ever see an implementation with such a mode bit where TSO mode has the same
performance, unless the mode bit is there only for backwards
compatibility and actually has no effect (what TRAPB turned into on
the 21264).
We have ARM implementations with TSO without mode bit. Do you really
need to add a mode bit to them that has no effect and therefore
produces 0% difference in performance to accept them as
counterexample?
So, two alternatives:
a) Apple engineers did not have the first clue what they were doing
b) Apple engineers knew what they were doing
My theory is that Apple engineers first implemented the weak model,
because that's what the specification says, and they were not tasked
to implement something better; there is enough propaganda for weak
memory models around that a hardware engineer might think that that's
the way to go. Later (for the M1) the Apple hardware designers were
asked to implement TSO, and they did not redo the whole memory model
from the ground up, but by doing relatively small changes. And of
course the result is that their TSO mode is slower than their weak
mode, just as -mieee-float on a 21164 is slower than code compiled
without that mode.
Given that Apple silicon seems to be competently done
The 21164 was the fastest CPU during some of its days. And yet the
21264 was faster without needing TRAPB.
I personally
think that option b) is the better one. You obviously prefer
option a).
An absurdity typical of you.
Note that "some Nvidia and Fujitsu [ARM architecture] implementations
run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
that the Fujitsu implementations of the ARM architecture are used in
supercomputers, it is unlikely that their TSO implementation is
cycle-wasteful.
So, you're saying that Apple is clueless, and that Nvidia and Fujitsu
make that decision solely on the basis of speed?
I did not write anything about the clue of Apple. I don't know much
about the CPUs by Nvidia and Fujitsu. But if there was significant performance to be had by adding a weakly-ordered mode, wouldn't
especially Fujitsu with its supercomputer target have done it?
If hardware designers get tasked with implementing TSO (or sequential consistency) with modern transistor budgets, they hopefully come up
with different solutions to various problems than if they are tasked
to do a weak model with a slow TSO option.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
On Tue, 30 Dec 2025 11:13:37 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
Do you happen to have benchmarks that compare performance of
Alpha EV5 vs in-order Cortex-A ?
LaTeX benchmark results (lower is better):
Alpha:
- 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5)               8.1
ARM A32/T32:
- Raspberry Pi 3, Cortex A53 1.2GHz, Raspbian 8                   5.46
ARM A64:
- Rockpro64 (1416MHz Cortex A53), Debian 9 (Stretch)              3.24
- Odroid N2 (1896MHz Cortex A53), Ubuntu 18.04                    2.488
- Odroid C2 (1536MHz Cortex A53), Ubuntu 16.04                    2.32
- Rock 5B (1805MHz A55), Debian 11 (texlive-latex-recommended)    2.105
A problem with the LaTeX benchmark is that its performance is
significantly influenced by the LaTeX installation (newer versions
need more instructions, and having more packages needs more
instructions). But these are the only benchmark results I have.
- anton
Thank you.
Two 64-bit A53 results are about the same as EV5 clock for clock, and one
result is significantly better.
So, either wide in-order is indeed not a bright idea, or the 21164 suffers
because of the inferiority of the Alpha ISA relative to ARM64.
Hard to tell from these results. In addition to the problems
mentioned above there are also differences in cache configuration to
consider. And the A55 does quite a bit better in IPC than the A53,
although it superficially has the same resources.
The A5x cores are pretty old now. Do you have any results for
Neoverse-V2, at say 2Ghz?
EricP <ThatWouldBeTelling@thevillage.com> posted:
Michael S wrote:
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning
avoidance of RFO (read for ownership) which is an elephant in the
room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to
overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
With weaker MOM the core has option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like Neoverse
N1, are able to do it.
I can imagine heroic microarchitecture that achieves the same effect
with TSO, but it seems that so far nobody did it.
I don't see how a ReadForOwnership message can be avoided as it
transfers two things: the ownership state, and the current line data.
InvalidateForOwnership {i.e., CI followed by Allocate Cache line and
start writing}
Even if the core knows the whole cache line is being overwritten and
doesn't need the line data, it still needs the Owned state transfer.
Which it can get by telling everyone else to lose that cache line.
There would still be a request message, say TakeOwner TKO which
has a smaller reply GiveOwner GVO message and just moves the state.
So the reply is a few less flits.
As I understand it...
Independent of the ReadForOwnership message, the ARM weak coherence model
should allow stores to other cache lines to proceed, whereas TSO would
require younger stores to (appear to) wait until the older store completes.
Which is why TSO is cycle wasteful.
The weak coherence model allows the cache to use hit-under-miss for
stores because it doesn't require the store order to different locations
be seen in program order. This allows it to overlap younger store cache
hits with the older ReadForOwnership message, not eliminate it.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
"All swans are white" has been "experimentally verified" by finding one >>>> white swan. The existence of black swans shows that such
"experimental verifications" are fallacies.
If you have any counter-examples, please feel free to cite them.
The fact that nobody had counterexamples for the theory "All swans are
white" for a long while did not make that theory true.
But I have actually seen black swans in Australia and elsewhere.
Here's a picture: <https://en.wikipedia.org/wiki/File:Black_Swan_in_Flight_Crop.jpg>
Your black swan analogy is a red herring. There is no theory of
evolution that says swans have to be white, and this is indeed a
very rare color for water birds.
One would argue that maybe prefixes are themselves wonky, but otherwise
one needs:
Instructions that can directly encode the presence of large immediate values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are effectively retro-causal, and for any fetch to be safe at the end of a
cache line one would need to prove the non-existence of a suffix; so
better to not go there).
For the most part, superscalar works the same either way, with similar efficiency. There is a slight efficiency boost if it would be possible
to dynamically reshuffle ops during fetch. But, this is not currently a thing in my case.
This latter case would apply if, say, a MEM op is followed by non-
dependent ALU ops, which under current superscalar handling they will
not co-execute, but it could be possible in theory to swap the ops and
allow them to co-execute.
...
- anton
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16-bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16 bits.
A bit of statistics on that.
The % associations you measured above might just be coincidence.
Using a primitive Perl script to catch occurrences, on a recent
My 66000 compiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
I have assumed for a compiler to choose between two instruction formats,
a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
that the register allocator would check if either operand was alive after
the OP, and if not then that source register can be reused as the dest.
For some ISA that may allow a shorter instruction format to be used.
Compilers will try to re-use registers as much as possible, in
other words, to avoid dead registers. If the compiler determines
that, for the pseudo registers V1, V2 and V3,
V1 = V2 - V3;
V2 is no longer live after that statement, it will assign
the same hard register to V1 and V2 (unless there are other
considerations such as function return values) which will then
either be translated into
add r1,r1,-r2
for a three-register instruction, or, for example, into
subq %rsi, %rax
Hmm... thinking of the statistics above, maybe I should have
included the minus signs.
Your stats above assume the compiler is performing this optimization
but since My 66000 does not have short format instructions the compiler
would have no reason to do so. Or the compiler might be doing this
optimization anyways for other ISA such as x86/x64 which do have
shorter formats.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3-register formats like RV, where there
is an incentive to do this optimization, might provide stats confirmation.
RISC-V compressed mode also uses three-bit register numbers for
popular registers, all of which complicates decoding and causes
other problems which Mitch has explained previously.
So yes, a My 66000-like instruction set with compression might be
possible, but would almost certainly not be realized.
On Tue, 30 Dec 2025 17:27:22 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I did not write anything about the clue of Apple. I don't know much
about the CPUs by Nvidia and Fujitsu. But if there was significant
performance to be had by adding a weakly-ordered mode, wouldn't
especially Fujitsu with its supercomputer target have done it?
Fujitsu had a very strong reason to implement TSO on A64FX - source-level
compatibility with SPARC64 VIIIfx and XIfx.
I wouldn't be surprised if apart from that they have SPARC->ARM Rosetta
for some customers, but that's relatively minor factor. Supercomputer
users are considered willing to recompile their code. But much less
willing to re-write it.
Besides, as I mentioned in my other post, the A64FX memory subsystem is slow
(latency-wise; throughput-wise it is very good).
I don't know what
influence that fact has, but I can hand-wave that it shifts the balance
of cost toward TSO.
Also, cache lines are unusually wide (256B), so it
is possible that RFO shortcuts allowed by weaker MOM are less feasible.
Anton Ertl [2025-12-30 17:15:59] wrote:
Any sequence of stores without intervening loads can be turned into
one store under sequential consistency, and therefore also under the
weaker TSO. Doing that for a sequence that stores into one cache line
does not appear particularly heroic to me. The question is how much
benefit one gets from this optimization.
But the stores may be interleaved with loads from other locations!
It's quite common to have a situation where a sequence of stores
initializes a new object and thus overwrites a complete cache line, but
that initialization sequence needs to read from memory (e.g. from the
stack).
Maybe compilers can be taught to group such writes to try and avoid
the problem?
E.g., collect the loads into locals and then emit the stores back-to-back
(x being the object under initialization):
    a = y->a;
    ...
    b = y->b;
    x->a = a;
    x->b = b;
scott@slp53.sl.home (Scott Lurndal) writes:
Do you have any results for
Neoverse-V2, at say 2Ghz?
No. You can find other results at
https://www.complang.tuwien.ac.at/franz/latex-bench
- anton
EricP <ThatWouldBeTelling@thevillage.com> posted:
John Savard wrote:
On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
Not a typo--the part of the pipeline which is <dynamically> narrowest is
the part that limits performance. I suggest strongly that you should not
make/allow the decoder to play that part.
I agree - and strongly, too - that the decoder ought not to be the part
that limits performance.
But what I quoted says that the execution unit ought not to be the part
that limits performance, with the implication that it's OK if the decoder
does instead. That's why I said it must be a typo.
So I think you need to look a second time at what you wrote; it's natural
for people to see what they expect to see, and so I think you looked at
it, and didn't see the typo that was there.
John Savard
There are two kinds of stalls:
stalls in the serial front end I-cache, Fetch or Decode stages because
of *too little work* (starvation due to input latency),
and stalls in the back end Execute or Writeback stages because
of *too much work* (resource exhaustion).
DECODE latency increases when:
a) there is no instruction(s) to decode
b) there is no address from which to fetch
c) when there is no translation of the fetch address
a) is a cache miss
b) is an indirect control transfer
c) is a TLB miss
And there may be additional cases of instruction buffer hiccups.
The front end stalls inject bubbles into the pipeline,
whereas back end stalls can allow younger bubbles to be compressed out.
How In-Order your thinking is. GBOoO machines do not inject bubbles.
If I have to stall, I want it in the back end.
If I have to stall I want it based on "realized" latency.
It has to do with catching up after a stall.
Which is why you do not inject bubbles...
If a core stalls for 3 clocks, then in order to average 1 IPC
it must retire 2 instructions per clock for the next 3 clocks.
And it can only do that if it has a backlog of work ready to execute.
The packing algorithm for RV or similar is more complicated because
it uses different register set sizes, RV is 31 or 7 with a zero reg.
Looking at the "18 RISC-V Compressed ISA V1.9" specification
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
Any OoO machine is also likely to have a lot of RAM and a decent sized
I$, so much of any benefit is likely to go away in this case.
s/go away/greatly ameliorated/
------------------------
ILP is a property of a program. I assume that what you mean is that
the IPC benefits of more width have quickly diminishing returns on in-order machines.
The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is affected by the processor implementation.
I agree that ILP is more aligned with code than with program.
{see above example where 1 instruction does the work of 5}
Michael S <already5chosen@yahoo.com> writes:
On Tue, 30 Dec 2025 17:27:22 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I did not write anything about the clue of Apple. I don't know
much about the CPUs by Nvidia and Fujitsu. But if there was
significant performance to be had by adding a weakly-ordered mode,
wouldn't especially Fujitsu with its supercomputer target have
done it?
Fujitsu had very strong reason to implement TSO on A64FX -
source-level compatibility with SPARC64 VIIIfx and XIfx.
Apple also had good reason to implement TSO on M1: AMD64->ARM A64
binary translation (Rosetta). They chose to add a slower TSO mode to
their weak memory system, which is not surprising given that they had
a working weak memory system, and it is relatively easy to implement
TSO on that (with a performance penalty).
I wouldn't be surprised if apart from that they have SPARC->ARM
Rosetta for some customers, but that's a relatively minor factor.
>Supercomputer users are considered willing to recompile their code.
But much less willing to re-write it.
While supercomputer users may not be particularly willing to rewrite
their code, they are much more willing than anyone else, because in supercomputing, hardware cost still is higher than software cost.
If there was an easy way to offer "5-10% more performance" to those
users willing to write or use software written for weak memory models
by adding a weak memory mode to A64FX, I would be very surprised if
they would have passed. So I conclude that it's not easy to turn
their memory model into a weak one and gain performance.
Concerning their SPARC implementations: The SPARC architecture
specifies both TSO and a weak memory model. Does your comment about
SPARC64 VIIIfx and XIfx mean that Fujitsu only implemented TSO on
those CPUs and that when you asked for the weak mode on those CPUs,
you still got TSO? That would be the counterexample that Thomas
Koenig asked for.
Besides, as I mentioned in my other post, A64fx memory subsystem is
slow (latency-wise, throughput wise it is very good).
Sounds to me like it is designed for a supercomputer.
I don't know what
influence that fact has, but I can hand-wave that it shifts the
balance of cost toward TSO.
Can you elaborate on that?
Also, cache lines are unusually wide (256B), so it
is possible that RFO shortcuts allowed by weaker MOM are less
feasible.
Why should that be?
A particular aspect here is that RFO is rare in applications with good temporal locality. Supercomputer applications tend to have relatively
bad temporal locality and will see RFO more often.
- anton
Stefan Monnier <monnier@iro.umontreal.ca> writes:
Anton Ertl [2025-12-30 17:15:59] wrote:
Any sequence of stores without intervening loads can be turned into
one store under sequential consistency, and therefore also under the
weaker TSO. Doing that for a sequence that stores into one cache line
does not appear particularly heroic to me. The question is how much
benefit one gets from this optimization.
>But the stores may be interleaved with loads from other locations!
>It's quite common to have a situation where a sequence of stores
>initializes a new object and thus overwrites a complete cache line, but
>that initialization sequence needs to read from memory (e.g. from the
>stack).
>Maybe compilers can be taught to group such writes to try and avoid
>the problem?
What compilers can do depends on the programming language. But this
"read from stack" idea is curious. We have had good register
allocators for several decades, so local variables tend to reside in registers, not on some stack. Parameters also tend to reside in
registers. So if you have a C initializing function
void init_foo(foo_t *foo, long a, long b, /* ... */)
{
  size_t i;
  foo->a = a;
  foo->b = b;
  for (i=0; i<FOO_C_ELEMS; i++)
    foo->c[i] = 0;
}
it is unlikely that there will be loads between the stores.
In other cases you can reorder the loads and stores by hand. Instead of

  x->a = y->a;
  ...
  x->b = y->b;

you can do

  long a = y->a;
  long b = y->b;
  ...
  x->a = a;
  ...
  x->b = b;
This kind of reordering only needs to be performed where it eliminates
many RFOs, and is much easier to get correct than not-too-slow code
for weak memory models that has to be done everywhere where shared
written-to memory is accessed and has to be correct everywhere (and not-too-slow everywhere that is executed frequently).
- anton
MitchAlsup wrote:
EricP <ThatWouldBeTelling@thevillage.com> posted:
John Savard wrote:
On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
>>> I agree - and strongly, too - that the decoder ought not to be the part
>>> that limits performance.
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
>>>> Not a typo--the part of the pipeline which is <dynamically> narrowest is
>>>> the part that limits performance. I suggest strongly that you should not
>>>> make/allow the decoder to play that part.
>>>>>> Or in other words, if you can decode K-instructions per cycle, you'd
>>>>>> better be able to execute K-instructions per cycle--or you have a
>>>>>> serious blockage in your pipeline.
>>> But what I quoted says that the execution unit ought not to be the part
>>> that limits performance, with the implication that it's OK if the decoder
>>> does instead. That's why I said it must be a typo.
>>> So I think you need to look a second time at what you wrote; it's natural
>>> for people to see what they expect to see, and so I think you looked at
>>> it, and didn't see the typo that was there.
>>> John Savard
There are two kinds of stalls:
stalls in the serial front end I-cache, Fetch or Decode stages because
of *too little work* (starvation due to input latency),
and stalls in the back end Execute or Writeback stages because
of *too much work* (resource exhaustion).
DECODE latency increases when:
a) there is no instruction(s) to decode
b) there is no address from which to fetch
c) when there is no translation of the fetch address
a) is a cache miss
b) is an indirect control transfer
c) is a TLB miss
And there may be additional cases of instruction buffer hiccups.
Yes. Also Decode generated stalls - pipeline drain.
Rename stall for new dest register pool exhaustion.
The front end stalls inject bubbles into the pipeline,
whereas back end stalls can allow younger bubbles to be compressed out.
How In-Order your thinking is. GBOoO machines do not inject bubbles.
You get bubbles if you overload their resources no matter how GB it is.
For example, if all the reservation stations for a FU are in use then Dispatch has to stall, which stalls the whole front end.
A compacting pipeline in the front end can compress out those bubbles
but it eventually stalls too.
Dependency stalls - all the uOps in reservation stations are waiting
on other results. Serialization stalls.
If a design is doing dynamic register file read port assignment and
runs out of read ports. Resource exhaustion stalls.
Multiple uOps are ready but only one can launch. Scheduling stalls.
If I have to stall, I want it in the back end.
If I have to stall I want it based on "realized" latency.
It has to do with catching up after a stall.
Which is why you do not inject bubbles...
It's not me doing it. I blame the speed of light.
If a core stalls for 3 clocks, then in order to average 1 IPC
it must retire 2 instructions per clock for the next 3 clocks.
And it can only do that if it has a backlog of work ready to execute.
Michael S <already5chosen@yahoo.com> writes:
On Tue, 30 Dec 2025 17:27:22 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I did not write anything about the clue of Apple. I don't know
much about the CPUs by Nvidia and Fujitsu. But if there was
significant performance to be had by adding a weakly-ordered mode,
wouldn't especially Fujitsu with its supercomputer target have
done it?
Fujitsu had very strong reason to implement TSO on A64FX -
source-level compatibility with SPARC64 VIIIfx and XIfx.
BTW, can you find a proof link for A64FX being TSO.
My understanding is that you learned it from Jonathan Corbet who in turn
learned it from Hector Martin.
But what is the source of Hector Martin?
I certainly don't see it in Fujitsu's "A64FX Microarchitecture Manual"
or in the Datasheet.
Michael S <already5chosen@yahoo.com> writes:
BTW, can you find a proof link for A64FX being TSO.
My understanding is that you learned it from Jonathan Corbet who in
turn learned it from Hector Martin.
Correct.
But what is the source of Hector Martin?
I certainly don't see it in Fujitsu's "A64FX Microarchitecture
Manual" or in the Datasheet.
In that case, I would ask Hector Martin, if I wanted proof.
- anton
I don't know Hector Martin. Is he on Usenet?
On 12/30/2025 4:58 PM, Chris M. Thomasson wrote:
On 12/30/2025 11:10 AM, BGB wrote:
[...]
But, then again, weak model is cheaper to implement and generally
faster, although explicit synchronization is annoying and such a
model is incompatible with "lock free data structures" (which tend to
implicitly assume that memory accesses occur in the same order as
written and that any memory stores are immediately visible across
threads).
Fwiw, a weak memory model is totally compatible with lock-free data
structures. A weak model tends to have the necessary memory barriers
to make them work. Have you ever used a SPARC in RMO mode? Acquire
membar ala std::memory_order_acquire is basically a MEMBAR #LoadStore
| #LoadLoad. A release is MEMBAR #LoadStore | #StoreStore. Those can
be used for the implementation of a mutex. Notice how acquire and
release never need #StoreLoad ordering?
The point is that once we have this flexibility, a lock/wait free algo
can use the right membars for the job. Ideally, the weakest membars
they can use to ensure they are correct in their logic.
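As a concrete sketch of that acquire/release point (not from the thread; a minimal C++ illustration using std::atomic, with hypothetical class and function names), a test-and-set spinlock needs nothing stronger than acquire on the way in and release on the way out:

  #include <atomic>

  // Minimal test-and-set spinlock: acquire ordering to take the lock,
  // release ordering to drop it; no #StoreLoad-style full fence anywhere.
  class spinlock {
      std::atomic<bool> locked{false};
  public:
      void lock() {
          // acquire: loads/stores in the critical section cannot be
          // hoisted above the point where the lock is observed as taken
          while (locked.exchange(true, std::memory_order_acquire)) {
              // spin; a real implementation would back off or yield here
          }
      }
      void unlock() {
          // release: loads/stores in the critical section cannot sink
          // below the store that frees the lock
          locked.store(false, std::memory_order_release);
      }
  };

That lines up with the MEMBAR description above: the acquire side is #LoadStore | #LoadLoad, the release side is #LoadStore | #StoreStore, and #StoreLoad is never required on this path.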
Usually IME the people writing lock-free code don't use memory barriers
or similar though. A lot of times IME, it is people just using volatile
or similar and trying to write things in a way that it (hopefully) won't
go terribly wrong if two threads hit the same data at the same time.
Like, the sort of code that works on a PC running Windows or similar,
but try to port it to Linux on an ARM machine, and it explodes.
Where, say, using volatile isn't sufficient for multiple cores with a
weak model. One would need either to use barriers (though, in my case, barriers will also be slow), non-cached memory accesses, or explicit cache-line flushing.
In this case, this leaves it often preferable to use bulk mostly read-
only data sharing. Or, passing along data via buffers or messages (with
some level of basic flow control).
So, not so much "let's have two threads share a doubly-linked list and
hope it doesn't all turn into a train wreck", and more "will copy
messages onto the end of a circular buffer and advance the roving
pointers; manually flushing the lines corresponding to the parts of the
buffer that have been updated in the process".
Say, for example:
  void _flushbuffer(void *data, size_t sz)
  {
    char *ct, *cte;
    ct=data; cte=ct+sz;
    while(ct<cte)
      { __mem_flushaddr(ct); ct+=LINESIZE; }
  }
  void _memcpyout_flush(void *dst, void *src, size_t sz)
  {
    memcpy(dst, src, sz);
    _flushbuffer(dst, sz);
  }
  void _memcpyin_flush(void *dst, void *src, size_t sz)
  {
    _flushbuffer(src, sz);
    memcpy(dst, src, sz);
  }
  void _memcpy_flush(void *dst, void *src, size_t sz)
  {
    _flushbuffer(src, sz);
    memcpy(dst, src, sz);
    _flushbuffer(dst, sz);
  }
Where, in this case, normal memcpy + flushing is likely to be faster in
many cases than using non-cached memory.
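To make the "copy messages onto a circular buffer, flush, then advance the pointer" idea above concrete, here is a hypothetical sketch. The ring layout, names, and sizes (msg_ring, ring_send, MSG_SIZE, RING_SLOTS) are mine, not part of the code above; it reuses the _memcpyout_flush/_flushbuffer helpers from the snippet, and it assumes one producer, one consumer, and that flushing a line (in program order) is what publishes it to the other core on this sort of weak-model machine:

  #include <stddef.h>

  /* from the snippet above */
  void _flushbuffer(void *data, size_t sz);
  void _memcpyout_flush(void *dst, void *src, size_t sz);

  #define MSG_SIZE   64
  #define RING_SLOTS 256

  typedef struct {
      char data[RING_SLOTS][MSG_SIZE];
      volatile unsigned head;   /* written by the producer */
      volatile unsigned tail;   /* written by the consumer */
  } msg_ring;

  /* returns 0 if the ring is full, 1 if the message was queued */
  int ring_send(msg_ring *r, void *msg)
  {
      unsigned h = r->head, t = r->tail;
      if (((h + 1) % RING_SLOTS) == t)
          return 0;
      /* copy the payload and flush the lines it touches */
      _memcpyout_flush(r->data[h], msg, MSG_SIZE);
      r->head = (h + 1) % RING_SLOTS;
      /* flush the line holding the index only after the payload lines,
         so (assuming flushes go out in program order) the consumer never
         sees the new head before the new data */
      _flushbuffer((void *)&r->head, sizeof(r->head));
      return 1;
  }

The consumer side would mirror this with _memcpyin_flush (flush/invalidate before reading), per the helpers above.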
On 12/30/2025 12:08 PM, MitchAlsup wrote:
EricP <ThatWouldBeTelling@thevillage.com> posted:
Michael S wrote:
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
>>>>>>>> My 66000 does not have a TSO memory system, but when one of these
>>>>>>>> things shows up, it goes sequential consistency, and when it is
>>>>>>>> done it flips back to causal consistency.
>>>>>>>> TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
>>>>>> WOW, they wrote article of 7 pages without even one time mentioning
>>>>>> avoidance of RFO (read for ownership) which is an elephant in the
>>>>>> room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to
overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
With weaker MOM the core has option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like
Neoverse
N1, are able to do it.
I can imagine heroic microarchitecture that achieves the same effect
with TSO, but it seems that so far nobody did it.
I don't see how a ReadForOwnership message can be avoided as it
transfers two things: the ownership state, and the current line data.
InvalidateForOwnership {i.e., CI followed by Allocate Cache line and
start writing}
Even if the core knows the whole cache line is being overwritten and
doesn't need the line data, it still needs the Owned state transfer.
Which it can get by telling everyone else to lose that cache line.
There would still be a request message, say TakeOwner TKO which
has a smaller reply GiveOwner GVO message and just moves the state.
So the reply is a few less flits.
As I understand it...
Independent of the ReadForOwnership message, the ARM weak coherence
model
should allow stores to other cache lines to proceed, whereas TSO would
require younger stores to (appear to) wait until the older store
completes.
Which is why TSO is cycle wasteful.
Wrt to TSO aka x86/x64..., even that is NOT strong enough to get
#StoreLoad, aka ordering a store followed by a load to another location.
You need a LOCK'ed RMW or the MFENCE instruction.
The weak coherence model allows the cache to use hit-under-miss for
stores because it doesn't require the store order to different locations
be seen in program order. This allows it to overlap younger store cache
hits with the older ReadForOwnership message, not eliminate it.
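For a concrete picture of the whole-line-overwrite case Michael S describes above (purely illustrative; the names are mine), a cache-line-sized record that a core rewrites completely, never reading the old contents, is exactly where fetching the stale line via RFO is pure overhead, and where a weaker model gives the core the most freedom to merge the narrow stores into one full-line write:

  #include <cstdint>

  // 64-byte, cache-line-aligned record, fully overwritten by 8 narrow stores
  struct alignas(64) record {
      std::int64_t f[8];
  };

  void rewrite(record *r, std::int64_t v)
  {
      for (int i = 0; i < 8; i++)
          r->f[i] = v + i;   // covers all 64 bytes; *r is never loaded
  }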
On 12/30/2025 3:51 PM, Chris M. Thomasson wrote:
On 12/30/2025 12:08 PM, MitchAlsup wrote:
EricP <ThatWouldBeTelling@thevillage.com> posted:
Michael S wrote:
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
>>>>>>>>> My 66000 does not have a TSO memory system, but when one of these
>>>>>>>>> things shows up, it goes sequential consistency, and when it is
>>>>>>>>> done it flips back to causal consistency.
>>>>>>>>> TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
>>>>>>> WOW, they wrote article of 7 pages without even one time mentioning
>>>>>>> avoidance of RFO (read for ownership) which is an elephant in the
>>>>>>> room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially >>>>> did not own. Under TSO rules, like x86, the only [non heroic] ways to >>>>> overwrite the line without reading its previous content (which could >>>>> easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
With weaker MOM the core has option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like
Neoverse
N1, are able to do it.
I can imagine heroic microarchitecture that achieves the same effect >>>>> with TSO, but it seems that so far nobody did it.
I don't see how a ReadForOwnership message can be avoided as it
transfers two things: the ownership state, and the current line data.
InvalidateForOwnership {i.e., CI followed by Allocate Cache line and
start writing}
Even if the core knows the whole cache line is being overwritten and
doesn't need the line data, it still needs the Owned state transfer.
Which it can get by telling everyone else to lose that cache line.
There would still be a request message, say TakeOwner TKO which
has a smaller reply GiveOwner GVO message and just moves the state.
So the reply is a few less flits.
As I understand it...
Independent of the ReadForOwnership message, the ARM weak coherence
model
should allow stores to other cache lines to proceed, whereas TSO would >>>> require younger stores to (appear to) wait until the older store
completes.
Which is why TSO is cycle wasteful.
Wrt to TSO aka x86/x64..., even that is NOT strong enough to get
#StoreLoad, aka ordering a store followed by a load to another
location. You need a LOCK'ed RMW or the MFENCE instruction.
The weak coherence model allows the cache to use hit-under-miss for
stores because it doesn't require the store order to different
locations
be seen in program order. This allows it to overlap younger store cache >>>> hits with the older ReadForOwnership message, not eliminate it.
Think of something simple. You want to publish a pointer to other threads.
foo* global = nullptr;
producer:
  foo* p = create();
  membar_release();
  atomic_store(&global, p);
consumers:
  foo* p = atomic_load(&global);
  if (p)
  {
    membar_acquire();
    p->bar();
  }
A simple pattern...
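Spelled out in standard C++ (a sketch; foo/create/bar are taken from the pseudo-code above, the rest is just std::atomic, and the release store folds the separate membar_release + atomic_store pair into one operation):

  #include <atomic>

  struct foo { void bar(); };
  foo *create();

  std::atomic<foo *> global{nullptr};

  void producer()
  {
      foo *p = create();
      // release: everything that built *p is visible before the pointer is
      global.store(p, std::memory_order_release);
  }

  void consumer()
  {
      foo *p = global.load(std::memory_order_acquire);
      if (p)
          p->bar();   // the acquire load orders this after seeing p
  }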
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then >>>>> reduces ILP due to register conflicts. So, smaller code at the expense >>>>> of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a
compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO
implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And
designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
It is only 2 words
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
This should be::
AUPIC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,Ri
LDD R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum
Any OoO machine is also likely to have a lot of RAM and a decent sized
I$, so much of any benefit is likely to go away in this case.
s/go away/greatly ameliorated/
------------------------
ILP is a property of a program. I assume that what you mean is that
the IPC benefits of more width have quickly diminishing returns on
in-order machines.
The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is affected by the processor implementation.
I agree that ILP is more aligned with code than with program.
{see above example where 1 instruction does the work of 5}
On 12/30/2025 9:21 PM, BGB wrote:
On 12/30/2025 4:58 PM, Chris M. Thomasson wrote:
On 12/30/2025 11:10 AM, BGB wrote:
[...]
But, then again, weak model is cheaper to implement and generally
faster, although explicit synchronization is annoying and such a
model is incompatible with "lock free data structures" (which tend
to implicitly assume that memory accesses occur in the same order as
written and that any memory stores are immediately visible across
threads).
Fwiw, a weak memory model is totally compatible with lock-free data
structures. A weak model tends to have the necessary memory barriers
to make them work. Have you ever used a SPARC in RMO mode? Acquire
membar ala std::memory_order_acquire is basically a MEMBAR #LoadStore
| #LoadLoad. A release is MEMBAR #LoadStore | #StoreStore. Those can
be used for the implementation of a mutex. Notice how acquire and
release never need #StoreLoad ordering?
The point is that once we have this flexibility, a lock/wait free
algo can use the right membars for the job. Ideally, the weakest
membars they can use to ensure they are correct in their logic.
Usually IME the people writing lock-free code don't use memory
barriers or similar though. A lot of times IME, it is people just
using volatile or similar and trying to write things in a way that it
(hopefully) won't go terribly wrong if two threads hit the same data at
the same time.
Like, the sort of code that works on a PC running Windows or similar,
but try to port it to Linux on an ARM machine, and it explodes.
Where, say, using volatile isn't sufficient for multiple cores with a
weak model. One would need either to use barriers (though, in my case,
barriers will also be slow), non-cached memory accesses, or explicit
cache-line flushing.
In this case, this leaves it often preferable to use bulk mostly read-
only data sharing. Or, passing along data via buffers or messages
(with some level of basic flow control).
So, not so much "let's have two threads share a doubly-linked list and
hope it doesn't all turn into a train wreck", and more "will copy
messages onto the end of a circular buffer and advance the roving
pointers; manually flushing the lines corresponding to the parts of
the buffer that have been updated in the process".
Say, for example:
  void _flushbuffer(void *data, size_t sz)
  {
    char *ct, *cte;
    ct=data; cte=ct+sz;
    while(ct<cte)
      { __mem_flushaddr(ct); ct+=LINESIZE; }
  }
  void _memcpyout_flush(void *dst, void *src, size_t sz)
  {
    memcpy(dst, src, sz);
    _flushbuffer(dst, sz);
  }
  void _memcpyin_flush(void *dst, void *src, size_t sz)
  {
    _flushbuffer(src, sz);
    memcpy(dst, src, sz);
  }
  void _memcpy_flush(void *dst, void *src, size_t sz)
  {
    _flushbuffer(src, sz);
    memcpy(dst, src, sz);
    _flushbuffer(dst, sz);
  }
Where, in this case, normal memcpy + flushing is likely to be faster
in many cases than using non-cached memory.
Huh? Humm... This is way out of bounds. Yikes. I am talking about
knowing when to use the right membars in the right places. Are you at
least familiar with std::memory_order_* in C++?
On 2025-12-31 12:12 p.m., MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense >>>>> of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a >>>> compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO >>>> implementations. On the contrary, write-after-read may be beneficial >>>> by releasing the old physical register for the register name. And
designing a compressed CPU instruction set for in-order processing is >>>> not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
It is only 2 words
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
This should be::
AUPIC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,Ri
LDD R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum
An even fatter ISA (Qupls4) in theory:
LOAD r7, disp56(ip+r3*8)
1 instruction + 1 postfix = 2 words (96 bits) 1 cycle + cache hit minimum
The ISA is becoming a bit more stable now; the latest change was for constant postfix instructions. Qupls used to have a somewhat convoluted means of addressing constants on the cache-line. Now it's just
postfixes. The constant routing information is in the postfix now which
uses four bits. Two to select a register override, two to select
constant quadrant. So, postfixes extend constants in the instruction (or previous postfix) by 36 bits.
Qupls can do
ADD r7, r8, $64_bit_constant
Using only two words (96 bits) and just a single cycle.
I prefer to use multiply '*' rather than shift in scaled indexed addressing as a couple of CPUs had multiply by five and ten in addition
to 1,2,4,8. What if one wants to scale by 3?
It is also possible to encode 128-bit constants, but the current implementation does not support them.
Managed to get to some early synthesis trials and found the instruction dispatch to be on the critical timing path. I am a bit stumped as to how
to improve it as it is very simple already. It just copies from one set
of pipeline registers to another headed towards the reservation
stations. Tools report timing good to 37 MHz, I was shooting for at
least 40.
Found a couple of spots where the code was simple but too slow. One in dynamic register selection. The code was packing the register selections
to a minimum. But that was way too many logic levels.
It is quite an art to get something working in minimum clock cycles and
fast clock frequency.
Any OoO machine is also likely to have a lot of RAM and a decent sized >>> I$, so much of any benefit is likely to go away in this case.
s/go away/greatly ameliorated/
------------------------
ILP is a property of a program. I assume that what you mean is that >>>> the IPC benefits of more width have quickly diminishing returns on
in-order machines.
The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is effected by the processor implementation.
I agree that ILP is more aligned with code than with program.
{see above example where 1 instruction does the work of 5}
Robert Finch <robfi680@gmail.com> posted:
On 2025-12-31 12:12 p.m., MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller >>>>>>> register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense >>>>>>> of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to >>>>>> use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that >>>>>> occur in instructions where such a register allocation may lead to a >>>>>> compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO >>>>>> implementations. On the contrary, write-after-read may be beneficial >>>>>> by releasing the old physical register for the register name. And >>>>>> designing a compressed CPU instruction set for in-order processing is >>>>>> not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
It is only 2 words
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
This should be::
AUPIC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,Ri
LDD R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum
An even fatter ISA (Qupls4) in theory:
LOAD r7, disp56(ip+r3*8)
I could have shown the DISP64 version--3-words
1 instruction + 1 postfix = 2 words (96 bits) 1 cycle + cache hit minimum
The ISA is becoming a bit more stable now; the latest change was for
constant postfix instructions. Qupls used to have a somewhat convoluted
means of addressing constants on the cache-line. Now it's just
postfixes. The constant routing information is in the postfix now which
uses four bits. Two to select a register override, two to select
constant quadrant. So, postfixes extend constants in the instruction (or
previous postfix) by 36 bits.
Qupls can do
ADD r7, r8, $64_bit_constant
Using only two words (96 bits) and just a single cycle.
So can My 66000, but everyone and his brother thinks 96-bits is 3 words.
I prefer to use multiply '*' rather than shift in scaled indexed
addressing as a couple of CPUs had multiply by five and ten in addition
to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
It is also possible to encode 128-bit constants, but the current
implementation does not support them.
Managed to get to some early synthesis trials and found the instruction
dispatch to be on the critical timing path. I am a bit stumped as to how
to improve it as it is very simple already. It just copies from one set
of pipeline registers to another headed towards the reservation
stations. Tools report timing good to 37 MHz, I was shooting for at
least 40.
Found a couple of spots where the code was simple but too slow. One in
dynamic register selection. The code was packing the register selections
to a minimum. But that was way too many logic levels.
Those are some of the driving inputs to "An architecture is as much about what gets left out as what gets put in."
It is quite an art to get something working in minimum clock cycles and
fast clock frequency.
Any OoO machine is also likely to have a lot of RAM and a decent sized >>>>> I$, so much of any benefit is likely to go away in this case.
s/go away/greatly ameliorated/
------------------------
ILP is a property of a program. I assume that what you mean is that >>>>>> the IPC benefits of more width have quickly diminishing returns on >>>>>> in-order machines.
The ILP is a property of the code, yes, but how much exists, and how >>>>> much of it is actually usable, is effected by the processor implementation.
I agree that ILP is more aligned with code than with program.
{see above example where 1 instruction does the work of 5}
On 1/1/2026 12:13 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2025-12-31 12:12 p.m., MitchAlsup wrote:
An even fatter ISA (Qupls4) in theory:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
It is only 2 words
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller >>>>>>>> register space, one needs to reuse registers more frequently, >>>>>>>> which then
reduces ILP due to register conflicts. So, smaller code at the >>>>>>>> expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to >>>>>>> use the uncompressed instruction.-a So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that >>>>>>> occur in instructions where such a register allocation may lead to a >>>>>>> compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of >>>>>>> OoO
implementations.-a On the contrary, write-after-read may be
beneficial
by releasing the old physical register for the register name.-a And >>>>>>> designing a compressed CPU instruction set for in-order
processing is
not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to >>>>>> bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
      LDD    R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
      AUPIC  Rt,lo(DISP32)
      SLL    Ri,R3,#3
      ADD    Rt,Rt,hi(DISP32)
      ADD    Rt,Rt,Ri
      LDD    R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit
minimum.
This should be::
      AUPIC  Rt,hi(DISP32)
      SLL    Ri,R3,#3
      ADD    Rt,Rt,Ri
      LDD    R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum >>>
LOAD r7, disp56(ip+r3*8)
I could have shown the DISP64 version--3-words
At 64-bits, displacements cease to make sense as a displacement.
Seems to make more sense to interpret these as [Abs64+Rb] rather than [Rb+Disp64].
Except, then I have to debate what exactly I would do if I decide to
allow this case in XG2/XG3.
As noted:
  [Rb+Disp10]: Usually scaled (excluding some special-cases);
  [Rb+Disp33]: Either scaled or unscaled.
    BGBCC is typically using unscaled displacements in this case.
      Unscaled range, +/- 4GB
      DW: +/- 16GB, QW: +/- 32GB
    XG2 and XG3 effectively left 1 bit extra, which indicates scale.
      0: Scaled by element size;
      1: Unscaled.
  [Rb+Disp64]: Would be understood as unscaled.
    TBD: Scale register (more likely to be useful, breaks symmetry);
    Unscaled register, preserves symmetry, but less likely useful.
      Would be consistent with the handling of RISC-V,
        which is always unscaled in this case.
    May be moot, as plain Abs64 would be the dominant case here.
1 instruction + 1 postfix = 2 words (96 bits) 1 cycle + cache hit
minimum
The ISA is becoming a bit more stable now; the latest change was for
constant postfix instructions. Qupls used to have a somewhat convoluted
means of addressing constants on the cache-line. Now it's just
postfixes. The constant routing information is in the postfix now which
uses four bits. Two to select a register override, two to select
constant quadrant. So, postfixes extend constants in the instruction (or >>> previous postfix) by 36 bits.
Qupls can do
ADD r7, r8, $64_bit_constant
Using only two words (96 bits) and just a single cycle.
So can My 66000, but everyone and his brother thinks 96-bits is 3 words.
So can XG2 and XG3.
  And, now, can add RV+JX to this category.
Though, I am likely to still consider 96-bit ops as an extension of JX
(as supporting them would be a much bigger burden on a 2-wide machine
with a 64-bit instruction fetch; would require a 2-wide machine to still support 96-bit fetch).
Well, and then there is another issue:
RV64GC + 96-bit encodings, reintroduces another potential problem that existed in XG1:
At certain alignments, the 96-bit fetch can cross a boundary of 2 half-
line fetches with a 16B line size.
Say, one letter per 16-bit word:
  AAAA-BBBB  //Line A
  CCCC-DDDD  //Line B
Then (low 4b of PC):
  0: AAAABB
  2: AAABBB
  4: AABBBB
  6: ABBBBC //Violates two half-lines
  8: BBBBCC
  A: BBBCCC
  C: BBCCCC
  E: BCCCCD //Violates two half-lines
Granted, the partial workaround is to fetch 144 bits internally (16 bits
past the end of the half-line), which does technically "fix" the problem
as far as architecturally-visible behavior is concerned.
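The two offending rows in the table can be expressed as a one-line check (a hypothetical helper, assuming 16-byte lines fetched as two 8-byte halves and a 12-byte, i.e. 96-bit, instruction):

  // true when a 96-bit instruction starting at 'pc' spills past the two
  // 8-byte fetch units it starts in, i.e. the 0x6 and 0xE rows above
  static inline bool spans_extra_halfline(unsigned long pc)
  {
      return ((pc & 7) + 12) > 16;
  }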
Or, just use the same "small brain" trick that BGBCC had used:
If free-form variable length instructions, insert a NOP pad if we step
on this turd;
Or, for code sequences where this turd would be unavoidable (running
through the WEXifier): Realign to 32 bits before entering WEX encoding (scenario can't happen if 32-bit aligned).
Arguably, the latter scenario wouldn't have applied to RISC-V (and my JX
encodings), except that (very recently) I did end up expanding BGBCC's
WEXifier mechanism to cover RISC-V and XG3 (even if its role is slightly
different in this case), which technically reintroduces the issue when
targeting RV64GC.
Though, currently, it is only enabled for RV64 if using RV64G and speed optimization.
In this case, since RV64G and XG3 don't use explicit bundling, its role
is instead to shuffle instructions to try to optimize how they fit in
the pipeline.
I prefer to use multiply '*' rather than shift in scaled indexed
addressing as a couple of CPUs had multiply by five and ten in addition
to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
Higher resource cost and latency is a concern...
It is also possible to encode 128-bit constants, but the current
implementation does not support them.
Managed to get to some early synthesis trials and found the instruction
dispatch to be on the critical timing path. I am a bit stumped as to how >>> to improve it as it is very simple already. It just copies from one set
of pipeline registers to another headed towards the reservation
stations. Tools report timing good to 37 MHz, I was shooting for at
least 40.
Found a couple of spots where the code was simple but too slow. One in
dynamic register selection. The code was packing the register selections >>> to a minimum. But that was way too many logic levels.
Those are some of the driving inputs to "An architecture is as much about
what gets left out as what gets put in."
Some amount of stuff I had added has ended up getting pruned again.
In my case, the FPGA doesn't get bigger, nor faster.
Adding a feature in one place may mean needing to drop some other lesser-used feature to free up resources or improve timing.
Sometimes, the features don't get entirely removed from the ISA, but
instead become a sort of lesser used "secondary feature":
  May be supported in hardware;
  Or, may be supported with trap-and-emulate.
Sometimes the features dropped are subtle edge cases, say:
  For JAL and JALR in RISC-V:
    Is Xd ever anything other than X0 or X1?
    In theory, yes.
    In practice: Not so much.
Starts to make sense to hard-wire the hardware support to only allow X0
and X1 and to treat other cases as trap-and-emulate.
Or, some other stuff within RV64G:
In general, makes sense, but even within RV64G there is stuff that is debatable whether it makes sense to try to support natively in hardware.
Like, the ideal version of the RV64 ISA in HW might look like:
  RV64I:
    Limit Xd for JAL and JALR to X0 and X1;
      If not X0/X1, trap-emulate.
    Mostly, feature-set of 'I' is "mostly sane".
      Well, excluding the encoding space inefficiency of JAL/LUI/AUIPC.
    Cheaper impl, makes sense to hard-wire Bcc's Rs2 to X0;
      Absent compiler support, very bad with trap-emulate.
  M:
    MULW makes sense;
      Kinda need a DMULW/DMULWU (32-bit widening multiply)
    DIVW: Used enough to be justifiable.
    MUL/DIV/REM: 64 bit forms not used often enough to justify.
      But, also not quite rare enough for trap-and-emulate.
  A:
    Better relegated to trap-and-emulate.
  F:
    Kinda unavoidable
    Would have preferred a Zfmin+D style approach as minimal case.
  D:
    It is what it is;
    FDIV.D and FSQRT.D and similar can be turned into traps.
  Zicsr:
    Most cases can trap (except what HW actually needs to support);
  Zifence:
    Trap-and-emulate.
  ...
Almost making sense to have a sort of system-level trap-vector-table:
-a TVTB: Trap Vector Table Base;
-a JTVT: Jump to Trap-Vector-Table.
Table would primarily be used for trap-and-emulate, where it could
become desirable to have a 32-bit encoding here. These would be assumed illegal in the user level ISA.
One possibility for encoding could, ironically, be to overload JAL or similar:
  JAL  Disp20, X2  => JTVT Disp20
Branching to TVTB+Disp20 rather than to PC+Disp20 (and then stomping the
stack pointer with the link value; can safely assume that this case is
otherwise invalid...). Implicit assumption that one has under 1MB of
handler thunks.
But, dunno.
...
It is quite an art to get something working in minimum clock cycles and
fast clock frequency.
Any OoO machine is also likely to have a lot of RAM and a decent
sized
I$, so much of any benefit is likely to go away in this case.
s/go away/greatly ameliorated/
------------------------
ILP is a property of a program.-a I assume that what you mean is that >>>>>>> the IPC benefits of more width have quickly diminishing returns on >>>>>>> in-order machines.
The ILP is a property of the code, yes, but how much exists, and how >>>>>> much of it is actually usable, is effected by the processor
implementation.
I agree that ILP is more aligned with code than with program.
{see above example where 1 instruction does the work of 5}
On 1/1/2026 12:13 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2025-12-31 12:12 p.m., MitchAlsup wrote:
An even fatter ISA (Qupls4) in theory:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
It is only 2 words
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller >>>>>>> register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to >>>>>> use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that >>>>>> occur in instructions where such a register allocation may lead to a >>>>>> compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO >>>>>> implementations. On the contrary, write-after-read may be beneficial >>>>>> by releasing the old physical register for the register name. And >>>>>> designing a compressed CPU instruction set for in-order processing is >>>>>> not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to >>>>> bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against >>>
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
This should be::
AUPIC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,Ri
LDD R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum >>
LOAD r7, disp56(ip+r3*8)
I could have shown the DISP64 version--3-words
At 64-bits, displacements cease to make sense as a displacement.
Seems to make more sense to interpret these as [Abs64+Rb] rather than [Rb+Disp64].
Except, then I have to debate what exactly I would do if I decide to
allow this case in XG2/XG3.
------------------
I prefer to use multiply '*' rather than shift in scaled indexed
addressing as a couple of CPUs had multiply by five and ten in addition
to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
Higher resource cost and latency is a concern...
Sometimes the features dropped are subtle edge cases, say:
For JAL and JALR in RISC-V:
Is Xd ever anything other than X0 or X1?
In theory, yes.
In practice: Not so much.
On 2026-01-01 6:17 p.m., BGB wrote:
-----------
On 1/1/2026 12:13 PM, MitchAlsup wrote:
There is not much in the AGEN so the latency for a small multiplier is probably okay. I was thinking of supporting multiply by 3, handy for
RGB888 values, and multiply by six which is the size of an instruction. Maybe multiply by any value from 1 to 8 using three-bit encoding. It
would probably be okay to use another bit for scaling.
BGB <cr88192@gmail.com> posted:
On 1/1/2026 12:13 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2025-12-31 12:12 p.m., MitchAlsup wrote:
An even fatter ISA (Qupls4) in theory:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
It is only 2 words
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller >>>>>>>>> register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to >>>>>>>> use the uncompressed instruction. So you may tune your RISC-V >>>>>>>> compiler to prefer registers r8-r15 for those pseudo-registers that >>>>>>>> occur in instructions where such a register allocation may lead to a >>>>>>>> compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO >>>>>>>> implementations. On the contrary, write-after-read may be beneficial >>>>>>>> by releasing the old physical register for the register name. And >>>>>>>> designing a compressed CPU instruction set for in-order processing is >>>>>>>> not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to >>>>>>> bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against >>>>>
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
This should be::
AUPIC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,Ri
LDD R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum >>>>
LOAD r7, disp56(ip+r3*8)
I could have shown the DISP64 version--3-words
At 64-bits, displacements cease to make sense as a displacement.
Seems to make more sense to interpret these as [Abs64+Rb] rather than
[Rb+Disp64].
I have heard arguments in both directions::
a) DISP64 only contains 33-bits of actual information
b) If DISP64 is absolute do you still need Rbase ??
when you have Rindex<<scale
c) how can the HW KNOW ?!?
Except, then I have to debate what exactly I would do if I decide to
allow this case in XG2/XG3.
------------------
I prefer to use multiply '*' rather than shift in scaled indexed
addressing as a couple of CPUs had multiply by five and ten in addition >>>> to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
Higher resource cost and latency is a concern...
Yes, your design is living on the edge.
-------------------------
Sometimes the features dropped are subtle edge cases, say:
For JAL and JALR in RISC-V:
Is Xd ever anything other than X0 or X1?
In theory, yes.
In practice: Not so much.
For reasons like this, I only have
CALL DISP26<<2 // call through DECODE
and
CALX [*address] // call through table
and
CALA [address] // call through AGEN
which prevents compiler and assembler abuse.
On 1/2/2026 12:48 PM, MitchAlsup wrote:
-----merciful snip----------
I have heard arguments in both directions::
a) DISP64 only contains 33-bits of actual information
b) If DISP64 is absolute do you still need Rbase ??
when you have Rindex<<scale
c) how can the HW KNOW ?!?
------------------
I prefer to use multiply '*' rather than shift in scaled indexed >>>> addressing as a couple of CPUs had multiply by five and ten in addition >>>> to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
Higher resource cost and latency is a concern...
Yes, your design is living on the edge.
I am not sure how it would be pulled off for larger displacements or
more general scales.
Say:
void _mem_cpy16bytes(void *dst, void *src)
{
byte *cs, *ct;
cs=src; ct=dst;
ct[ 0]=cs[ 0]; ct[ 1]=cs[ 1]; ct[ 2]=cs[ 2]; ct[ 3]=cs[ 3];
ct[ 4]=cs[ 4]; ct[ 5]=cs[ 5]; ct[ 6]=cs[ 6]; ct[ 7]=cs[ 7];
ct[ 8]=cs[ 8]; ct[ 9]=cs[ 9]; ct[10]=cs[10]; ct[11]=cs[11];
ct[12]=cs[12]; ct[13]=cs[13]; ct[14]=cs[14]; ct[15]=cs[15];
}
Is, slow...
The store-to-load forwarding penalty arises because LZ4 decompression
often involves copying memory on top of itself, and the possible
workarounds for this issue only offer competitive performance for blocks
that are much longer than the typical copy (in the common case of a
match under 20 bytes, it is often faster to just copy the bytes and eat
the cost).
if(dist>=16)
{
if(len>=20)
{ more generalized/faster copy }
else
{ just copy 20 bytes. }
}else
{
if(len>=20)
{ generate pattern and fill with stride }
else
{ copy 20 bytes over itself. }
}
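As a hedged sketch of that strategy in plain C (the 16/20 cutoffs follow the outline above; the function name is mine; the "generate pattern and fill with stride" branch is omitted and handled by the byte loop here):

  #include <string.h>
  #include <stddef.h>

  static void lz4_copy_match(unsigned char *dst, size_t dist, size_t len)
  {
      unsigned char *src = dst - dist;
      if (dist >= 16 && len >= 20) {
          /* wide path: 16 bytes at a time; with dist >= 16 no chunk
             overlaps itself, and overshooting the end is fine as long
             as the output buffer has slack */
          for (size_t i = 0; i < len; i += 16)
              memcpy(dst + i, src + i, 16);
      } else {
          /* short or self-overlapping match: byte-by-byte keeps the
             "copy the buffer over itself" semantics and, per the point
             above, is usually cheapest at these lengths anyway */
          for (size_t i = 0; i < len; i++)
              dst[i] = src[i];
      }
  }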
For reasons like this, I only have
CALL DISP26<<2 // call through DECODE
and
CALX [*address] // call through table
and
CALA [address] // call through AGEN
which prevents compiler and assembler abuse.
They went and defined that you can use any register as a link register,
but in practice there is basically no reason to use alternative link
registers. ASM programmers could do so, but I have not seen all that much
evidence of this being a thing thus far.
Well, say, vs my approach:
LD X1, Disp(SP); ....; JALR X0, 0(X1)
The JALR is 1 cycle (CPU can see no in-flight modifications to LR, so it turns into a predicted unconditional branch).
But:
LD X1, -8(SP); JALR X0, 0(X1)
Yeah, enjoy those 13 or so clock cycles.
Looks over a sliding window of 10 or 12 instructions:
4 new instructions on previous predicted path (0 to 3);
4 preceding instructions (-4 to -1);
BGB <cr88192@gmail.com> posted:
On 1/2/2026 12:48 PM, MitchAlsup wrote:
-----merciful snip----------
I have heard arguments in both directions::
a) DISP64 only contains 33-bits of actual information
b) If DISP64 is absolute do you still need Rbase ??
when you have Rindex<<scale
c) how can the HW KNOW ?!?
------------------
I prefer to use multiply '*' rather than shift in scaled indexed >>>>>> addressing as a couple of CPUs had multiply by five and ten in addition >>>>>> to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
Higher resource cost and latency is a concern...
Yes, your design is living on the edge.
This, BTW is a compliment--the best an architect can do is make every
stage of the pipeline have the same delay !!
I am not sure how it would be pulled off for larger displacements or
more general scales.
Better adder technology. We routinely pound an 11-gate adder into the
delay of 8 Fan4 gate delays.
------------------------------------
Say:
  void _mem_cpy16bytes(void *dst, void *src)
  {
    byte *cs, *ct;
    cs=src; ct=dst;
    ct[ 0]=cs[ 0];  ct[ 1]=cs[ 1];  ct[ 2]=cs[ 2];  ct[ 3]=cs[ 3];
    ct[ 4]=cs[ 4];  ct[ 5]=cs[ 5];  ct[ 6]=cs[ 6];  ct[ 7]=cs[ 7];
    ct[ 8]=cs[ 8];  ct[ 9]=cs[ 9];  ct[10]=cs[10];  ct[11]=cs[11];
    ct[12]=cs[12];  ct[13]=cs[13];  ct[14]=cs[14];  ct[15]=cs[15];
  }
Is, slow...

Better ISA::
      MM      Rto,Rfrom,#16
and let HW do all the tricky/cool stuff--just make sure if you put it
in you fully support all the cool/tricky stuff.
The store-to-load forwarding penalty being because LZ4 decompression
often involves copying memory on top of itself, and the possible
workarounds for this issue only offer competitive performance for blocks
that are much longer than the typical copy (in the common case of a
match under 20 bytes, it often being faster to just copy bytes and eat
the cost).
if(dist>=16)
{
if(len>=20)
{ more generalized/faster copy }
else
{ just copy 20 bytes. }
}else
{
if(len>=20)
{ generate pattern and fill with stride }
else
{ copy 20 bytes over itself. }
}
This is a problem easier solved in HW than in source code.
For reasons like this, I only have
CALL DISP26<<2 // call through DECODE
and
CALX [*address] // call through table
and
CALA [address] // call through AGEN
which prevents compiler and assembler abuse.
They went and defined that you can use any register as a link register,
Another case where they screwed up.....
but in practice there is basically no reason to use alternative link
registers. ASM programmer people could do so, but not seen all that much
evidence of this being a thing thus far.
In Mc88k we recognized (and made compiler follow)
JMP R1 // return from subroutine
JMP ~R1 // switch
-------------------
Well, say, vs my approach:
LD X1, Disp(SP); ....; JALR X0, 0(X1)
The JALR is 1 cycle (CPU can see no in-flight modifications to LR, so it
turns into a predicted unconditional branch).
But:
LD X1, -8(SP); JALR X0, 0(X1)
Yeah, enjoy those 13 or so clock cycles.
CALX R0,[address]
....
Address is computed in normal AGEN, but processed in ICache, where it
FETCHes wide data (128-bits small machine, whole cache line larger
machine), and runs the result through Instruction buffer. 4 cycles.
-------------
Looks over a sliding window of 10 or 12 instructions:
4 new instructions on previous predicted path (0 to 3);
4 preceding instructions (-4 to -1);
4 alternate instructions on current predicted path
// so one can decode and issue non-sequential instructions
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Looking at the "18 RISC-V Compressed ISA V1.9" specification
Someone asked for more and dynamic numbers. This work contains them
in Section 1.9:
|Table 1.7 lists the standard RVC instructions with the most frequent
|first, showing the individual contributions of those instructions to
|static code size and then the running total for three experiments: the
|SPEC benchmarks for both RV32C and RV64C for the Linux kernel. For
|RV32, RVC reduces static code size by 24.5% on Dhrystone and 30.9% on
|CoreMark. For RV64, it reduces static code size by 26.3% on SPECint,
|25.8% on SPECfp, and 31.1% on the Linux kernel.
|
|Table 1.8 ranks the RVC instructions by order of typical dynamic
|frequency. For RV32, RVC reduces dynamic bytes fetched by 29.2% on
|Dhrystone and 29.3% on CoreMark. For RV64, it reduces dynamic bytes
|fetched by 26.9% on SPECint, 22.4% on SPECfp, and 26.11% booting the
|Linux kernel.
If you want the tables, look at source: <https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf>
- anton
On 2026-01-02 9:05 p.m., MitchAlsup wrote:
----------merciful snip-----------
Looks over a sliding window of 10 or 12 instructions:
4 new instructions on previous predicted path (0 to 3);
4 preceding instructions (-4 to -1);
4 alternate instructions on current predicted path
// so one can decode and issue non-sequential instructions
They could have put which GPR(s) is the link register in a CSR, if it
was desired to keep the paradigm of generality. I started working on
Qupls5 which is going to use a 32-bit ISA. The extra bits used to
specify a GPR as a link register are better used as branch displacement
bits IMO. I would be tempted to use two bits though to specify the LR,
as sometimes a second LR is handy.
A choice is whether to use GPRs as link registers. Not using a GPR gives
an extra register or two for GPR use. Using dedicated link register(s)
works well with a dedicated RET instruction. RET should be able to deallocate the stack. IMO using a dedicated link register is a bit like using an independent PC register. Or using a GPR for the link register
is a bit like using a GPR for the PC.
Qupls5 is going to use instruction fusing for compare-and-branch instructions. A compare followed by an unconditional branch will be
treated as one instruction. That gives a 23-bit branch displacement. Otherwise with a 32-bit instruction, a 12-bit branch displacement is not really quite enough for modern software. Sure, it works 90+% of the time, but it adds a headache to assembling and linking programs for when it
does not work.
Qupls5 will use constant postfixes which extend the constant by 22 bits
for each postfix used. To get a 64-bit constant, three postfixes will be required. Not quite as clean as universal constants, but simple to
implement in hardware.
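As an illustration only (the field positions and helper name below are made up; only the 22-bits-per-postfix arithmetic comes from the description above), a decoder-side splice could look like:

  #include <stdint.h>

  /* Hypothetical sketch: each postfix word contributes 22 immediate bits,
     so three postfixes (66 bits) more than cover a 64-bit constant; the
     top two bits of the third postfix simply fall off the 64-bit value. */
  static uint64_t splice_postfix_const(uint32_t pfx0, uint32_t pfx1, uint32_t pfx2)
  {
      uint64_t imm;
      imm  = (uint64_t)(pfx0 & 0x3FFFFF);        /* bits 21..0  */
      imm |= (uint64_t)(pfx1 & 0x3FFFFF) << 22;  /* bits 43..22 */
      imm |= (uint64_t)(pfx2 & 0x3FFFFF) << 44;  /* bits 63..44 */
      return imm;
  }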
Stuck on synthesis for Qupls4 which keeps omitting modules from the
design. I must have checked the module inputs and outputs dozens of
times, and do not know why they are excluded.
Say:
  void _mem_cpy16bytes(void *dst, void *src)
  {
      byte *cs, *ct;
      cs=src; ct=dst;
      ct[ 0]=cs[ 0];  ct[ 1]=cs[ 1];  ct[ 2]=cs[ 2];  ct[ 3]=cs[ 3];
      ct[ 4]=cs[ 4];  ct[ 5]=cs[ 5];  ct[ 6]=cs[ 6];  ct[ 7]=cs[ 7];
      ct[ 8]=cs[ 8];  ct[ 9]=cs[ 9];  ct[10]=cs[10];  ct[11]=cs[11];
      ct[12]=cs[12];  ct[13]=cs[13];  ct[14]=cs[14];  ct[15]=cs[15];
  }
Is, slow...
The store-to-load forwarding penalty being because LZ4 decompression
often involves copying memory on top of itself, and the possible
workarounds for this issue only offer competitive performance for blocks that are much longer than the typical copy (in the common case of a
match under 20 bytes, it often being faster to just copy bytes and eat the cost).
  if(dist>=16)
  {
      if(len>=20)
          { more generalized/faster copy }
      else
          { just copy 20 bytes. }
  }else
  {
      if(len>=20)
          { generate pattern and fill with stride }
      else
          { copy 20 bytes over itself. }
  }
BGB wrote:
Say:
  void _mem_cpy16bytes(void *dst, void *src)
  {
      byte *cs, *ct;
      cs=src; ct=dst;
      ct[ 0]=cs[ 0];  ct[ 1]=cs[ 1];  ct[ 2]=cs[ 2];  ct[ 3]=cs[ 3];
      ct[ 4]=cs[ 4];  ct[ 5]=cs[ 5];  ct[ 6]=cs[ 6];  ct[ 7]=cs[ 7];
      ct[ 8]=cs[ 8];  ct[ 9]=cs[ 9];  ct[10]=cs[10];  ct[11]=cs[11];
      ct[12]=cs[12];  ct[13]=cs[13];  ct[14]=cs[14];  ct[15]=cs[15];
  }
Is, slow...
The store-to-load forwarding penalty being because LZ4 decompression
often involves copying memory on top of itself, and the possible
workarounds for this issue only offer competitive performance for
blocks that are much longer than the typical copy (in the common case
of a match under 20 bytes, it often being faster to just copy bytes
and eat the cost).
  if(dist>=16)
  {
      if(len>=20)
          { more generalized/faster copy }
      else
          { just copy 20 bytes. }
  }else
  {
      if(len>=20)
          { generate pattern and fill with stride }
      else
          { copy 20 bytes over itself. }
  }
In my own LZ4 implementation I was able to beat Google's version specifically due to a better implementation of repeated pattern fills:
I use SSE/AVX with a set of swizzle tables, so that I can take
1/2/3/4..16 bytes and repeat them as many times as possible into a 32-
byte target. I.e. if the pattern length is 3 then the table to use will contain 0,1,2,0,1,2,0,1,2...0,1,2,0 in the first 16-byte entry and then 1,2,0,1,2,0...1,2,0,1,2 in the second entry.
Alongside this I have of course stride length (30 in the above example),
so that for long patterns I step forward by that much before doing
another 32-byte store.
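A minimal sketch of that idea (not Terje's actual code; simplified to a single 16-byte SSE register instead of his 32-byte targets, building the shuffle mask on the fly rather than looking it up in a precomputed table, and assuming SSSE3 plus some slack past the output):

  #include <stdint.h>
  #include <immintrin.h>

  /* Fill `len` bytes at dst by repeating the plen-byte pattern at src
     (plen = 1..16). Assumes 16 bytes are readable at src and that dst
     has ~16 bytes of slack past len. */
  static void pattern_fill(uint8_t *dst, const uint8_t *src,
                           size_t plen, size_t len)
  {
      uint8_t mask[16];                 /* would be a precomputed table,  */
      for (int i = 0; i < 16; i++)      /* indexed by plen, in real code  */
          mask[i] = (uint8_t)(i % plen);

      __m128i m   = _mm_loadu_si128((const __m128i *)mask);
      __m128i pat = _mm_loadu_si128((const __m128i *)src);
      __m128i rep = _mm_shuffle_epi8(pat, m);   /* pattern replicated to 16B */

      size_t stride = 16 - (16 % plen);         /* e.g. plen=3 -> stride 15  */
      for (size_t pos = 0; pos < len; pos += stride)
          _mm_storeu_si128((__m128i *)(dst + pos), rep);
  }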
The rest of their code was pretty good. :-)
Terje
On 1/4/2026 6:07 AM, Terje Mathisen wrote:
-----merciful snip-----------
Can fill in patterns sorta like:
  v0=*(u64 *)cs;
  switch(dist)
  {
  case 1:
      v0=v0&0xFF;
      v0|=v0<<8;
      v0|=v0<<16;
      v0|=v0<<32;
Otherwise, generally feeling kinda lonely at the moment.
Working on stuff doesn't entirely compensate for general feelings of pointlessness.
It is this or working on sci-fi stories of mine, which I had ended up
partly going and using AI to give commentary partly as I seemingly can't
get any actual humans to comment on my fiction.
Apparently, I guess as far as the AI was concerned on one of the stories
I threw at it (one about a Moon AI):
General ideas/theme were compared to "Asimov's Foundation Series";
I am like, OK, fair enough. My influences are varied, but mostly stuff
like "Ghost in the Shell" and "Mega Man" and similar (both of which have
a lot of trans-humanism themes). It seemingly misses some implicit things
about how the plot fits together (like the plot relevance of a detour
into talking about a character doing the whole E-Sports / Pro-Gamer
thing), or why the conclusion had "Raggedy Ann" references, etc.
Then again, I am not entirely sure how most people experience their own existence, there doesn't seem to be much description of this.
BGB <cr88192@gmail.com> posted:
On 1/4/2026 6:07 AM, Terje Mathisen wrote:
-----merciful snip-----------
Can fill in patterns sorta like:
  v0=*(u64 *)cs;
  switch(dist)
  {
  case 1:
      v0=v0&0xFF;
      v0|=v0<<8;
      v0|=v0<<16;
      v0|=v0<<32;
That was 1 instruction in my Samsung GPU ISA
SWIZ V1,V0,#[0,8,16,24]
The immediate was broken into 4-bit fields (represented above with
8-bit fields) and the immediate was used as eight 4-bit Mux selectors
from the other operand. {Not using Samsung ISA names or syntax}.
When I looked deeply into the situation, it was easier in HW to do::
for( i = 0; i < 8; i++ )
out[field[i]] = in[i]
than::
for( i = 0; i < 8; i++ )
out[i] = in[field[i]]
For some reason we called this swizzle not permute !?!
-------------------
Otherwise, generally feeling kinda lonely at the moment.
Working on stuff doesn't entirely compensate for general feelings of
pointlessness.
Seriously; get help.
It is this or working on sci-fi stories of mine, which I had ended up
partly going and using AI to give commentary partly as I seemingly can't
get any actual humans to comment on my fiction.
You are not the first purported author in this position
Apparently, I guess as far as the AI was concerned on one of the stories
I threw at it (one about a Moon AI):
General ideas/theme were compared to "Asimov's Foundation Series";
You do know that there are only 39 plots in all of literature ?!?
<snip>
I am like, OK, fair enough. My influences are varied, but mostly stuff
like "Ghost in the Shell" and "Mega Man" and similar (both of which have
a lot of trans-humanism themes). It seemingly misses some implicit things
about how the plot fits together (like the plot relevance of a detour
into talking about a character doing the whole E-Sports / Pro-Gamer
thing), or why the conclusion had "Raggedy Ann" references, etc.
Hint: use fewer words and paragraphs to convey the same amount of information.
Works in literature and in this NG.
------------------
Then again, I am not entirely sure how most people experience their own
existence, there doesn't seem to be much description of this.
I am 100% sure it is not as conveyed by Dickens !
On 1/4/2026 4:30 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
------------------
Then again, I am not entirely sure how most people experience
their own existence, there doesn't seem to be much description of
this.
I am 100% sure it is not as conveyed by Dickens !
Looks it up, I am not sure what Dickens was going on about.
On Mon, 5 Jan 2026 04:09:34 -0600
BGB <cr88192@gmail.com> wrote:
On 1/4/2026 4:30 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
------------------
Then again, I am not entirely sure how most people experience
their own existence, there doesn't seem to be much description of
this.
I am 100% sure it is not as conveyed by Dickens !
Looks it up, I am not sure what Dickens was going on about.
"A wonderful fact to reflect upon, that every human creature is
constituted to be that profound secret and mystery to every other."
I suppose that Sartre wanted to say something else by "L'enfer, c'est
les autres". Maybe an exact opposite.
On 1/5/2026 5:22 AM, Michael S wrote:
---------------------
On Mon, 5 Jan 2026 04:09:34 -0600
BGB <cr88192@gmail.com> wrote:
On 1/4/2026 4:30 PM, MitchAlsup wrote:
I am 100% sure it is not as conveyed by Dickens !
Looks it up, I am not sure what Dickens was going on about.
"A wonderful fact to reflect upon, that every human creature is
constituted to be that profound secret and mystery to every other."
I suppose that Sartre wanted to say something else by "L'enfer, c'est
les autres". Maybe an exact opposite.
It gets confusing, but alas will at least admit that I don't understand
most people.
As I can note, my mind seems to be partly divided into multiple sub-personas, with different properties, but it is awkward to describe
them. Socially, people are expected to see themselves as a singular
entity; admittedly whatever this notion of "singularness" is, isn't particularly strong in my case. I would still be classified as
self-aware though, as I do recognize my reflection in a mirror, etc.
BGB <cr88192@gmail.com> posted:
On 1/5/2026 5:22 AM, Michael S wrote:
---------------------
On Mon, 5 Jan 2026 04:09:34 -0600
BGB <cr88192@gmail.com> wrote:
On 1/4/2026 4:30 PM, MitchAlsup wrote:
I am 100% sure it is not as conveyed by Dickens !
Looks it up, I am not sure what Dickens was going on about.
"A wonderful fact to reflect upon, that every human creature is
constituted to be that profound secret and mystery to every other."
I suppose that Sartre wanted to say something else by "L'enfer, c'est
les autres". Maybe an exact opposite.
It gets confusing, but alas will at least admit that I don't understand
most people.
As I can note, my mind seems to be partly divided into multiple
sub-personas, with different properties, but it is awkward to describe
them. Socially, people are expected to see themselves as a singular
entity; admittedly whatever this notion of "singularness" is, isn't
particularly strong in my case. I would still be classified as
self-aware though, as I do recognize my reflection in a mirror, etc.
Sane people think there is a large gap between sane and insane.
We KNOW otherwise .....
On 1/5/2026 4:18 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 1/5/2026 5:22 AM, Michael S wrote:
---------------------
On Mon, 5 Jan 2026 04:09:34 -0600
BGB <cr88192@gmail.com> wrote:
On 1/4/2026 4:30 PM, MitchAlsup wrote:
I am 100% sure it is not as conveyed by Dickens !
Looks it up, I am not sure what Dickens was going on about.
"A wonderful fact to reflect upon, that every human creature is
constituted to be that profound secret and mystery to every other."
I suppose that Sartre wanted to say something else by "L'enfer, c'est
les autres". Maybe an exact opposite.
It gets confusing, but alas will at least admit that I don't understand
most people.
As I can note, my mind seems to be partly divided into multiple
sub-personas, with different properties, but it is awkward to describe
them. Socially, people are expected to see themselves as a singular
entity; admittedly whatever this notion of "singularness" is, isn't
particularly strong in my case. I would still be classified as
self-aware though, as I do recognize my reflection in a mirror, etc.
Sane people think there is a large gap between sane and insane.
We KNOW otherwise .....
Probably true, but seemingly neither psychopathy nor schizophrenia,
where these are the two major bad ones...
Then again, how bad ASD is (AKA: autism) likely depends on who you ask.
After seeing what Grok had to say about a lot of this, got back roughly:
  Likely ASD with some ADHD like features;
  VSS (Visual Snow Syndrome);
  Mild dissociation.
Also descriptions:
  Personas 1 and 3 likely exist in the left hemisphere,
    Persona 1 is likely mainly in the frontal lobe;
      Strongly associated with frontal lobe functions;
    Persona 3 is likely mainly in the parietal lobe;
      Strongly associated with parietal lobe functions.
    Possibly they represent a split between the dorsal and ventral streams.
  Persona 2 is strongly associated with right hemisphere functions.
There is likely anomalous behavior in the thalamus, corpus callosum, and occipital lobe; with some features likely tied to excessive computer use (apparently using the computer so much that the visual system starts adapting to specific patterns within the UI rather than to more "natural" patterns).
Or, some of it possible side effects of spending a significant part of
ones waking lifespan looking at text editors?...
So, seemingly this isn't quite the same as DID / MPD, in that it is more like brain regions and pathways starting to operate partially
independently of each other and forming their own semi-integrated experiences partially separate from those of the "greater self".
Apparently looking into it:
  Seeing noise and other artifacts in visual perception;
  Palinopsia (seeing trails behind things, etc);
  Photosensitivity / photophobia issues;
  Tinnitus;
  etc.
Being all associated with VSS, which seems mostly consistent with my experience.
Well, and apparently the sensory filtering (related to VSS) and large-scale integration functions are both handled by the thalamus (so possibly something has gone a little weird there).
Not like any of this is particularly new.
...
Otherwise, got around to getting the vector-math stuff working for RV64G
in BGBCC (and with that, got GLQuake working in BGBCC's RV64 mode). In
the basic RV64G mode, it exists mainly as scalar instructions and
runtime calls. There is the relative crappiness of handling 128-bit
vectors by dumping them to RAM and then reloading elements, doing math, storing elements back, and reloading the result vector on return (the GPR/FPR split makes things a pain, and in this case going through RAM
was less of a hassle). A lot of the runtime functions are still missing here though (need to implement every operator over every vector type, still not done, I just did the ones I needed for TKRA-GL).
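For illustration, such a scalar fallback amounts to something like the following (hypothetical helper; the name and struct layout are made up, not BGBCC's actual runtime):

  typedef struct { float e[4]; } vec4f_mem;

  /* both operands arrive spilled to memory, each element round-trips
     through scalar loads / FPU ops / stores, and the caller reloads
     the result vector from *dst afterwards */
  void __rt_vec4f_add(vec4f_mem *dst, const vec4f_mem *a, const vec4f_mem *b)
  {
      for (int i = 0; i < 4; i++)
          dst->e[i] = a->e[i] + b->e[i];
  }

So every element costs a load, a scalar add, and a store, on top of the caller spilling and reloading the whole vector around the call.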
I am debating whether to split the 64-bit and 128-bit SIMD cases for
BGBCC's SIMD support in RV64 mode. Allowing for a 64-bit only
implementation is potentially cheaper on the implementation, but would
be more hassle to deal with in the compiler (though, in this case,
trying to do a 128-bit op without 128-bit SIMD would just mean splitting
the instruction into two 64-bit ops).
...
When I looked deeply into the situation, it was easier in HW to do::
for( i = 0; i < 8; i++ )
out[field[i]] = in[i]
than::
for( i = 0; i < 8; i++ )
out[i] = in[field[i]]
For some reason we called this swizzle not permute !?!
MitchAlsup wrote:
When I looked deeply into the situation, it was easier in HW to do::
for( i = 0; i < 8; i++ )
out[field[i]] = in[i]
than::
for( i = 0; i < 8; i++ )
out[i] = in[field[i]]
That isn't really that surprising:
This way the inputs are available early and in sequential order, while
the stores can be allowed to have higher latency, right?
For some reason we called this swizzle not permute !?!
I'm assuming collisions would be disallowed? I.e. you can use it to
splat a single input into all output slots, but you cannot target
multiple inputs toward the same destination.
Terje
Terje Mathisen <terje.mathisen@tmsw.no> posted:
MitchAlsup wrote:
When I looked deeply into the situation, it was easier in HW to do::
for( i = 0; i < 8; i++ )
out[field[i]] = in[i]
than::
for( i = 0; i < 8; i++ )
out[i] = in[field[i]]
That isn't really that surprising:
This way the inputs are available early and in sequential order, while
the stores can be allowed to have higher latency, right?
For some reason we called this swizzle not permute !?!
I'm assuming collisions would be disallowed? I.e. you can use it to
splat a single input into all output slots, but you cannot target
multiple inputs toward the same destination.
The latter is why the HW logic is significantly easier.
Terje
On 1/6/2026 11:57 AM, MitchAlsup wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
MitchAlsup wrote:
When I looked deeply into the situation, it was easier in HW to do::
for( i = 0; i < 8; i++ )
out[field[i]] = in[i]
than::
for( i = 0; i < 8; i++ )
out[i] = in[field[i]]
That isn't really that surprising:
This way the inputs are available early and in sequential order, while
the stores can be allowed to have higher latency, right?
For some reason we called this swizzle not permute !?!
I'm assuming collisions would be disallowed? I.e. you can use it to
splat a single input into all output slots, but you cannot target
multiple inputs toward the same destination.
The latter is why the HW logic is significantly easier.
OK, but this does mean that the usability would be somewhat limited, and couldn't be used to generate the same sorts of repeating pattern fills needed for LZ decompression.
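To make the distinction concrete, the two dataflows in plain C (8-byte example; the comments reflect the collision point discussed above):

  #include <stdint.h>

  static void swiz_scatter(uint8_t out[8], const uint8_t in[8], const uint8_t field[8])
  {
      for (int i = 0; i < 8; i++)
          out[field[i]] = in[i];   /* field[] must be collision-free:
                                      two inputs cannot target one byte */
  }

  static void swiz_gather(uint8_t out[8], const uint8_t in[8], const uint8_t field[8])
  {
      for (int i = 0; i < 8; i++)
          out[i] = in[field[i]];   /* field[] may repeat an index, which is
                                      what a splat or pattern fill needs */
  }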
Terje
<snip>
One would argue that maybe prefixes are themselves wonky, but
otherwise one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and
are effectively retro-causal, and for any fetch to be safe at the end
of a cache line one would need to prove the non-existence of a suffix;
so better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction. I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current instruction slot plus a following area of the cache-line to the decoder.
Handling suffixes at the end of a cache-line is not too bad if the cache already handles instructions spanning a cache line. Assume the maximum number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.
It is easier to handle interrupts with suffixes. The suffix can just be treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.
For the most part, superscalar works the same either way, with similar
efficiency. There is a slight efficiency boost if it would be possible
to dynamically reshuffle ops during fetch. But, this is not currently
a thing in my case.
This latter case would apply if, say, a MEM op is followed by non-
dependent ALU ops, which under current superscalar handling they will
not co-execute, but it could be possible in theory to swap the ops and
allow them to co-execute.
...
- anton
<snip>
One would argue that maybe prefixes are themselves wonky, but otherwise one needs:
Instructions that can directly encode the presence of large immediate values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are effectively retro-causal, and for any fetch to be safe at the end of a cache line one would need to prove the non-existence of a suffix; so better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction.
I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current instruction slot plus a following area of the cache-line to the decoder.
Handling suffixes at the end of a cache-line is not too bad if the cache already handles instructions spanning a cache line. Assume the maximum number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.
It is easier to handle interrupts with suffixes. The suffix can just be treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.
- anton
On 12/31/2025 2:23 AM, Robert Finch wrote:
<snip>
One would argue that maybe prefixes are themselves wonky, but
otherwise one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one
views the instruction stream as linear, whereas suffixes add
weirdness and are effectively retro-causal, and for any fetch to be
safe at the end of a cache line one would need to prove the non-
existence of a suffix; so better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction. I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the
current instruction slot plus a following area of the cache-line to
the decoder.
ID stage is likely too late.
For PC advance, ideally this needs to be known by the IF stage so that
we can know how to advance PC for the next clock-cycle (for the PF stage).
Say:
  PF IF ID RF E1 E2 E3 WB
     PF IF ID RF E1 E2 E3 WB
        PF IF ID RF E1 E2 E3 WB
So, each IF stage producing an updated PC that needs to reach PF within
the same clock-cycle (so the SRAMs can fetch data for the correct cache-line, which happens on a clock-line edge).
This may also need to MUX PC's from things like the branch-predictor and branch-initiation logic, which then override the normal PC+Step handling generated from the IF->PF path (also typically at a low latency).
In this case, the end of the IF stage also handles some amount of
repacking;
  Possible:
    Right-justifying the fetched instructions;
    16 -> 32 bit repacking (for RV-C)
  Current:
    Renormalization of XG1/XG2/XG3 into the same internal scheme;
    Repacking 48-bit RISC-V ops into internal 64-bit forms;
    ...
As a partial result of this repacking, the instruction words effectively gain a few extra bits (the "internal normalized format" no longer
fitting entirely into a 32-bit word; where one could almost see it as a
sort of "extended instruction" that includes both ISAs in a single slightly-larger virtual instruction word).
One could go further and try to re-normalize the full instruction
layout, but as noted XG3 and RV would still differ enough as to make
this annoying (mostly the different encoding spaces and immed formats).
* zzzzzzz-ooooo-mmmmm-zzz-nnnnn-yy-yyy11
* zzzz-oooooo-mmmmmm-zzzz-nnnnnn-yy-yyPw
With a possible normalized format (36-bit):
* zzzzzzz-oooooo-mmmmmm-zzzz-nnnnnn-yyyyyPw
* zzzzzzz-0ooooo-0mmmmm-yzzz-0nnnnn-1yyyy10 (RV Repack)
* 000zzzz-oooooo-mmmmmm-zzzz-nnnnnn-0yyyyPw (XG3 Repack)
Couldn't fully unify the encoding space within a single clock cycle
though (within a reasonable cost budget).
At present, the decoder handling is to essentially unify the 32-bit
format for XG1/XG2/XG3 as XG2 with a few tag bits to disambiguate which
ISA decoding rules should apply for the 32-bit instruction word in
question. The other option would have been to normalize as XG3, but XG3 loses some minor functionality from XG1 and XG2.
I also went against allowing RV and XG3 jumbo prefixes to be mixed.
Though, it is possible exceptions could be made.
Wouldn't have needed J52I if XG3 prefixes could have been used with RV
ops, but can't use XG3 prefixes in RV-C mode, which is part of why I
ended up resorting to the J52I prefix hack. But, still doesn't fully
address the issues that exist with hot-patching in this mode.
Though, looking at options, the "cheapest but fastest" option at present likely being:
Core that only does XG3, possibly dropping the RV encodings and re-
adding WEX in its place (though, in such an XG3-Only mode, the 10/11
modes would otherwise be identical in terms of encoding).
Or, basically, XG3 being used in a way more like how XG2 was used.
But, don't really want to create yet-more modes at the moment. XG3 being used as superscalar isn't too much more expensive, and arguably more flexible given the compiler doesn't need to be aware of pipeline
scheduling specifics, but can still make use of this when trying to
shuffle instructions around for efficiency (a mismatch will then merely result in a small reduction in efficiency rather than a potential
inability of the code to run; though for XG2 there was the feature that
the CPU could fall back to scalar or potential superscalar operation in cases where the compiler's bundling was incompatible with what the CPU allowed).
So, it is possible that in-order superscalar may be better as a general purpose option even if not strictly the cheapest option.
A case could maybe be made arguing for dropping back down to 32 GPRs
(with no FPRs) for more cheapness, but as-is, trying to do 128-bit SIMD stuff in RV64 mode also tends to quickly run into issues with register pressure.
Well, and I was just recently having to partly rework the mechanism for:
  v = (__vec4f) { x, y, z, w };
To not try to load all the registers at the same time, as this was occasionally running out of free dynamic registers with the normal RV
ABI (and 12 callee-save FPRs doesn't go quite so far when allocating
pairs of them), which effectively causes the compiler to break.
It is almost tempting to consider switching RV64 over to the XG3 ABI
when using SIMD, well, and/or not use SIMD with RV64 because it kinda
sucks worse than XG3.
But... Comparably, for the TKRA-GL front-end (using syscalls for the back-end), using runtime calls and similar for vector operations does
still put a big dent in the framerate for GLQuake (so, some sort of SIMD
in RV mode may still be needed even if "kinda inferior").
Handling suffixes at the end of a cache-line is not too bad if the
cache already handles instructions spanning a cache line. Assume the
maximum number of suffixes is present and ensure the cache-line is
wide enough. Or limit the number of suffixes so they fit into the half
cache-line used for spanning.
Difference:
With a prefix, you know in advance the prefix exists (the prefix is immediately visible);
With a suffix, can only know it exists if it is visible.
So, it poses similar issues to those in making superscalar fetch work
across cache lines, which is made more of a challenge if one wants to
make superscalar across line boundaries work during I$ miss handling
rather than during instruction fetch.
But, if the logic is able to run at I$ miss time, ideally it can see
whether or not it is a single-wide or multi-wide instruction (say,
because this logic runs only on a single cache line at a time).
It is easier to handle interrupts with suffixes. The suffix can just
be treated as a NOP. Adjusting the position of the hardware interrupt
to the start of an instruction then does not have to worry about
accounting for a prefix / suffix.
Usual behavior is that when an interrupt occurs, SPC or similar always points at a valid PC, namely one from the pipeline, and usually/ideally,
the exact instruction on which the fault occurred (though this does get
a little fiddly with multiple instructions in the pipeline, and the CPU sometimes needs to figure out which stage corresponds to the fault in question; but this is usually more of an issue for I$ TLB misses, which often/usually trigger during a branch).
For the most part, superscalar works the same either way, with
similar efficiency. There is a slight efficiency boost if it would be
possible to dynamically reshuffle ops during fetch. But, this is not
currently a thing in my case.
This latter case would apply if, say, a MEM op is followed by non-
dependent ALU ops, which under current superscalar handling they will
not co-execute, but it could be possible in theory to swap the ops
and allow them to co-execute.
...
- anton
On 2026-01-06 5:42 p.m., BGB wrote:
On 12/31/2025 2:23 AM, Robert Finch wrote:
<snip>
One would argue that maybe prefixes are themselves wonky, but
otherwise one needs:
Instructions that can directly encode the presence of large
immediate values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one
views the instruction stream as linear, whereas suffixes add
weirdness and are effectively retro-causal, and for any fetch to be
safe at the end of a cache line one would need to prove the non-
existence of a suffix; so better to not go there).
I agree with this. Prefixes seem more natural, large numbers
expanding to the left, suffixes seem like a big-endian approach. But
I use suffixes for large constants. I think with most VLI constant
data follows the instruction. I find constant data easier to work
with that way and they can be processed in the same clock cycle as a
decode so they do not add to the dynamic instruction count. Just pass
the current instruction slot plus a following area of the cache-line
to the decoder.
ID stage is likely too late.
For PC advance, ideally this needs to be known by the IF stage so that
we can know how to advance PC for the next clock-cycle (for the PF
stage).
Say:
  PF IF ID RF E1 E2 E3 WB
     PF IF ID RF E1 E2 E3 WB
        PF IF ID RF E1 E2 E3 WB
The PC advance works okay without knowing whether there is a suffix
present or not. The suffix is treated like a NOP instruction. There is
no decode required at the fetch stage. The PC can land on a suffix. It
just always advances by four (N) instructions unless there is a branch.
Well, anyways, have since gotten SIMD in RV64 mode working "slightly
It is almost tempting to consider switching RV64 over to the XG3 ABI
when using SIMD, well, and/or not use SIMD with RV64 because it kinda
sucks worse than XG3.
But... Comparably, for the TKRA-GL front-end (using syscalls for the
back-end), using runtime calls and similar for vector operations does
still put a big dent in the framerate for GLQuake (so, some sort of
SIMD in RV mode may still be needed even if "kinda inferior").
Handling suffixes at the end of a cache-line is not too bad if the
cache already handles instructions spanning a cache line. Assume the
maximum number of suffixes is present and ensure the cache-line is
wide enough. Or limit the number of suffixes so they fit into the
half cache-line used for spanning.
Difference:
With a prefix, you know in advance the prefix exists (the prefix is
immediately visible);
With a suffix, can only know it exists if it is visible.
Instead with a prefix one only knows the instruction exists if it is visible. I do not think it makes much difference. Except that it may be harder to decode an instruction and look for a prefix that comes before it.
Robert Finch <robfi680@gmail.com> posted:
<snip>
One would argue that maybe prefixes are themselves wonky, but otherwise
one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
This is the direction of My 66000.
The instruction stream is a linear stream of words.
The first word of each instruction encodes its total length.
What follows the instruction itself are merely constants used as
operands in the instruction itself. All constants are 1 or 2
words in length.
I would not call this means "prefixed" or "suffixed". Generally,
prefixes and suffixes consume bits of the prefix/suffix so that
the constant (in my case) is not equal to container size. This
leads to wonky operand/displacement sizes not equal 2^(3+k).
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are
effectively retro-causal, and for any fetch to be safe at the end of a
cache line one would need to prove the non-existence of a suffix; so
better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction.
But not "self identified".
I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current
instruction slot plus a following area of the cache-line to the decoder.
Handling suffixes at the end of a cache-line is not too bad if the cache
already handles instructions spanning a cache line. Assume the maximum
number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.
It is easier to handle interrupts with suffixes. The suffix can just be
treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.
I would have thought that the previous instruction (last one retired) would provide the starting point of the subsequent instruction. This way you don't have to worry about counting prefixes or suffixes.
On 1/6/2026 5:49 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
<snip>
One would argue that maybe prefixes are themselves wonky, but otherwise one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
This is the direction of My 66000.
The instruction stream is a linear stream of words.
The first word of each instruction encodes its total length.
What follows the instruction itself are merely constants used as
operands in the instruction itself. All constants are 1 or 2
words in length.
I would not call this means "prefixed" or "suffixed". Generally,
prefixes and suffixes consume bits of the prefix/suffix so that
the constant (in my case) is not equal to container size. This
leads to wonky operand/displacement sizes not equal 2^(3+k).
OK.
As can be noted:
  XG2/3: Prefix scheme, 1/2/3 x 32-bit
    The 96-bit cases are determined by two prefixes.
    Requires looking at 2 words to know total length.
  RV64+Jx:
    Total length is known from the first instruction word:
      Base op: 32 bits;
      J21I: 64 bits
      J52I: 96 bits.
    There was a J22+J22+LUI special case,
      but I now consider this as deprecated.
      J52I+ADDI is now considered preferable.
As for Imm/Disp sizes:
  XG1: 9/33/57
  XG2 and XG3: 10/33/64
  RV+JX: 12/33/64
For XG1, the 57-bit size was rarely used and only optionally supported, mostly because of the great "crap all of immediate values between 34 and
62 bits" gulf.
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are
effectively retro-causal, and for any fetch to be safe at the end of a
cache line one would need to prove the non-existence of a suffix; so
better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction.
But not "self identified".
Yeah, if you can't know whether or not more of the instruction follows
after the first word by looking at the first word, this is a drawback.
Also, if you have to look at some special combination of register
specifiers and/or a lot of other bits, this is also a problem.
I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current
instruction slot plus a following area of the cache-line to the decoder.
Handling suffixes at the end of a cache-line is not too bad if the cache
already handles instructions spanning a cache line. Assume the maximum
number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.
It is easier to handle interrupts with suffixes. The suffix can just be
treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.
I would have thought that the previous instruction (last one retired)
would provide the starting point of the subsequent instruction. This way
you don't have to worry about counting prefixes or suffixes.
Yeah.
My thinking is, typical advance:
  IF figures out how much to advance;
  Next instruction gets PC+Step.
Then interrupt:
  Figure out which position in the pipeline interrupt starts from;
  Start there, flushing the rest of the pipeline;
  For a faulting instruction, this is typically the EX1 or EX2 stage.
    EX1 if it is a TRAP or SYSCALL;
    EX2 if it is a TLB miss or similar;
      Unless EX2 is not a valid spot (flush or bubble),
        then look for a spot that is not a flush or bubble.
      This case usually happens for branch-related TLB misses.
Usually EX3 or WB is too old, as it would mean re-running previous instructions.
Getting the exact stage-timing correct for interrupts is a little
fiddly, but worrying about prefix/suffix/etc issues with interrupts
isn't usually an issue, except that if somehow PC ended up pointing
inside another instruction, I would consider this a fault.
Usually for sake of branch-calculations in XG3 and RV, it is relative to
the BasePC before the prefix in the case of prefixed encodings. This
differs from XG1 and XG2 which defined branches relative to the PC of
the following instruction.
Though, this difference was partly due to a combination of
implementation reasons and for consistency with RISC-V (when using a
shared encoding space, makes sense if all the branches define PC displacements in a consistent way).
Though, there is the difference that XG3's branches use a 32-bit scale rather than a 16-bit scale. Well, and unlike RV's displacements, they
are not horrible confetti (*1).
*1: One can try to write a new RV decoder, and then place bets on
whether they will get JAL and Bcc encodings correct on the first try.
IME, almost invariably, one will screw these up in some way on the first attempt. Like, JAL's displacement encoding is "the gift that keeps on giving" in this sense.
Like, they were like:
  ADDI / Load:
    Yay, contiguous bits;
  Store:
    Well, swap the registers around and put the disp where Rd went.
  Bcc:
    Well, take the Store disp and just shuffle around a few more bits;
  JAL:
    Well, now there are some more bits, and Rd is back, ...
    Why not keep some of the bits from Bcc,
      but stick everything else in random places?...
    Well, I guess some share the relative positions as LUI, but, ...
Not perfect in XG3 either, but still:
  { opw[5] ? 11'h7FF : 11'h000, opw[11:6], opw[31:16] }
Is nowhere near the same level of nasty...
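For reference, the scattered RISC-V J-type and B-type displacements extracted in plain C (standard field positions from the ISA spec; sketch only):

  #include <stdint.h>

  /* JAL: imm[20|10:1|11|19:12] spread over inst[31:12], imm[0] = 0 */
  static int32_t rv_jal_disp(uint32_t inst)
  {
      uint32_t imm = ((inst >> 31) & 0x1)   << 20 |
                     ((inst >> 21) & 0x3FF) << 1  |
                     ((inst >> 20) & 0x1)   << 11 |
                     ((inst >> 12) & 0xFF)  << 12;
      return (int32_t)(imm << 11) >> 11;   /* sign-extend from bit 20 */
  }

  /* Bcc: imm[12|10:5] in inst[31:25], imm[4:1|11] in inst[11:7], imm[0] = 0 */
  static int32_t rv_branch_disp(uint32_t inst)
  {
      uint32_t imm = ((inst >> 31) & 0x1)  << 12 |
                     ((inst >> 25) & 0x3F) << 5  |
                     ((inst >> 8)  & 0xF)  << 1  |
                     ((inst >> 7)  & 0x1)  << 11;
      return (int32_t)(imm << 19) >> 19;   /* sign-extend from bit 12 */
  }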
Well, nevermind if actual decoder has the reverse issue:
In the VL core and JX2VM, it was internally repacked back into XG2 form, which means a little bit of hair going on here. Also I was originally going to relocate it in the encoding space, but ended up
moving back to its original location for reasons (mostly due to
sharing the same decoder) having BRA/BSR in two different locations
would have effectively burned more encoding space than just leaving it
where it had been in XG1/XG2 (even if having BRA/BSR in the F0 block is "kinda stupid" given it is "very much not a 3R instruction", but, ...).
At least in most other instructions, the imm/disp bits remain
contiguous. I instead differed by making the Rn/Rd spot be used as a
source register by some instructions (taking on the role of Rt/Rs2), but
IMO this is the lesser of two evils. Would rather have an Rd that is sometimes a source, than imm/disp fields that change chaotically from
one instruction to another.
...