Given the great popularity of the RISC architecture, I assumed that one of its characteristics, instructions that are all 32 bits in length, produced
a great increase in efficiency over variable-length instructions.
Therefore, I came up with the idea of using some opcode space for block headers which could contain information about the lengths of instructions, so as to make decoding variable-length instructions fully non-serialized, thus giving me the best of both worlds.
However, this involved overhead, and the headers would themselves take
time to decode. In any event, all the schemes I came up with were also elaborate and overly complicated.
But I have finally realized what I think is the decisive reason why I had been mistaken.
Before modern pipelined computers, which have multi-stage pipelines for instruction _execution_, a simple form of pipelining was very common - usually in the form of a three-stage fetch, decode, and execute pipeline.
Since the decoding of instructions can be so neatly separated from their execution, and thus performed well in advance of it, any overhead
associated with variable-length instructions becomes irrelevant because it essentially takes place very nearly completely in parallel to execution.
John Savard
John Savard <quadibloc@invalid.invalid> posted:
However, this involved overhead, and the headers would themselves take
time to decode. In any event, all the schemes I came up with were also
elaborate and overly complicated.
As I warned...
But I have finally realized what I think is the decisive reason why I
had been mistaken.
At Last ?!?
Before modern pipelined computers, which have multi-stage pipelines for
instruction _execution_, a simple form of pipelining was very common -
usually in the form of a three-stage fetch, decode, and execute
pipeline.
Since the decoding of instructions can be so neatly separated from
their execution, and thus performed well in advance of it, any overhead
associated with variable-length instructions becomes irrelevant because
it essentially takes place very nearly completely in parallel to
execution.
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
John Savard <quadibloc@invalid.invalid> writes:
Given the great popularity of the RISC architecture, I assumed that one of its characteristics, instructions that are all 32 bits in length, produced a great increase in efficiency over variable-length instructions.
Some RISCs have that, some RISCs have two instruction lengths: 16 bits
and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
but then eliminated in Power), one variant of Berkeley RISC, ARM T32,
RISC-V with the C extension, and probably others.
Before modern pipelined computers, which have multi-stage pipelines for instruction _execution_, a simple form of pipelining was very common - usually in the form of a three-stage fetch, decode, and execute pipeline. Since the decoding of instructions can be so neatly separated from their execution, and thus performed well in advance of it, any overhead associated with variable-length instructions becomes irrelevant because it essentially takes place very nearly completely in parallel to execution.
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually correspond to the end of the previous instruction, but with 16-bit and
32-bit instructions this potentially doubles the number of instruction decoders necessary, plus the circuitry for selecting the ones that are
at actual instruction starts.
I guess that this is the reason why ARM
uses an uop cache in cores that can execute ARM T32. The fact that
more recent ARM A64-only cores have often no uop cache while their
A64+T32 predecessors have had one reinforces this idea.
OTOH, on AMD64/IA-32 Intel's recent E-Cores do not use an uop cache
either, but instead the most recent instances have 3 decoders each of
which can decode 3 instructions per cycle (i.e., they attempt to
decode at many more positions and then select 3 per cycle out of
those); so apparently even byte-oriented variable-length encoding can
be decoded quickly enough.
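As a rough illustration of this decode-everywhere-then-select idea (a sketch only, not taken from any actual core; the one real detail used is the RISC-V convention that a 16-bit parcel whose low two bits are 11 starts a 32-bit instruction), in C:

    #include <stdint.h>
    #include <stdio.h>

    #define PARCELS 8   /* one 128-bit fetch block = 8 16-bit parcels */

    /* Speculative length of the instruction that WOULD start at a parcel:
       low two bits 11 => 32-bit instruction, anything else => 16-bit. */
    static int spec_len(uint16_t parcel)
    {
        return ((parcel & 0x3) == 0x3) ? 2 : 1;   /* length in parcels */
    }

    int main(void)
    {
        uint16_t fetch[PARCELS] = {0x4501, 0x0533, 0x8082, 0x4585,
                                   0x05b7, 0x0000, 0x9002, 0x0001};  /* arbitrary example bits */
        int len[PARCELS], start[PARCELS];

        /* Stage 1: every position decoded speculatively (in parallel in hardware). */
        for (int i = 0; i < PARCELS; i++)
            len[i] = spec_len(fetch[i]);

        /* Stage 2: the select chain keeps only real instruction starts. */
        for (int i = 0, next = 0; i < PARCELS; i++) {
            start[i] = (i == next);
            if (start[i])
                next = i + len[i];
        }

        for (int i = 0; i < PARCELS; i++)
            printf("parcel %d: %2d bits, start=%d\n", i, 16 * len[i], start[i]);
        return 0;
    }

In hardware the second loop is the select chain; the point is simply that every parcel can be handed to a decoder before it is known which parcels are real instruction starts.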
- anton
According to John Savard <quadibloc@invalid.invalid>:
Therefore, I came up with the idea of using some opcode space for
block headers which could contain information about the lengths of instructions, so as to make decoding variable-length instructions
fully non-serialized, thus giving me the best of both worlds.
Sounds like the first two bits of the opcode in S/360, which tell you
the instruction format and therefore how long the instruction is.
They've added lots of new instructions since then with somewhat
different formats, but those bits still tell you how long the
instruction is. The first byte tells you what the format is so you
know what address calculations to do.
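For concreteness, that rule fits in a few lines of C (a sketch; the 2/4/4/6-byte lengths for length codes 00/01/10/11 are the only real S/360 detail used):

    #include <stdint.h>

    /* Length of a System/360 instruction from the two high-order bits
       of the first (opcode) byte: 00 -> 2 bytes (RR), 01/10 -> 4 bytes,
       11 -> 6 bytes. */
    static int s360_length(uint8_t opcode)
    {
        static const int len[4] = { 2, 4, 4, 6 };
        return len[opcode >> 6];
    }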
On Fri, 19 Dec 2025 03:30:26 -0000 (UTC)
John Levine <johnl@taugh.com> wrote:
According to John Savard <quadibloc@invalid.invalid>:
Therefore, I came up with the idea of using some opcode space for
block headers which could contain information about the lengths of instructions, so as to make decoding variable-length instructions
fully non-serialized, thus giving me the best of both worlds.
Sounds like the first two bits of the opcode in S/360 which tells you
the instruction format which also tells you how long the instruction
is.
They've added lots of new instructions since then with somewhat
different formats, but those bits still tell you how long the
instruction is. The first byte tells you what the fornat is so you
know what address calculations to do.
With the very long pipelines that IBM has been using starting from the z10 (17 years
ago), it probably makes no difference.
The fact that there are only 3 options for instruction length is
important and simplifies things relative to the more than a dozen
options in x86, but how many bits one has to examine in order to
determine the length of an instruction is irrelevant, or close to
irrelevant, as long as they all reside near the beginning rather than
anywhere in the instruction, as in the VAX.
On 12/18/2025 4:25 PM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
Given the great popularity of the RISC architecture, I assumed that one of its characteristics, instructions that are all 32 bits in length, produced a great increase in efficiency over variable-length instructions.
Some RISCs have that, some RISCs have two instruction lengths: 16 bits
and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
but then eliminated in Power), one variant of Berkeley RISC, ARM T32, RISC-V with the C extension, and probably others.
I have come to realize that 32/64 is probably better than 16/32 here, primarily in terms of performance, but also helps with code-density (a
pure 32/64 encoding scheme can beat 16/32 in terms of code-density
despite only having larger instructions available).
One could argue "But MOV is less space efficient", can note that it also makes sense to try to design the compiler to minimize the number of unnecessary MOV instructions and similar (and when using the minimal
number of register moves, the lack of a small MOV encoding has less
effect on code density).
16/32/64 is also sensible, but the existence of 16-bit ops negatively
affects encoding space (it is more of a strain to have both 16-bit ops
and 6-bit register fields; but at least some code can benefit from
having 64 GPRs).
So, say:
16/32: RV64GC (OK code density)
16/32/64: RV64GC+JX: Better code density than RV64GC.
32/64: RV64G+JX (seemingly slightly beats RV64GC)
But, not as much as GC+JX.
16/32/64/96: XG1 (still best code density).
32/64/96: XG2 and XG3;
Also good for code density;
Somehow XG3 loses to XG2 despite being nearly 1:1;
Though, XG3 has mostly claimed the performance crown.
Or, descending, code-density:
XG1, RV64GC+JX, XG2, RV64G+JX, XG3, RV64GC, RV64G
And, performance:
XG3, XG2, RV64G+JX, XG1, RV64GC+JX, RV64G, RV64GC
Where, both the 16-bit ops and some lacking features (in RV64G and
RV64GC) negatively affect things.
Where, the main things that benefit JX here being:
  Jumbo prefixes, extending Imm12/Disp12 to 33 bits;
  Indexed Load/Store;
  Load/Store Pair;
  Re-adding ADDWU/SUBWU and similar.
The Zba instructions also help,
  but Load/Store pair greatly reduces effect of Zba.
It would be possible to get better code density than 'C' with some tweaks:
Reducing many of the imm/disp fields by 1 bit;
Would free up a lot of encoding space.
Imm6/Disp6 eats too much encoding space here.
Making most of the register fields 4 bits (X8..X23)
Can improve hit-rate notably over Reg3.
But:
Main merit of 'C' is compatibility with binaries that use 'C';
This merit would be lost by modifying or replacing 'C'.
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually correspond to the end of the previous instruction, but with 16-bit and 32-bit instructions this potentially doubles the amount of instruction decoders necessary, plus the circuit for selecting the ones that are
at actual instruction starts. I guess that this is the reason why ARM
uses an uop cache in cores that can execute ARM T32. The fact that
more recent ARM A64-only cores have often no uop cache while their
A64+T32 predecessors have had one reinforces this idea.
I took the option of not bothering with parallel execution for 16-bit ops.
Even if 16-bit ops could be superscalar though, the benefits would be
small: Code patterns that favor 16-bit ops also tend to be lower in
terms of available ILP.
Decoding at 2 or 3 wide seems to make the most sense:
Gets a nice speedup over 1;
Works with in-order.
Here, 3 is slightly better than 2.
But, getting that much benefit from going any wider than this, is likely
to require some amount of "heavy lifting".
So, while a 4 or 5 wide in-order design could be possible, pretty much
no normal code is going to have enough ILP to make it worthwhile over 2
or 3.
Also 2 or 3 works reasonably well with a 96-bit fetch:
One trick here could be to precompute a lot of this when fetching cache
lines, though a full instruction length could not be determined at fetch
time if the instruction crosses a cache line unless we have also fetched
the next cache line. Full instruction length could be determined in
advance (at fetch time) if it always fetches both cache lines and then
determines the lengths for one of them before writing to the cache
(possibly, if the next line is fetched, its contents are not written to
the cache, as lengths can't be fully determined yet).
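A minimal C sketch of that predecode-at-fill idea (illustrative only; the 16-bit-parcel length rule and the side array holding the predecoded lengths are assumptions, not the actual design). With a rule that needs only the first parcel nothing spills across lines; an encoding whose length-determining bits can extend into the next line would need the extra handling described above:

    #include <stdint.h>

    #define LINE_PARCELS 16   /* e.g. a 32-byte I-cache line = 16 parcels */

    /* Computed once when the line is filled and stored alongside it, so
       the fetch/decode stages only read these bits instead of re-deriving
       instruction lengths on every access. */
    void predecode_line(const uint16_t parcel[LINE_PARCELS],
                        uint8_t len_bits[LINE_PARCELS])
    {
        for (int i = 0; i < LINE_PARCELS; i++)
            len_bits[i] = ((parcel[i] & 0x3) == 0x3) ? 32 : 16;
    }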
BGB <cr88192@gmail.com> posted:
On 12/18/2025 4:25 PM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
Given the great popularity of the RISC architecture, I assumed that one of its characteristics, instructions that are all 32 bits in length, produced a great increase in efficiency over variable-length instructions.
Some RISCs have that, some RISCs have two instruction lengths: 16 bits
and 32 bits: IIRC one variant of the IBM 801 (inherited by the ROMP,
but then eliminated in Power), one variant of Berkeley RISC, ARM T32,
RISC-V with the C extension, and probably others.
I have come to realize that 32/64 is probably better than 16/32 here,
primarily in terms of performance, but also helps with code-density (a
pure 32/64 encoding scheme can beat 16/32 in terms of code-density
despite only having larger instructions available).
My 66000 does not even bother with 16-bit instructions--and still ends
up requiring a lower instruction count than RISC-V. {32, 64, 96, 128, 160}
bits are the instruction sizes; with no instruction ever requiring constants
to be assembled.
One could argue "But MOV is less space efficient", can note that it also
makes sense to try to design the compiler to minimize the number of
unnecessary MOV instructions and similar (and when using the minimal
number of register moves, the lack of a small MOV encoding has less
effect on code density).
Most of the MOV instructions in My 66000 are found::
a) before a call--moving values to argument positions,
b) after a call--moving results to post-call positions,
c) around loops --moving values for next loop iteration.
16/32/64 is also sensible, but the existence of 16-bit ops negatively
effects encoding space (it is more of a strain to have both 16-bit ops
and 6-bit register fields; but at least some code can benefit from
having 64 GPRs).
I agree that RISC-V has too many 16-bit instructions, and that it gains
too little in the code density department by having them.
So, say:
16/32: RV64GC (OK code density)
16/32/64: RV64GC+JX: Better code density than RV64GC.
32/64: RV64G+JX (seemingly slightly beats RV64GC)
But, not as much as GC+JX.
16/32/64/96: XG1 (still best code for density).
32/64/96: XG2 and XG3;
Also good for code density;
Somehow XG3 loses to XG2 despite being nearly 1:1;
Though, XG3 has mostly claimed the performance crown.
Or, descending, code-density:
XG1, RV64GC+JX, XG2, RV64G+JX, XG3, RV64GC, RV64G
And, performance:
XG3, XG2, RV64G+JX, XG1, RV64GC+JX, RV64G, RV64GC
Rather than tracking code density--which measures cache performance--
I have come to think that counting instructions themselves is the key.
If the instruction is present then it has to be executed; if not, then
it was free !! in all real senses.
Where, both the 16-bit ops, and some lacking features (in RV64G and
RV64GC) negatively effecting things.
Like a reasonable OpCode layout.
Where, the main things that benefit JX here being:
        I have no prefixes {well CARRY}
  Jumbo prefixes, extending Imm12/Disp12 to 33 bits;
        -#Imm5, Imm16, Imm32, Imm64, Disp16, Disp32, Disp64
  Indexed Load/Store;
        check
  Load/Store Pair;
        LDM, STM, ENTER, EXIT, MM, MS
  Re-adding ADDWU/SUBWU and similar.
        {int,float} × {OpCode} × {Byte, Half, Word, DBLE}
The Zba instructions also help,
        Which is why -#imm5 works better.
  but Load/Store pair greatly reduces effect of Zba.
It would be possible to get better code density than 'C' with some tweaks:
  Reducing many of the imm/disp fields by 1 bit;
Would free up a lot of encoding space.
Imm6/Disp6 eats too much encoding space here.
Making most of the register fields 4 bits (X8..X23)
Can improve hit-rate notably over Reg3.
But:
Main merit of 'C' is compatibility with binaries that use 'C';
This merit would be lost by modifying or replacing 'C'.
I can still fit my entire ISA into the space vacated by C.
----------------------
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction, but with 16-bit and
32-bit instructions this potentially doubles the amount of instruction
decoders necessary, plus the circuit for selecting the ones that are
at actual instruction starts. I guess that this is the reason why ARM
uses an uop cache in cores that can execute ARM T32. The fact that
more recent ARM A64-only cores have often no uop cache while their
A64+T32 predecessors have had one reinforces this idea.
I took the option of not bothering with parallel execution for 16-bit ops.
I took the option of not bothering with 16-bit Ops.
-----------------------
Even if 16-bit ops could be superscalar though, the benefits would be
small: Code patterns that favor 16-bit ops also tend to be lower in
terms of available ILP.
I suspect that argument setup before and result take-down after call
would have quite a bit of parallelism.
I suspect that moving fields around for the next loop iteration would
have significant parallelism.
------------------------------
Decoding at 2 or 3 wide seems to make the most sense:
Gets a nice speedup over 1;
Works with in-order.
Here, 3 is slightly better than 2.
But, getting that much benefit from going any wider than this, is likely
to require some amount of "heavy lifting".
Probably not conducive to FPGA implementations due to LUT count and
special memories {predictors, ..., TLBs, staging buffers, ...}
So, while a 4 or 5 wide in-order design could be possible, pretty much
no normal code is going to have enough ILP to make it worthwhile over 2
or 3.
1-wide 0.7 IPC
2-wide 1.0 IPC gain of 50%
3-wide 1.4 IPC gain of 40%
6-wide 2.2 IPC gain of 50% from doubling the width
10-wide 3.2 IPC gain of 50% from almost doubling width
Also 2 or 3 works reasonably well with a 96-bit fetch:
But Fetches are 128-bits wide !!! and the average instruction is 35-bits wide.
------------------------
One trick here could be to precompute a lot of this when fetching cache
lines, though a full instruction length could not be determined at fetch
time if the instruction crosses a cache line unless we have also fetched
the next cache line. Full instruction length could be determine in
advance (at fetch time) if it always fetches both cache-lines and then
determines the lengths for one of them before writing to the cache
(possibly if the next line is fetched, it contents are not written to
the cache as lengths can't be fully determined yet).
All of the above was solved in Athlon, and then made 3× smaller in Opteron at the cost of 1 pipe stage in DECODE.
On 2025-12-19 6:36 p.m., MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
So, while a 4 or 5 wide in-order design could be possible, pretty much
no normal code is going to have enough ILP to make it worthwhile over 2
or 3.
1-wide 0.7 IPC
2-wide 1.0 IPC gain of 50%
3-wide 1.4 IPC gain of 40%
6-wide 2.2 IPC gain of 50% from doubling the width
10wide 3.2 IPC gain of 50% from almost doubling width
Also 2 or 3 works reasonably well with a 96-bit fetch:
But Fetches ae 128-bits wide !!! and the average instruction is 35-
bits wide.
Could the average instruction size be an argument for the use of wider
(40-bit) instructions? One would think that the instruction should be a
bit wider than average. Been trying to shrink Qupls4 instructions down to
40 bits for Qupls5. The odd size is not that great an issue if variable
lengths are supported.
Robert Finch <robfi680@gmail.com> schrieb:
Could the average instruction size be an argument for the use of wider
(40-bits) instructions? One would think that the instruction should be a
bit wider than average. Bin trying to shrink Qupls4 instructions down to
40-bits for Qupls5. The odd size is not that great an issue if variable
lengths are supported.
Can you show a few examples of what these wide instructions are used
for? Seems like a lot of bits to me...
On 2025-12-20 5:47 a.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Could the average instruction size be an argument for the use of wider
(40-bits) instructions? One would think that the instruction should be a
bit wider than average. Bin trying to shrink Qupls4 instructions down to
40-bits for Qupls5. The odd size is not that great an issue if variable
lengths are supported.
Can you show a few examples of what these wide instructions are used
for? Seems like a lot of bits to me...
The average instruction size being about 35 bits most likely comes
from including immediate constant bits. 32-bits works for most things,
if one is willing to increase the dynamic instruction count.
But extra bits can be used for selecting small immediates, vector register selection.
   7 opcode
   6 dest reg + sign control
   6 src1 reg + sign control
   6 src2 reg + sign control
   7 func code
  -------
  32
Qupls4 has the additional bits for
   6 src3 reg + sign control
   4 vector register select
   3 small immediate select
   3 second ALU op
  ---
  16
For immediates Qupls4 has
   7 opcode
   6 dst reg
   6 src1 reg
   2 precision control
  27 bit immediate constant
Having three source registers allows: fused multiply add, bit field ops,
three input adds, multiplex, and a few others. There are some
instructions with four source registers (fused dot product) or two
destinations (carry outputs / fp status).
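Just to make the field budget concrete, the two 48-bit layouts listed above can be written as C bitfields (purely illustrative; the field order and packing are not claimed to match the real Qupls4 encoding):

    struct qupls4_rr {          /* 32-bit base ...                 */
        unsigned opcode  : 7;
        unsigned dst     : 6;   /* dest reg + sign control         */
        unsigned src1    : 6;   /* src1 reg + sign control         */
        unsigned src2    : 6;   /* src2 reg + sign control         */
        unsigned func    : 7;
                                /* ... plus the 16-bit extension   */
        unsigned src3    : 6;   /* src3 reg + sign control         */
        unsigned vecsel  : 4;   /* vector register select          */
        unsigned immsel  : 3;   /* small immediate select          */
        unsigned alu2    : 3;   /* second ALU op                   */
    };

    struct qupls4_imm {         /* 48-bit immediate form           */
        unsigned opcode  : 7;
        unsigned dst     : 6;
        unsigned src1    : 6;
        unsigned prec    : 2;   /* precision control               */
        unsigned imm     : 27;  /* immediate constant              */
    };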
I have come to realize that 32/64 is probably better than 16/32 here,
primarily in terms of performance, but also helps with code-density (a
pure 32/64 encoding scheme can beat 16/32 in terms of code-density
despite only having larger instructions available).
My 66000 does not even bother with 16-bit instructions--and still ends
up requiring fewer instruction count than RISC-V. {32, 64, 96, 128, 160} are the instruction sizes; with no instructions ever requiring constants
to be assembled.
Indeed, My 66000 aims for "fat" instructions so as to try and reduce
instruction counts. That should hopefully result in an efficient ISA:
fewer instructions should cost less runtime resources (as long as they
don't get split into more uops).
Most of the MOV instructions in My 66000 are found::[...]
a) before a call--moving values to argument positions,
b) after a call--moving results to post-call positions,
c) around loops --moving values for next loop iteration.
I suspect that argument setup before and result take-down after call
would have quite a bit of parallelism. I suspect that moving fields
around for the next loop iteration would have significant parallelism.
Are you saying that you expect the efficiency of My 66000 could be
improved by adding some way to express those moves in a better way?
A key element of the Mill is/was its ability to "permute" its belt
elements in a single cycle. I still don't fully understand how this is encoded in the ISA and implemented in hardware, but it sounds like
you're hinting in the same direction: some kind of "parallel move" instruction with many inputs and many outputs.
Stefan
For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
In terms of encoding, these are fairly easy and could each fit within
a 32bit instruction.
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
IIUC these could have any number of registers and the destination and
source regs can be "anything", so the encoding would take up more space. Arguably it might be possible in many/most cases to arrange for
{Rm,Rn,Rj} to be {R1..Rn}, so it might be able to use the same
instruction as the call-setup.
I just can't see how to make these run reasonably fast within the constraints of the GBOoO Data Path.
Hmm... One would hope this can be handled entirely in the renamer
without touching the actual data path, but ... sorry: if you don't know
how to do it, I sure don't either.
Stefan
Stefan Monnier <monnier@iro.umontreal.ca> posted:
For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
In terms of encoding, these are fairly easy and could each fit within
a 32bit instruction.
You are going to put 6×5-bit fields in a single 32-bit instruction with
a 6-bit Major OpCode ?!?! I would like to see it done. Remember: all
specifiers are in the first 32-bits of the "instruction"; only constants
are used as Variable Length.
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
IIUC these could have any number of registers and the destination and
source regs can be "anything", so the encoding would take up more space.
Arguably it might be possible in many/most cases to arrange for
{Rm,Rn,Rj} to be {R1..Rn}, so it might be able to use the same
instruction as the call-setup.
In principle I buy this argument:: in practice I can't see it happening.
I can see an encoding that would provide a "bunch of MOVs/Renames"
but only if I disobey a principal tenet of ISA encoding {One that RISC-V
threw away on day 1} and that is: the register specification fields are
at fixed locations. It is this tenet that removed some <arguably thin>
logic before multiplexing the specifiers into the RF decoder. The fixed
position argument has neither the logic nor the multiplexer; RF specifiers
are wired directly to the RF/Renamer decoder ports.
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Hmm... One would hope this can be handled entirely in the renamer
without touching the actual data path, but ... sorry: if you don't know
how to do it, I sure don't either.
Once one goes beyond the 3-operand 1-result property, all sorts of little
things start to break--like multiplexing the RF specifiers. The Data-Path
and the Register/Renamer ports are all designed to this FMAC requirement,
giving us CMOV and INSert instructions with reasonable encodings.
Right now, there are no register specifiers in the variable length part
of the ISA--just constants.
It is also not exactly clear how one "makes" an instruction with {2,3,4,5,6,7}
writes traverse the pipeline smoothly. I gave serious consideration to finding
a smooth solution to even {2} results, and for this I built an accumulator
attached to the 3-operand+1-result function units where the added operand is
read once (if needed) and written once (if needed), often not requiring ANY
RF activity in support of the CARRY variable itself.
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
I made a 36-bit ISA a while ago on the notion of average instruction
size (18-bit compressed instructions). Writing the assembler for it was something special. Took about 10x as much effort as a byte-oriented one.
So, I am not keen on non-byte sizes. But maybe another 36-bitter. Takes
a bit to get used to addressing for the instruction pointer.
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken here...
when I read this, I thought that there was a standard technique for doing stuff like that in a GBOoO machine.
Just break down all the fancy instructions into RISC-style pseudo-ops. But apparently, since you would know all about that, there must be a reason why it doesn't apply in these cases.
John Savard
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no different than any other calculation, except that no mangling of the
bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you would
know all about that, there must be a reason why it doesn't apply in these
cases.
x86 has short/small MOV instructions. Not so with RISCs.
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
However, this involved overhead, and the headers would themselves take
time to decode. In any event, all the schemes I came up with were also
elaborate and overly complicated.
As I warned...
I wasn't blind - of course I knew that all along. But what I still
failed to see was any good alternative.
But I have finally realized what I think is the decisive reason why I
had been mistaken.
At Last ?!?
On further reflection, I think I had already realized that decoding is done
ahead of execution, and thus can be thought of as mostly done in parallel
with it. But decoding still has to be done *first*, before execution can
start. So I felt that in a design where super-aggressive pipelining or
vectorization allows many instructions to be done in parallel, if decoding
is necessarily serial, it could still become a bottleneck.
Before modern pipelined computers, which have multi-stage pipelines for
instruction _execution_, a simple form of pipelining was very common -
usually in the form of a three-stage fetch, decode, and execute
pipeline.
Since the decoding of instructions can be so neatly separated from
their execution, and thus performed well in advance of it, any overhead
associated with variable-length instructions becomes irrelevant because
it essentially takes place very nearly completely in parallel to
execution.
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
No.
If you flipped "decode" and "execute" in that sentence above, I would 100% agree. And maybe this _is_ just a typo.
But if you actually did mean that sentence exactly as written, I would disagree. This is why: I regard executing instructions as 'doing the
actual work' and decoding instructions as... some unfortunate trivial overhead that can't be avoided.
Hence, if I can decode instructions much faster than I can execute them... _possibly_ the decoder is overdesigned, but it's also perfectly possible that there isn't really a slower decoder design that would make sense.
And maybe this perspective _explains_ why I dabbled in elaborate schemes
to allow decoding in parallel. I absolutely refused to allow decoding to become a bottleneck, no matter how aggressively OoO the execution part is designed for breakneck speed at all costs.
John Savard
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for doing stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no different than any other calculation, except that no mangling of the
bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you would know all about that, there must be a reason why it doesn't apply in these cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so-called LOCK MOV? For some damn reason I remember something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc.
For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
In terms of encoding, these are fairly easy and could each fit within
a 32bit instruction.
You are going to put 6×5-bit fields in a single 32-bit instruction with
a 6-bit Major OpCode ?!?!
I would like to see it done.
I can see an encoding that would provide a "bunch of MOVs/Renames"
but only if I disobey a principle tenet of ISA encoding {One that RISC-V threw away on day 1} and that is; the register specification fields are
at fixed locations. It is this tenet that removed some <arguably thin>
logic before multiplexing the specifiers into the RF decoder. The fixed position argument has neither the logic nor the multiplexer, RF specifiers are wired directly to the RF/Renamer decoder ports directly.
It is also not exactly clear how one "makes" an instruction with {2,3,4,5,6,7} writes traverse the pipeline smoothly. I took serious consideration to find an smooth solution to even {2} results, and for
this I built an accumulator attached to the 3-operand+1-result
function units where the added operand is read once (if needed) and
written once (if needed) often not requiring ANY RF activity in
support of the CARRY variable itself.
The way I see it, the problem is that after
MULTIMOVE {R1,R2} <= {R6,R8}
the preceding instruction which generated a result into R6 now needs to
put the result into both R6 and R1.
Maybe a way to avoid that problem
is to make the renaming architectural. I.e. add a "register renaming
table" (RRT), and introduce the instruction RENAME which changes
that RRT. Whenever an instruction wants to read register Rn, the actual
architectural register we'll read is obtained by passing `n` through RRT.
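A toy C model of that architectural rename-table idea (entirely a sketch of the proposal, not an existing ISA feature):

    #include <stdint.h>

    #define NREGS 32

    typedef struct {
        uint8_t  map[NREGS];    /* architectural name -> storage cell */
        uint64_t cell[NREGS];   /* the values themselves              */
    } rrt_state;

    static uint64_t read_reg(const rrt_state *s, int n)
    {
        return s->cell[s->map[n]];
    }

    static void write_reg(rrt_state *s, int n, uint64_t v)
    {
        s->cell[s->map[n]] = v;
    }

    /* RENAME {R1,R2} <= {R6,R8}: two map updates, no data movement.
       Note that R1 and R6 now name the same cell, which is exactly the
       "who later writes R6?" complication raised above. */
    static void rename2(rrt_state *s, int d1, int s1, int d2, int s2)
    {
        s->map[d1] = s->map[s1];
        s->map[d2] = s->map[s2];
    }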
On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,
Oh, yes, I had always realized that, but dismissed it as far too wasteful.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no
different than any other calculation, except that no mangling of the
bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you would know all about that, there must be a reason why it doesn't apply in these cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect..
The 2-operand+displacement LD/STs have a lock bit in the instruction--that
is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and it's ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory, I will need an Instruction-Modifier {A.K.A. a prefix}.
John Savard <quadibloc@invalid.invalid> wrote:
On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,
Oh, yes, I had always realized that, but dismissed it as far too wasteful.
Well, Mitch claims an average of 35 bits per instruction, which means
about 90% utilization of the decoders, so not bad. Probably more
waste is due to the muxes needed to shift instructions into the right
positions, but since you allow variant encodings you need
muxes too.
Also, consider that the alternative to variable length instructions
is to use longer instructions or more of them. In the case of constants,
a classic RISC needs more instructions to assemble a constant than the
variable length encoding needs extra words. So a classic RISC is
going to need more "decode events" than a machine using a variable
length encoding with 32-bit units, even though all "decode events"
are "useful" on the RISC and some "decode events" on the variable length
machine are discarded.
Maybe a way to avoid that problem
is to make the renaming architectural. I.e. add a "register renaming
table" (RRT), and introduce the instruction RENAME which changes
that RRT. Whenever an instruction wants to read register Rn, the actual
architectural register we'll read is obtained by passing `n` through RRT.
All of that happens with microarchitectural renaming (your RRT is
called RAT (register alias table), however). Your "RENAME"
instruction is called "MOV". Why make the RAT architectural?
Maybe a way to avoid that problem
is to make the renaming architectural. I.e. add a "register renaming
table" (RRT), and introduce the instruction RENAME which changes
that RRT. Whenever an instruction wants to read register Rn, the actual
architectural register we'll read is obtained by passing `n` through RRT.
All of that happens with microarchitectural renaming (your RRT is
called RAT (register alias table), however). Your "RENAME"
instruction is called "MOV". Why make the RAT architectural?
Good question. I was just reacting to Mitch who seemed to say that one
of the main problems with a multi-move instruction is that it has too
many outputs and that doesn't fit into the general design, so by making
the RRT/RAT architectural it makes the instruction single-output.
I don't know if in practice it would make any difference.
Stefan
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to instruction complexity: if you support 16bit instructions, it means you support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
Instead, the My 66000 ISA tries to make instructions fatter, so as to
reduce the number of instructions rather than the size of each instruction. And the idea is that this applies both to static and to dynamic counts.
That's why Mitch includes negation and sign-extension directly inside
every arithmetic instruction. The hope is that they don't increase the critical path (in the combinatory logic of a single cycle), or they
increase it less than the corresponding decrease in the other critical
path (the one in the dataflow graph of instructions).
Another way to look at it: For the execution of any specific
instruction, we spend N1 gate-delays on useful work, N2 gate-delays
waiting for the end of the cycle (because the duration of cycle is based
on the maximum of all possible N1s), and N3 gate-delays on latching.
Fatter instructions are a way to try and reduce N2 and the number of
times we pay N3.
I wish I knew how to make an ISA where the single cycle instructions
can perform even more work like two or more dependent additions.
[ I mean, I know of ways to do it, but they all tend to increase N2
much too much on average. ]
Stefan
Stefan Monnier <monnier@iro.umontreal.ca> posted:
I wish I knew how to make an ISA where the single cycle instructions
can perform even more work like two or more dependent additions.
Data-General Nova. However, with a modern RISC-like ISA, there are not
enough small-shifts to amortize--except in the memory addressing arena
where scaled indexing saves instructions and cycles.
MitchAlsup [2025-12-28 17:59:02] wrote:
Stefan Monnier <monnier@iro.umontreal.ca> posted:
I wish I knew how to make an ISA where the single cycle instructions
can perform even more work like two or more dependent additions.
Data-General Nova. However, with a modern RISC-like ISA, there are not
enough small-shifts to amortize--except in the memory addressing arena
where scaled indexing saves instructions and cycles.
My thoughts were something along the lines of having fat instructions
like a 3-in 2-out 2-op instruction that does:
Rd1 <= Rs1 OP1 Rs2;
Rd2 <= Rs3 OP2 Rd1;
so your datapath has two ALUs back to back in a single cycle.
And the problem is that it's often hard to find something useful to do in that
OP2.
To increase the use of OP2 you need to allow as many combinations
of OP1 and OP2 as possible, and that quickly bumps into the constraint that
OP1+OP2 are done in a single cycle, so neither OP1 nor OP2 can usefully be
memory accesses or control flow operations.
Those 2 ALUs would likely lengthen the cycle by significantly more than
your single gate of delay, so it's important for OP2 to do useful work
most of the time, otherwise we just increased the average N2.
[ And then there's the impact of 3-in 2-out on the pipeline, and the fact
that such a multi-op instruction doesn't fit in 32bit, of course. ]
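A toy C model of the 3-in/2-out semantics being discussed (opcode names and the register-file representation are invented purely for illustration):

    #include <stdint.h>

    typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR } op_t;

    static uint64_t alu(op_t op, uint64_t a, uint64_t b)
    {
        switch (op) {
        case OP_ADD: return a + b;
        case OP_SUB: return a - b;
        case OP_AND: return a & b;
        case OP_OR:  return a | b;
        default:     return a ^ b;
        }
    }

    /* One architectural step of  Rd1 <= Rs1 OP1 Rs2;  Rd2 <= Rs3 OP2 Rd1.
       In hardware the two ALUs sit back to back in the same cycle. */
    static void fused2(uint64_t r[32], op_t op1, op_t op2,
                       int rd1, int rd2, int rs1, int rs2, int rs3)
    {
        uint64_t t = alu(op1, r[rs1], r[rs2]);   /* first ALU             */
        r[rd1] = t;
        r[rd2] = alu(op2, r[rs3], t);            /* second, dependent ALU */
    }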
Stefan
On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
John Savard <quadibloc@invalid.invalid> wrote:
On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually correspond to the end of the previous instruction,
Oh, yes, I had always realized that, but dismissed it as far too
wasteful.
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
Also, consider that alternative to variable length instructions is to
use longer instructions or more of them.
What I did instead was use variable-length instructions, but add a prefix
at the beginning of any 256-bit block of instructions that contained them,
which directly showed where each instruction began.
My intent was to avoid the disadvantages you identify for fixed-length instructions, but avoid the disadvantage of variable-length instructions
too.
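For concreteness, a sketch of how such a prefix could be consumed (my
guess at a format, not necessarily Savard's: one bit per 16-bit parcel
of the 256-bit block, set where an instruction starts). Once the mask is
read, every start offset is known before any instruction is examined,
so the decoders can all go to work in parallel:

    #include <stdint.h>

    /* start_mask: bit i set => parcel i (16 bits) begins an instruction.
       Returns the number of instructions and their bit offsets within
       the 256-bit block. */
    static int find_starts(uint16_t start_mask, int starts[16])
    {
        int n = 0;
        for (int i = 0; i < 16; i++)
            if (start_mask & (1u << i))
                starts[n++] = i * 16;
        return n;
    }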
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here... when I read this, I thought that there was a standard technique
for doing stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no
different than any other calculation, except that no mangling of the
bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you would
know all about that, there must be a reason why it doesn't apply in these
cases.
x86 has short/small MOV instructions. Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, etc.
The 2-operand+displacement LD/STs have a lock bit in the instruction--that
is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and it's ESM, not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On 12/28/2025 8:22 AM, John Savard wrote:
On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
John Savard <quadibloc@invalid.invalid> wrote:
On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,
Oh, yes, I had always realized that, but dismissed it as far too
wasteful.
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
Also, consider that alternative to variable length instructions is to
use longer instructions or more of them.
What I did instead was use variable-length instructions, but add a prefix at the beginning of any 256-bit block of instructions that contained them which directly showed where each instruction began.
My intent was to avoid the disadvantages you identify for fixed-length instructions, but avoid the disadvantage of variable-length instructions too.
I understand your goal, however . . .
How many bits do you "waste" on the prefix?
Since I think any branch must target the beginning of a block, and in general, a routine will not end on a block boundary, there will be
"wasted" bits at the end of the last block before a "label". Have you determined for a "typical" program, how many bits are wasted due to this?
The point I am making is that you will "cancel" at least some of the
savings of 16 bit instructions. You should take this into account
before committing to your plan.
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the >>>>>> constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken >>>>> here...
when I read this, I thought that there was a standard technique for >>>>> doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no >>>> different than any other calculation, except that no mangling of the >>>> bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you >>>>> would
know all about that, there must be a reason why it doesn't apply in >>>>> these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember >>> something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect..
The 2-operand+displacement LD/STs have a lock bit in the instruction--
that
is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier >> {A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
impl. If we are on another system and that LOCK XADD is some sort of LL/SC
"style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the >>>>>>>> constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken >>>>>>> here...
when I read this, I thought that there was a standard technique for >>>>>>> doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no >>>>>> different than any other calculation, except that no mangling of the >>>>>> bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you >>>>>>> would
know all about that, there must be a reason why it doesn't apply in >>>>>>> these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, etc.
The 2-operand+displacement LD/STs have a lock bit in the instruction--that
is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier >>>> {A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
impl. If we on another system and that LOCK XADD is some sort of LL/SC
"style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size [address]
ADD Rd,--,#1
Will attempt an ATOMIC add to the L1 cache. If the line is writeable, the ADD is
performed and the line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state, the add is performed and
the previous value returned. If [address] is not writeable, the cache
line is invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified}, which is typical.}
When [address] reaches the Memory-Controller it is scheduled in arrival
order, other caches system-wide will receive CI, and modified lines
will be pushed back to the DRAM-Controller. When the CI is "performed" the MC/
DRC will perform the add of #1 to [address] and the previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
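For reference, the software-level shape of the operation being discussed,
in C11 (an editor's example, not My 66000 code): on x86 this can lower to
a single LOCK XADD, and on an ISA with an add-to-memory like the one above
it can likewise stay a single, loop-free instruction; on a pure LL/SC
machine it generally becomes a retry loop instead.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Returns the previous value, like XADD / the ATOMIC ADD above. */
    static uint64_t fetch_add_prev(_Atomic uint64_t *ctr, uint64_t v)
    {
        return atomic_fetch_add_explicit(ctr, v, memory_order_seq_cst);
    }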
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:The 2-operand+displacement LD/STs have a lock bit in the instruction-- >>>>> that
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the >>>>>>>>> constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm >>>>>>>> mistaken
here...
when I read this, I thought that there was a standard technique for >>>>>>>> doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no >>>>>>> different than any other calculation, except that no mangling of the >>>>>>> bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you >>>>>>>> would
know all about that, there must be a reason why it doesn't apply in >>>>>>>> these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I
remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect.. >>>>>
is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-
Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless >>> impl. If we on another system and that LOCK XADD is some sort of LL/SC
"style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size  [address]
ADD                    Rd,--,#1
Will attempt a ATOMIC add to L1 cache. If line is writeable, ADD is
performed and line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state the add is performed and
the previous value returned. If [address] is not writeable the cache
line in invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is typical.}
When [address] reached Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 12/28/2025 8:22 AM, John Savard wrote:
On Mon, 22 Dec 2025 20:00:06 +0000, Waldek Hebisch wrote:
John Savard <quadibloc@invalid.invalid> wrote:
On Thu, 18 Dec 2025 22:25:08 +0000, Anton Ertl wrote:
It is certainly possible to decode potential instructions at every
starting position in parallel, and later select the ones that actually
correspond to the end of the previous instruction,
Oh, yes, I had always realized that, but dismissed it as far too
wasteful.
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
Also, consider that alternative to variable length instructions is to
use longer instructions or more of them.
What I did instead was use variable-length instructions, but add a prefix >>> at the beginning of any 256-bit block of instructions that contained them >>> which directly showed where each instruction began.
My intent was to avoid the disadvantages you identify for fixed-length
instructions, but avoid the disadvantage of variable-length instructions >>> too.
I understand your goal, however . . .
How many bits do you "waste" on the prefix?
For code like:: (undoctored compiler output)
.LBB1_34:
mov r1,#0
br .LBB1_35
.LBB1_39:
mov r1,#28
exit r16,r0,0,32
.LBB1_36:
mov r1,#0
exit r16,r0,0,32
.LBB1_37:
call processRestart
beq0 r1,.LBB1_38
.LBB1_35:
exit r16,r0,0,32
.LBB1_38:
lduh r1,[ip,gRestartsLeft]
br .LBB1_2
.Lfunc_end1:
You are going to eat a lot of headers for 2 instruction BBs.
Also:: Can you return to the middle of a block ??
Can you make multiple CALLs from a single block ??
Can you fit an entire loop in a single block ??
Can you fit a loop with calls in a single block ??
Does each unique label of a switch(i) require its own block ??
Since I think any branch must target the beginning of a block, and in
general, a routine will not end on a block boundary, there will be
"wasted" bits at the end of the last block before a "label". Have you
determined for a "typical" program, how many bits are wasted due to this?
The point I am making is that you will "cancel" at least some of the
savings of 16 bit instructions. You should take this into account
before committing to your plan.
I suspect the block-boundaries will consume more space than the 16-bit instructions can possibly save.
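A back-of-envelope way to quantify that concern (my arithmetic, using
Stephen's assumption that every branch target starts a fresh 256-bit block,
and an assumed header size H bits, which the scheme as posted does not pin
down): a basic block of n instruction bits occupies ceil((n + H) / 256) * 256
bits, so the two-instruction blocks above pay for a header plus the padding
out to the next 256-bit boundary.

    #include <stdint.h>

    /* Footprint of one basic block under the block-header scheme,
       assuming each branch target must begin a new 256-bit block. */
    static uint64_t block_footprint_bits(uint64_t insn_bits, uint64_t header_bits)
    {
        uint64_t total = insn_bits + header_bits;
        return ((total + 255) / 256) * 256;
    }
    /* e.g. two 16-bit instructions with a 16-bit header:
       block_footprint_bits(32, 16) == 256, i.e. 208 bits of padding. */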
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:The 2-operand+displacement LD/STs have a lock bit in the instruction-- >>>>>> that
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the >>>>>>>>>> constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for >>>>>>>>> doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no >>>>>>>> different than any other calculation, except that no mangling of the >>>>>>>> bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you >>>>>>>>> would
know all about that, there must be a reason why it doesn't apply in >>>>>>>>> these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember >>>>>>> something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect.. >>>>>>
is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant. >>>>>>
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless >>>> impl. If we on another system and that LOCK XADD is some sort of LL/SC >>>> "style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size [address]
ADD Rd,--,#1
Will attempt a ATOMIC add to L1 cache. If line is writeable, ADD is
performed and line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state the add is performed and
the previous value returned. If [address] is not writeable the cache
line in invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is typical.} >>>
When [address] reached Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "bus" traffic under
contention. So you DON'T DO IT LIKE THAT.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
A bit of statistics on that.
Using a primitive Perl script to catch occurrences, on a recent
My 66000 compiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
My thoughts were something along the lines of having fat instructions
like a 3-in 2-out 2-op instruction that does:
Rd1 <= Rs1 OP1 Rs2;
Rd2 <= Rs3 OP2 Rd1;
so your datapath has two ALUs back to back in a single cycle.
SuperSPARC tried this, it does not work "all that well".
One might notice that none of the SPARC generations were anywhere close to the frequency of the more typical RISCs.
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
A bit of statistics on that.
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The % associations you measured above might just be coincidence.
I have assumed for a compiler to choose between two instruction formats,
a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
that the register allocator would check if either operand was alive after
the OP, and if not then that source register can be reused as the dest.
For some ISA that may allow a shorter instruction format to be used.
Your stats above assume the compiler is performing this optimization
but since My 66000 does not have short format instructions the compiler
would have no reason to do so. Or the compiler might be doing this
optimization anyway for other ISAs such as x86/x64 which do have
shorter formats.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3- register formats like RV where there
is an incentive to do this optimization might provide stats confirmation.
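A sketch of the choice EricP describes, under the assumption that
per-operand liveness is already known (hypothetical types and names, not
taken from any real compiler): if a source register dies at the op, the
allocator can give the destination that same register, which is what
enables the shorter 2-register Rsd = Rsd OP Rs encoding.

    /* 3-register op rd = rs1 OP rs2, with liveness of each source
       immediately after this op. */
    typedef struct {
        int rs1, rs2;
        int rs1_dead, rs2_dead;
        int commutative;
    } op3;

    /* Pick a destination register: reuse a dying source if possible
       (short encoding), otherwise fall back to a fresh register and
       the 3-register form. */
    static int choose_dest(const op3 *op, int fresh_reg)
    {
        if (op->rs1_dead)                    return op->rs1;
        if (op->rs2_dead && op->commutative) return op->rs2;
        return fresh_reg;
    }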
Thomas Koenig wrote:
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3- register formats like RV where there
is an incentive to do this optimization might provide stats confirmation.
I wonder if there have been other studies to explore other impacts
such as run time, or cache miss rate.
On 12/28/2025 5:53 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:The 2-operand+displacement LD/STs have a lock bit in the instruction-- >>>>>> that
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the >>>>>>>>>> constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for >>>>>>>>> doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no >>>>>>>> different than any other calculation, except that no mangling of the >>>>>>>> bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you >>>>>>>>> would
know all about that, there must be a reason why it doesn't apply in >>>>>>>>> these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect.. >>>>>>
is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant. >>>>>>
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless >>>> impl. If we on another system and that LOCK XADD is some sort of LL/SC >>>> "style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size [address]
ADD Rd,--,#1
Will attempt a ATOMIC add to L1 cache. If line is writeable, ADD is
performed and line updated. Otherwise, the Add-to-memory #1 is shipped >>> out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state the add is performed and
the previous value returned. If [address] is not writeable the cache
line in invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is typical.} >>>
When [address] reached Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under contention. So you DON"T DO IT LIKE THAT.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
IMHO:
No-Cache + CAS is probably a better bet than LL/SC;
LL/SC: Depends on the existence of explicit memory-coherency features.
No-Cache + CAS: Can be made to work independent of the underlying memory model.
Granted, No-Cache is its own feature:
Need some way to indicate to the L1 cache that special handling is
needed for this memory access and cache line (that it should not use a previously cached value and should be flushed immediately once the
operation completes).
But, No-Cache behavior is much easier to fake on a TSO capable memory subsystem, than it is to accurately fake LL/SC on top of weak-model write-back caches.
If the memory system implements TSO or similar, then one can simply
ignore the No-Cache behavior and achieve the same effect.
..
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
A bit of statistics on that.
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The % associations you measured above might just be coincidence.
I have assumed for a compiler to choose between two instruction formats,
a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
that the register allocator would check if either operand was alive after
the OP, and if not then that source register can be reused as the dest.
For some ISA that may allow a shorter instruction format to be used.
Your stats above assume the compiler is performing this optimization
but since My 66000 does not have short format instructions the compiler
would have no reason to do so. Or the compiler might be doing this optimization anyways for other ISA such as x86/x64 which do have
shorter formats.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3- register formats like RV where there
is an incentive to do this optimization might provide stats confirmation.
I wonder if there have been other studies to explore other impacts
such as run time, or cache miss rate.
The difficulty there is standardising the input data, and normalising
processor performance, memory bandwidth and latency, etc.
Code segment size is much easier to measure.
BTW, when discussing ISA compactness, I usually see it measured by
comparing the size of the code segment in typical executables.
I understand that it's as good a measure as any and it's one that's
fairly easily available, but at the same time it's not necessarily one
that actually matters since I expect that it affects a usually fairly
small proportion of the total ROM/Flash/disk space.
I wonder if there have been other studies to explore other impacts such
as run time, or cache miss rate.
Stefan
EricP [2025-12-29 09:54:30] wrote:
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
A bit of statistics on that.
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The % associations you measured above might just be coincidence.
I have assumed for a compiler to choose between two instruction formats,
a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
that the register allocator would check if either operand was alive after the OP, and if not then that source register can be reused as the dest.
For some ISA that may allow a shorter instruction format to be used.
Your stats above assume the compiler is performing this optimization
but since My 66000 does not have short format instructions the compiler would have no reason to do so. Or the compiler might be doing this optimization anyways for other ISA such as x86/x64 which do have
shorter formats.
So the % numbers you measured might just be coincidence and could be low. An ISA with both short 2- and long 3- register formats like RV where there is an incentive to do this optimization might provide stats confirmation.
My thoughts were something along the lines of having fat instructions
like a 3-in 2-out 2-op instruction that does:
Rd1 <= Rs1 OP1 Rs2;
Rd2 <= Rs3 OP2 Rd1;
so your datapath has two ALUs back to back in a single cycle.
SuperSPARC tried this, it does not work "all that well".
Do you have a reference to that? I can't see any trace of that in the
SPARC ISA, so I assume it was done via instruction fusion instead?
One might notice that None of the SPARC generations were anywhere close to the frequency of the more typical RISCs.
Hmm... I remember Sun being slower to move to OoO, but in terms of
frequency I thought they were mostly on par with other RISCs of the time
(and back then, SPARC was one of the top two "typical RISCs", AFAIK).
Stefan
EricP <ThatWouldBeTelling@thevillage.com> writes:
Thomas Koenig wrote:
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The RISC-V people brag about how little their compressed encoding
costs to decode; IIRC it's in the hundreds of something (not sure if transistors or gates). Of course, with superscalar decoding the
compressed instruction set costs additional decoders plus logic to
select which decodings do not belong to actual instructions, but
that's true for any 16+32-bit encoding, however simple.
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16bit.
A bit of statistics on that.
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The % associations you measured above might just be coincidence.
I have assumed for a compiler to choose between two instruction formats,
a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
that the register allocator would check if either operand was alive after
the OP, and if not then that source register can be reused as the dest.
For some ISA that may allow a shorter instruction format to be used.
Your stats above assume the compiler is performing this optimization
but since My 66000 does not have short format instructions the compiler
would have no reason to do so. Or the compiler might be doing this optimization anyways for other ISA such as x86/x64 which do have
shorter formats.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3- register formats like RV where there
is an incentive to do this optimization might provide stats confirmation.
Do you have a reference to that? I can't see any trace of that in the
SPARC ISA, so I assume it was done via instruction fusion instead?
It is not in ISA, and it is not "like" instruction Fusion, either.
When a first instruction had a property*, and a second instruction also
had a certain property*, they would be issued together into the execution pipeline. The first instruction executes in the first cycle, the second instruction in the second cycle with forwarding of the result of the first
to the second.
On 12/28/2025 5:53 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within >>>>>>>>>>> the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm >>>>>>>>>> mistaken
here...
when I read this, I thought that there was a standard
technique for
doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That >>>>>>>>> is no
different than any other calculation, except that no mangling >>>>>>>>> of the
bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since >>>>>>>>>> you
would
know all about that, there must be a reason why it doesn't >>>>>>>>>> apply in
these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I >>>>>>>> remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, >>>>>>>> ect..
The 2-operand+displacement LD/STs have a lock bit in the
instruction--
that
is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant. >>>>>>>
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction- >>>>>>> Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a
loopless
impl. If we on another system and that LOCK XADD is some sort of LL/SC >>>>> "style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size  [address]
ADD                    Rd,--,#1
Will attempt a ATOMIC add to L1 cache. If line is writeable, ADD is
performed and line updated. Otherwise, the Add-to-memory #1 is shipped >>>> out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state the add is performed and
the previous value returned. If [address] is not writeable the cache
line in invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is
typical.}
When [address] reached Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under
contention. So you DON"T DO IT LIKE THAT.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not
Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
IMHO:
No-Cache + CAS is probably a better bet than LL/SC;
LL/sC: Depends on the existence of explicit memory-coherency features. No-Cache + CAS: Can be made to work independent of the underlying memory model.
Granted, No-Cache is its own feature:
Need some way to indicate to the L1 cache that special handling is
needed for this memory access and cache line (that it should not use a previously cached value and should be flushed immediately once the
operation completes).
But, No-Cache behavior is much easier to fake on a TSO capable memory subsystem, than it is to accurately fake LL/SC on top of weak-model write-back caches.
If the memory system implements TSO or similar, then one can simply
ignore the No-Cache behavior and achieve the same effect.
...
Fwiw, I noticed that a certain compiler was implementing LOCK XADD with
a LOCK CMPXCHG loop and got a little pissed. Had to tell them about it:
read all when you get some free time to burn:
https://forum.pellesc.de/index.php?topic=7167.msg27217#msg27217
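For anyone wondering what the objectionable lowering looks like, here is a
C11 sketch of both shapes (an editor's example, not the compiler's actual
output): atomic_fetch_add can map to a single LOCK XADD, while the
emulation Chris found spins on compare-exchange and is no longer loop-free
under contention.

    #include <stdatomic.h>

    /* Preferred: maps to LOCK XADD (or an add-to-memory) on ISAs
       that have one; returns the previous value. */
    static long xadd_direct(_Atomic long *p, long v)
    {
        return atomic_fetch_add(p, v);
    }

    /* The emulation: correct, but a retry loop rather than a single
       wait-free instruction. */
    static long xadd_via_cas(_Atomic long *p, long v)
    {
        long old = atomic_load_explicit(p, memory_order_relaxed);
        while (!atomic_compare_exchange_weak(p, &old, old + v))
            ;  /* 'old' is reloaded by the failed exchange */
        return old;
    }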
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
Not a typo--the part of the pipeline which is <dynamically> narrowest is
the part that limits performance. I suggest strongly that you should not make/allow the decoder to play that part.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Thomas Koenig wrote:
Using a primitive Perl script to catch occurences, on a recent
My 66000 cmopiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
The RISC-V people brag about how little their compressed encoding
costs to decode; IIRC it's in the hundreds of something (not sure if transistors or gates). Of course, with superscalar decoding the
compressed instruction set costs additional decoders plus logic to
select which decodings do not belong to actual instructions, but
that's true for any 16+32-bit encoding, however simple.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3- register formats like RV where there >> is an incentive to do this optimization might provide stats confirmation.
I have done the following on a RV64GC system with Fedora 33:
objdump -d /lib64/lp64d/libperl.so.5.32|grep '^ *[0-9a-f]*:'|awk '{print length($2)}'|sort|uniq -c
215782 4
179493 8
16-bit instructions are reported as 4 (4 hex digits), 32-bit
instructions are reported as 8.
If the actual binary /usr/bin/perl is meant, here's the stats for that:
objdump -d /usr//bin/perl|grep '^ *[0-9a-f]*:'|awk '{print length($2)}'|sort|uniq -c
105 4
167 8
gnuplot is not installed, and GSL is not installed, either, whatever
it may be.
Just to widen the basis, here are a few more:
zstd:
129569 4
134985 8
git:
305090 4
274053 8
/usr/lib64/libc-2.32.so:
142208 4
113455 8
So the percentage of 16-bit instructions is a lot higher than for the
schemes that Thomas Koenig has looked at.
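(Concretely--my arithmetic from the counts above--that is roughly 55%
16-bit instructions for libperl, 49% for zstd, 53% for git, and 56% for
libc, versus the 15-24% measured for the shapes counted earlier.)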
Another way to approach this question is to look at the current
champion of fixed instruction width, ARM A64, consider those
instructions (and addressing modes) that ARM A64 has and RISC-V does
not have, and look at how often they are used, and how many RISC-V instructions are needed to replace them.
In any case, code density measurements show that both result in
compact code, with RV64GC having more compact code, and actually
having the most compact code among the architectures present in all
rounds of my measurements where RV64GC was present.
But code size is not everything. For ARM A64, you pay for it by the increased complexity of implementing these instructions (in particular
the many register ports) and addressing modes. For bigger
implementations, instruction combining means additional front-end
effort for RISC-V, and then maybe similar implementation effort for
the combined instructions as for ARM A64 (but more flexibility in
selecting which instructions to combine). And, as mentioned above,
the additional decoding effort.
When we look at actual implementations, RISC-V has not reached the
widths that ARM A64 has reached, but I guess that this is more due to
the current potential markets for these two architectures than due to technical issues. RISC-V seems to be pushing into server space
lately, so we may see wider implementations in the not-too-far future.
- anton
I wonder if there have been other studies to explore other impacts
such as run time, or cache miss rate.
The difficulty there is standardising the input data, and normalising
processor performance, memory bandwidth and latency, etc.
I was thinking of those "compressed" variants of ISAs, such as Thumb,
Thumb2, MIPS16e, microMIPS, or the "C" option of RISC-V, where you can compare with/without on the very same machine since all the half-size instructions are also available in full-size.
Code segment size is much easier to measure.
Yes, but!
Stefan
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
Things like ALU status flags aren't free either.
Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.
Major limitations here being more:
Things like register forwarding cost have non-linear scaling;
For an in-order machine, usable ILP drops off very rapidly;
...
There seems to be a local optimum between 2 and 3.
Say, for example, if one had an in-order machine with 5 ALUs, one would
be hard pressed to find much code that could actually make use of the 5
ALUs. One can sorta make use of 3 ALUs, but even then, the 3rd lane is
more often useful for spare register ports and similar (with 3-wide ALU
being a minority case).
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is done
it flips back to causal consistency.
TSO is cycle-wasteful.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which
then reduces ILP due to register conflicts. So, smaller code at the
expense of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a compressed encoding.
Write-after-read and write-after-write do not reduce the IPC of OoO implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And
designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.
Things like ALU status flags aren't free either.
Yes, they cost their own renaming resources.
Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.
Major limitations here being more:
Things like register forwarding cost have non-linear scaling;
For an in-order machine, usable ILP drops off very rapidly;
...
ILP is a property of a program. I assume that what you mean is that
the IPC benefits of more width have quickly diminishing returns on
in-order machines.
There seems to be a local optimum between 2 and 3.
Say, for example, if one had an in-order machine with 5 ALUs, one
would be hard pressed to find much code that could actually make use
of the 5 ALUs. One can sorta make use of 3 ALUs, but even then, the
3rd lane is more often useful for spare register ports and similar
(with 3-wide ALU being a minority case)
We have some interesting case studies: The Alpha 21164(a) and the ARM Cortex-A53 and A55. They all are in-order designs, their number of functional units are pretty similar, and, in particular, they all have
2 integer ALUs. But the 21164 can decode and execute 4 instructions
per cycle, while the Cortex-A53 and A55 are only two-wide. My guess
is that this is due to the decoding cost of ARM A32/T32 and A64
(decoders for two instruction sets, one of which has 16-bit and 32-bit instructions).
The Cortex-A55 was succeeded by the A510, which is three-wide, and
that was succeeded by the A520, which is three-wide with two ALUs and supports only ARM A64.
Widening the A510, which still supports both instruction sets is
(weak) counterevidence for my theory about why A53/A55 are only
two-wide at decoding. The fact that the A520 returns to two integer
ALUs indicates that the third integer ALU provides little IPC benefit
in an in-order design.
- anton
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is done
it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
Thomas Koenig <tkoenig@netcologne.de> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
"All swans are white" has been "experimentally verified" by finding
one white swan. The existence of black swans shows that such
"experimental verifications" are fallacies.
Actually the existence of a weak-memory-model mode on specific
hardware makes it very likely that TSO is slower than the weak model
on that hardware. If TSO was implemented to provide the same speed,
there would be no need for also providing a weaker memory ordering
mode on that hardware.
Similarly, the introduction of the TRAPB instruction on the Alpha, and
the fact that using it through -mieee-fp slows down execution on the 21064-21164A could be construed as "experimental verification" of the
claims "IEEE FP (in particular denormal numbers) is cycle-wasteful"
and "precise FP exceptions are cycle-wasteful". Then the black swan
(21264) appeared, where TRAPB is a noop, and it outperformed all
earlier Alphas.
Note that "some Nvidia and Fujitsu [ARM architecture] implementations
run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
that the Fujitsu implementations of the ARM architecture are used in supercomputers, it is unlikely that their TSO implementation is cycle-wasteful.
- anton
Do you happen to have benchmarks that compare performance of Alpha EV5
vs in-order Cortex-A ?
Thomas Koenig <tkoenig@netcologne.de> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
"All swans are white" has been "experimentally verified" by finding one
white swan. The existence of black swans shows that such
"experimental verifications" are fallacies.
Actually the existence of a weak-memory-model mode on specific
hardware makes it very likely that TSO is slower than the weak model
on that hardware. If TSO was implemented to provide the same speed,
there would be no need for also providing a weaker memory ordering
mode on that hardware.
Similarly, the introduction of the TRAPB instruction on the Alpha, and
the fact that using it through -mieee-fp slows down execution on the 21064-21164A could be construed as "experimental verification" of the
claims "IEEE FP (in particular denormal numbers) is cycle-wasteful"
and "precise FP exceptions are cycle-wasteful". Then the black swan
(21264) appeared, where TRAPB is a noop, and it outperformed all
earlier Alphas.
Note that "some Nvidia and Fujitsu [ARM architecture] implementations
run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
that the Fujitsu implementations of the ARM architecture are used in supercomputers, it is unlikely that their TSO implementation is cycle-wasteful.
Michael S <already5chosen@yahoo.com> writes:
Do you happen to have benchmarks that compare performance of Alpha
EV5 vs in-order Cortex-A ?
LaTeX benchmark results (lower is better):
Alpha:
- 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5)               8.1
ARM A32/T32:
- Raspberry Pi 3, Cortex A53 1.2GHz, Raspbian 8                   5.46
ARM A64:
- Rockpro64 (1416MHz Cortex A53), Debian 9 (Stretch)              3.24
- Odroid N2 (1896MHz Cortex A53), Ubuntu 18.04                    2.488
- Odroid C2 (1536MHz Cortex A53), Ubuntu 16.04                    2.32
- Rock 5B (1805MHz A55), Debian 11 (texlive-latex-recommended)    2.105
A problem with the LaTeX benchmark is that its performance is
significantly influenced by the LaTeX installation (newer versions
need more instructions, and having more packages needs more
instructions). But these are the only benchmark results I have.
- anton
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is done
it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning
avoidance of RFO (read for ownership) which is an elephant in the room
of discussion of advantages of Arm MOM/MCM over TSO.
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning avoidance of RFO (read for ownership) which is an elephant in the
room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to >overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
Michael S <already5chosen@yahoo.com> writes:
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to
overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
Any sequence of stores without intervening loads can be turned into
one store under sequential consistency, and therefore also under the
weaker TSO. Doing that for a sequence that stores into one cache line
does not appear particularly heroic to me. The question is how much
benefit one gets from this optimization.
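For what it's worth, a minimal C sketch of the kind of store sequence in question (the type and function names are mine, and a 64-byte cache line is assumed): a run of back-to-back 64-bit stores with no intervening loads, which a core or compiler could in principle merge into a single full-line write, making the RFO's data transfer unnecessary.

    #include <stdint.h>

    /* Hypothetical 64-byte object assumed to map onto one cache line
       when suitably aligned. */
    typedef struct {
        uint64_t w[8];                  /* 8 x 8 bytes = 64 bytes */
    } line64_t;

    /* Overwrites the whole (assumed) line with stores only, no loads in
       between; this is the kind of sequence that could be collapsed into
       one full-line write under TSO or sequential consistency. */
    void fill_line(line64_t *p, uint64_t v)
    {
        for (int i = 0; i < 8; i++)
            p->w[i] = v;
    }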
On 12/28/2025 4:41 PM, BGB wrote:
[...]
Also, if using something like LOCK CMPXCHG you MUST make sure to align
and pad your relevant data structures to an L2 cache line.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
"All swans are white" has been "experimentally verified" by finding one
white swan. The existence of black swans shows that such
"experimental verifications" are fallacies.
If you have any counter-examples, please feel free to cite them.
Actually the existence of a weak-memory-model mode on specific
hardware makes it very likely that TSO is slower than the weak model
on that hardware. If TSO was implemented to provide the same speed,
there would be no need for also providing a weaker memory ordering
mode on that hardware.
That makes little sense. TSO fulfills all the requirements of the
ARM memory model, and adds some on top. It is possible to create
an ARM CPU which uses TSO, as you wrote below. If that had
the same performance as a CPU running on the pure ARM memory model,
there would be no reason to implement the ARM memory model at all.
So, two alternatives:
a) Apple engineers did not have the first clue what they were doing
b) Apple engineers knew what they were doing
Given that Apple silicon seems to be competently done, I personally
think that option b) is the better one. You obviously prefer
option a).
Note that "some Nvidia and Fujitsu [ARM architecture] implementations
run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
that the Fujitsu implementations of the ARM architecture are used in
supercomputers, it is unlikely that their TSO implementation is
cycle-wasteful.
So, you're saying that Apple is clueless, and that Nvidia and Fujitsu
make that decision solely on the basis of speed?
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning
avoidance of RFO (read for ownership) which is an elephant in the
room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
With weaker MOM the core has option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like Neoverse
N1, are able to do it.
I can imagine heroic microarchitecture that achieves the same effect
with TSO, but it seems that so far nobody did it.
On Tue, 30 Dec 2025 11:13:37 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
Do you happen to have benchmarks that compare performance of Alpha
EV5 vs in-order Cortex-A ?
LaTeX benchmark results (lower is better):
Alpha:
- 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5)               8.1
ARM A32/T32:
- Raspberry Pi 3, Cortex A53 1.2GHz, Raspbian 8                   5.46
ARM A64:
- Rockpro64 (1416MHz Cortex A53), Debian 9 (Stretch)              3.24
- Odroid N2 (1896MHz Cortex A53), Ubuntu 18.04                    2.488
- Odroid C2 (1536MHz Cortex A53), Ubuntu 16.04                    2.32
- Rock 5B (1805MHz A55), Debian 11 (texlive-latex-recommended)    2.105
A problem with the LaTeX benchmark is that its performance is
significantly influenced by the LaTeX installation (newer versions
need more instructions, and having more packages needs more
instructions). But these are the only benchmark results I have.
- anton
Thank you.
Two 64-bit A53 results are about the same as EV5 clock for clock, and one
result is significantly better.
So, either wide in-order is indeed not a bright idea, or the 21164 suffers
because of the inferiority of the Alpha ISA relative to ARM64.
BTW, the Odroid C2 score appears suspiciously good. Could it be that the
actual turbo clock frequency was much higher than reported?
BGB <cr88192@gmail.com> posted:
On 12/28/2025 5:53 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) one needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations one needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here... when I read this, I thought that there was a standard technique
for doing stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no
different than any other calculation, except that no mangling of the
bits is going on.
Just break down all the fancy instructions into RISC-style pseudo-ops.
But apparently, since you would know all about that, there must be a
reason why it doesn't apply in these cases.
x86 has short/small MOV instructions. Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc.
The 2-operand+displacement LD/STs have a lock bit in the instruction--that
is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and it's ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
impl. If we are on another system and that LOCK XADD is some sort of LL/SC
"style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size [address]
ADD Rd,--,#1
Will attempt an ATOMIC add to the L1 cache. If the line is writeable, the ADD is
performed and the line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable state the add is performed and
the previous value returned. If [address] is not writeable the cache
line is invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is typical.}
When [address] reached Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under
contention. So you DON'T DO IT LIKE THAT.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not
Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
IMHO:
No-Cache + CAS is probably a better bet than LL/SC;
LL/SC: Depends on the existence of explicit memory-coherency features.
No-Cache + CAS: Can be made to work independent of the underlying memory
model.
Granted, No-Cache is its own feature:
Need some way to indicate to the L1 cache that special handling is
needed for this memory access and cache line (that it should not use a
previously cached value and should be flushed immediately once the
operation completes).
But, No-Cache behavior is much easier to fake on a TSO capable memory
subsystem, than it is to accurately fake LL/SC on top of weak-model
write-back caches.
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is done
it flips back to causal consistency.
TSO is cycle-wasteful.
If the memory system implements TSO or similar, then one can simply
ignore the No-Cache behavior and achieve the same effect.
..
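As a side note, here is a minimal C11 sketch (the function names are mine, not from the thread) of the two shapes discussed above: the direct atomic fetch-and-add, which x86 can compile to a loopless LOCK XADD, versus a CAS-loop emulation, which is lock-free but not wait-free since a thread may retry under contention.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Loopless on x86: typically lowered to LOCK XADD. */
    uint64_t fetch_add_direct(_Atomic uint64_t *ctr)
    {
        return atomic_fetch_add_explicit(ctr, 1, memory_order_seq_cst);
    }

    /* CAS-loop emulation: lock-free, but a thread can in principle retry
       indefinitely under contention, so it is not wait-free. */
    uint64_t fetch_add_cas(_Atomic uint64_t *ctr)
    {
        uint64_t old = atomic_load_explicit(ctr, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(ctr, &old, old + 1,
                   memory_order_seq_cst, memory_order_relaxed))
            ;   /* 'old' is refreshed with the current value on failure */
        return old;
    }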
On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
Not a typo--the part of the pipeline which is <dynamically> narrowest is
the part that limits performance. I suggest strongly that you should not
make/allow the decoder to play that part.
I agree - and strongly, too - that the decoder ought not to be the part
that limits performance.
But what I quoted says that the execution unit ought not to be the part
that limits performance, with the implication that it's OK if the decoder does instead. That's why I said it must be a typo.
So I think you need to look a second time at what you wrote; it's natural for people to see what they expect to see, and so I think you looked at
it, and didn't see the typo that was there.
John Savard
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning avoidance of RFO (read for ownership) which is an elephant in the
room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
Michael S wrote:
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning
avoidance of RFO (read for ownership) which is an elephant in the
room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
With weaker MOM the core has option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like Neoverse N1, are able to do it.
I can imagine heroic microarchitecture that achieves the same effect
with TSO, but it seems that so far nobody did it.
I don't see how a ReadForOwnership message can be avoided as it
transfers two things: the ownership state, and the current line data.
Even if the core knows the whole cache line is being overwritten and
doesn't need the line data, it still needs the Owned state transfer.
There would still be a request message, say TakeOwner TKO which
has a smaller reply GiveOwner GVO message and just moves the state.
So the reply is a few less flits.
As I understand it...
Independent of the ReadForOwnership message, the ARM weak coherence model should allow stores to other cache lines to proceed, whereas TSO would require younger stores to (appear to) wait until the older store completes.
The weak coherence model allows the cache to use hit-under-miss for
stores because it doesn't require the store order to different locations
be seen in program order. This allows it to overlap younger store cache
hits with the older ReadForOwnership message, not eliminate it.
On 12/29/2025 1:55 PM, MitchAlsup wrote:
----merciful snip--------------
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is done
it flips back to causal consistency.
TSO is cycle-wasteful.
But, yeah, was not arguing for using TSO here, rather noting that if one
has it, then No-Cache can be ignored for CAS.
But, then again, weak model is cheaper to implement and generally
faster, although explicit synchronization is annoying and such a model
is incompatible with "lock free data structures" (which tend to
implicitly assume that memory accesses occur in the same order as
written and that any memory stores are immediately visible across threads).
But, then again, one is left with one of several options:
Ask that people use a mutex whenever accessing any resource that may be modified between threads and where such modifications are functionally important;
Or, alternatively, use a message passing scheme, where message passing
can potentially be done more cheaply than a mutex (can be done in
premise using No-Cache memory rather than needing an L1 flush).
Well, or use write-through caching, which would mostly have the merit of allowing for cheaper cache flushes (at the expense of being slower in general than write-back caching).
In the case of RISC-V, there are the FENCE and FENCE.I instructions.
In my implementation, they are user-land only, and need to be
implemented as, say:
FENCE traps, and then performs an L1 flush.
FENCE.I traps, and then performs an L1 flush,
then also flushes the I$.
There is CBO, which allows:
CBO.FLUSH Reg
Where Reg gives the address of a cache line, which is then flushed from
the L1 D$. This at least maps over.
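For what it's worth, a minimal C11 sketch of where such FENCE instructions usually come from at the source level (the variable names are hypothetical); compilers for RISC-V typically lower these language-level fences to FENCE instructions, which in the implementation described above would trap and flush the L1.

    #include <stdatomic.h>

    extern _Atomic int ready;   /* hypothetical flag */
    extern int payload;         /* hypothetical data published via the flag */

    void producer(void)
    {
        payload = 42;
        atomic_thread_fence(memory_order_release);   /* typically a FENCE */
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }

    int consumer(void)
    {
        while (!atomic_load_explicit(&ready, memory_order_relaxed))
            ;                                        /* spin until published */
        atomic_thread_fence(memory_order_acquire);   /* typically a FENCE */
        return payload;
    }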
John Savard wrote:
On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
Not a typo--the part of the pipeline which is <dynamically> narrowest is
the part that limits performance. I suggest strongly that you should not
make/allow the decoder to play that part.
I agree - and strongly, too - that the decoder ought not to be the part that limits performance.
But what I quoted says that the execution unit ought not to be the part that limits performance, with the implication that it's OK if the decoder does instead. That's why I said it must be a typo.
So I think you need to look a second time at what you wrote; it's natural for people to see what they expect to see, and so I think you looked at it, and didn't see the typo that was there.
John Savard
There are two kinds of stalls:
stalls in the serial front end I-cache, Fetch or Decode stages because
of *too little work* (starvation due to input latency),
and stalls in the back end Execute or Writeback stages because
of *too much work* (resource exhaustion).
The front end stalls inject bubbles into the pipeline,
whereas back end stalls can allow younger bubbles to be compressed out.
If I have to stall, I want it in the back end.
It has to do with catching up after a stall.
If a core stalls for 3 clocks, then in order to average 1 IPC
it must retire 2 instructions per clock for the next 3 clocks.
And it can only do that if it has a backlog of work ready to execute.
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And
designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.
Things like ALU status flags aren't free either.
Yes, they cost their own renaming resources.
Not particularly hard to go 3-wide or similar on an FPGA with RISC-V.
Major limitations here being more:
Things like register forwarding cost have non-linear scaling;
For an in-order machine, usable ILP drops off very rapidly;
...
ILP is a property of a program. I assume that what you mean is that
the IPC benefits of more width have quickly diminishing returns on
in-order machines.
There seems to be a local optimum between 2 and 3.
Say, for example, if one had an in-order machine with 5 ALUs, one would
be hard pressed to find much code that could actually make use of the 5
ALUs. One can sorta make use of 3 ALUs, but even then, the 3rd lane is
more often useful for spare register ports and similar (with 3-wide ALU
being a minority case)
We have some interesting case studies: The Alpha 21164(a) and the ARM Cortex-A53 and A55. They all are in-order designs, their number of functional units are pretty similar, and, in particular, they all have
2 integer ALUs. But the 21164 can decode and execute 4 instructions
per cycle, while the Cortex-A53 and A55 are only two-wide. My guess
is that this is due to the decoding cost of ARM A32/T32 and A64
(decoders for two instruction sets, one of which has 16-bit and 32-bit instructions).
The Cortex-A55 was succeeded by the A510, which is three-wide, and
that was succeeded by the A520, which is three-wide with two ALUs and supports only ARM A64.
Widening the A510, which still supports both instruction sets is
(weak) counterevidence for my theory about why A53/A55 are only
two-wide at decoding. The fact that the A520 returns to two integer
ALUs indicates that the third integer ALU provides little IPC benefit
in an in-order design.
- anton
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 12/28/2025 4:41 PM, BGB wrote:
[...]
Also, if using something like LOCK CMPXCHG you MUST make sure to align
and pad your relevant data structures to a l2 cache line.
That may not be necessary if there is otherwise no false sharing in
the same cache line. Yes, the operand should be naturally aligned,
(which ensures it is entirely contained within a single cache line),
but there's no reason that other data cannot be stored in the same
cache line, so long as it is unlikely to be accessed by a competing
thread.
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Any OoO machine is also likely to have a lot of RAM and a decent sized
I$, so much of any benefit is likely to go away in this case.
ILP is a property of a program. I assume that what you mean is that
the IPC benefits of more width have quickly diminishing returns on
in-order machines.
The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is affected by the processor implementation.
I did not write anything about the clue of Apple. I don't know much
about the CPUs by Nvidia and Fujitsu. But if there was significant performance to be had by adding a weakly-ordered mode, wouldn't
especially Fujitsu with its supercomputer target have done it?
BGB <cr88192@gmail.com> posted:
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
AUIPC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit
minimum.
Any OoO machine is also likely to have a lot of RAM and a decent
sized I$, so much of any benefit is likely to go away in this case.
On 12/30/2025 12:00 PM, Scott Lurndal wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 12/28/2025 4:41 PM, BGB wrote:
[...]
Also, if using something like LOCK CMPXCHG you MUST make sure to align
and pad your relevant data structures to a l2 cache line.
That may not be necessary if there is otherwise no false sharing in
the same cache line. Yes, the operand should be naturally aligned,
(which ensures it is entirely contained within a single cache line),
but there's no reason that other data cannot be stored in the same
cache line, so long as it is unlikely to be accessed by a competing
thread.
Yes, or the "small brain" option of just making the mutex larger than
the size of the cache line and putting the relevant part in the middle...
struct PaddedMutex_s {
    u64 pad1, pad2, pad3;
    u64 real_part;
    u64 pad4, pad5, pad6;
};
Then say (assuming a 32 byte cache line), no non-pad values can be in
the same cache line as real_part.
Little bigger for a 64 byte cache line, but same general idea.
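A common alternative to over-padding, sketched below in C11 (64 bytes is an assumed line size, and the names are mine): align the structure to the cache-line size and pad it out to exactly one line, so nothing else can share the line with the mutex word.

    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed cache line size */

    struct AlignedMutex {
        _Alignas(CACHE_LINE) uint64_t real_part;     /* starts on a line boundary */
        uint8_t pad[CACHE_LINE - sizeof(uint64_t)];  /* fill out the rest of the line */
    };

    /* One mutex occupies exactly one line, so no other data can share it. */
    _Static_assert(sizeof(struct AlignedMutex) == CACHE_LINE,
                   "mutex should occupy exactly one cache line");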
On Tue, 30 Dec 2025 21:36:29 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
BGB <cr88192@gmail.com> posted:
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
AUIPC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit
minimum.
Any OoO machine is also likely to have a lot of RAM and a decent
sized I$, so much of any benefit is likely to go away in this case.
And where do we have 95% of those small in-order machines? We have them
in flash-based micro-controllers, more often than not without I$, more
often than not running at higher clock than sustainable without wait
states by their program flash with 32-bit data bus. In other words, bottlenecked by instruction fetch before anything else, including
decode.
BGB trained his intuition on soft cores in FPGA, where trade-offs are
completely different. I am a heavy user of soft cores too. But I realize
that 32-bit MCU cores outsell soft cores by more than an order of
magnitude and quite likely by more than 2 orders of magnitude.
Michael S <already5chosen@yahoo.com> writes:
On Tue, 30 Dec 2025 11:13:37 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
Do you happen to have benchmarks that compare performance of Alpha
EV5 vs in-order Cortex-A ?
LaTeX benchmark results (lower is better):
Alpha:
- 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5)               8.1
ARM A32/T32:
- Raspberry Pi 3, Cortex A53 1.2GHz, Raspbian 8                   5.46
ARM A64:
- Rockpro64 (1416MHz Cortex A53), Debian 9 (Stretch)              3.24
- Odroid N2 (1896MHz Cortex A53), Ubuntu 18.04                    2.488
- Odroid C2 (1536MHz Cortex A53), Ubuntu 16.04                    2.32
- Rock 5B (1805MHz A55), Debian 11 (texlive-latex-recommended)    2.105
A problem with the LaTeX benchmark is that its performance is
significantly influenced by the LaTeX installation (newer versions
need more instructions, and having more packages needs more
instructions). But these are the only benchmark results I have.
- anton
Thank you.
Two 64-bit A53 results are about the same as EV5 clock for clock, and one
result is significantly better.
So, either wide in-order is indeed not a bright idea, or the 21164 suffers
because of the inferiority of the Alpha ISA relative to ARM64.
Hard to tell from these results. In addition to the problems
mentioned above there are also differences in cache configuration to
consider. And the A55 does quite a bit better in IPC than the A53,
although it superficially has the same resources.
On 12/30/2025 11:10 AM, BGB wrote:
[...]
But, then again, weak model is cheaper to implement and generally
faster, although explicit synchronization is annoying and such a model
is incompatible with "lock free data structures" (which tend to
implicitly assume that memory accesses occur in the same order as
written and that any memory stores are immediately visible across
threads).
Fwiw, a weak memory model is totally compatible with lock-free data structures. A weak model tends to have the necessary memory barriers to
make them work. Have you ever used a SPARC in RMO mode? Acquire membar
ala std::memory_order_acquire is basically a MEMBAR #LoadStore |
#LoadLoad. A release is MEMBAR #LoadStore | #StoreStore. Those can be
used for the implementation of a mutex. Notice how acquire and release
never need #StoreLoad ordering?
The point is that once we have this flexibility, a lock/wait free algo
can use the right membars for the job. Ideally, the weakest membars they
can use to ensure they are correct in their logic.
[...]
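To illustrate the point with portable C11 atomics rather than SPARC MEMBARs, here is a minimal spinlock sketch (the type and function names are mine): lock uses only acquire ordering and unlock only release ordering, so no #StoreLoad-style barrier is needed anywhere.

    #include <stdatomic.h>

    typedef struct { atomic_flag f; } spinlock_t;   /* init with { ATOMIC_FLAG_INIT } */

    static inline void spin_lock(spinlock_t *l)
    {
        /* acquire ~ MEMBAR #LoadLoad | #LoadStore after the winning test-and-set */
        while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
            ;   /* spin */
    }

    static inline void spin_unlock(spinlock_t *l)
    {
        /* release ~ MEMBAR #LoadStore | #StoreStore before the clearing store */
        atomic_flag_clear_explicit(&l->f, memory_order_release);
    }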
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
"All swans are white" has been "experimentally verified" by finding one
white swan. The existence of black swans shows that such
"experimental verifications" are fallacies.
If you have any counter-examples, please feel free to cite them.
The fact that nobody had counterexamples for the theory "All swans are
white" for a long while did not make that theory true.
But I have actually seen black swans in Australia and elsewhere.
Here's a picture
<https://en.wikipedia.org/wiki/File:Black_Swan_in_Flight_Crop.jpg>
Actually the existence of a weak-memory-model mode on specific
hardware makes it very likely that TSO is slower than the weak model
on that hardware. If TSO was implemented to provide the same speed,
there would be no need for also providing a weaker memory ordering
mode on that hardware.
That makes little sense. TSO fulfills all the requirements of the
ARM memory model, and adds some on top. It is possible to create
an ARM CPU which uses TSO, as you wrote below. If that had
the same performance as a CPU running on the pure ARM memory model,
there would be no reason to implement the ARM memory model at all.
Exactly. So you will only have a TSO mode and a weak mode, if, for
your particular implementation, the weak mode provides a performance advantage over the TSO mode. So it is unlikely that you ever see an implementation with such a mode bit where TSO mode has the same
performance, unless the mode bit is there only for backwards
compatibility and actually has no effect (what TRAPB turned into on
the 21264).
We have ARM implementations with TSO without mode bit. Do you really
need to add a mode bit to them that has no effect and therefore
produces 0% difference in performance to accept them as
counterexample?
So, two alternatives:
a) Apple engineers did not have the first clue what they were doing
b) Apple engineers knew what they were doing
My theory is that Apple engineers first implemented the weak model,
because that's what the specification says, and they were not tasked
to implement something better; there is enough propaganda for weak
memory models around that a hardware engineer might think that that's
the way to go. Later (for the M1) the Apple hardware designers were
asked to implement TSO, and they did not redo the whole memory model
from the ground up, but by doing relatively small changes. And of
course the result is that their TSO mode is slower than their weak
mode, just as -mieee-float on a 21164 is slower than code compiled
without that mode.
Given that Apple silicon seems to be competently done
The 21164 was the fastest CPU during some of its days. And yet the
21264 was faster without needing TRAPB.
I personally
think that option b) is the better one. You obviously prefer
option a).
An absurdity typical of you.
Note that "some Nvidia and Fujitsu [ARM architecture] implementations
run with TSO at all times" <https://lwn.net/Articles/970907/>. Given
that the Fujitsu implementations of the ARM architecture are used in
supercomputers, it is unlikely that their TSO implementation is
cycle-wasteful.
So, you're saying that Apple is clueless, and that Nvidia and Fujitsu
make that decision solely on the basis of speed?
I did not write anything about the clue of Apple. I don't know much
about the CPUs by Nvidia and Fujitsu. But if there was significant performance to be had by adding a weakly-ordered mode, wouldn't
especially Fujitsu with its supercomputer target have done it?
If hardware designers get tasked with implementing TSO (or sequential consistency) with modern transistor budgets, they hopefully come up
with different solutions to various problems than if they are tasked
to do a weak model with a slow TSO option.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
On Tue, 30 Dec 2025 11:13:37 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
Do you happen to have benchmarks that compare performance of
Alpha EV5 vs in-order Cortex-A ?
LaTeX benchmark results (lower is better):
Alpha:
- 21164 600 MHz CPU, 2M L3-Cache, Redhat-Linux (a5)               8.1
ARM A32/T32:
- Raspberry Pi 3, Cortex A53 1.2GHz, Raspbian 8                   5.46
ARM A64:
- Rockpro64 (1416MHz Cortex A53), Debian 9 (Stretch)              3.24
- Odroid N2 (1896MHz Cortex A53), Ubuntu 18.04                    2.488
- Odroid C2 (1536MHz Cortex A53), Ubuntu 16.04                    2.32
- Rock 5B (1805MHz A55), Debian 11 (texlive-latex-recommended)    2.105
A problem with the LaTeX benchmark is that its performance is
significantly influenced by the LaTeX installation (newer versions
need more instructions, and having more packages needs more
instructions). But these are the only benchmark results I have.
- anton
Thank you.
Two 64-bit A53 results are about the same as EV5 clock for clock, and one
result is significantly better.
So, either wide in-order is indeed not a bright idea, or the 21164 suffers
because of the inferiority of the Alpha ISA relative to ARM64.
Hard to tell from these results. In addition to the problems
mentioned above there are also differences in cache configuration to
consider. And the A55 does quite a bit better in IPC than the A53,
although it superficially has the same resources.
The A5x cores are pretty old now. Do you have any results for
Neoverse-V2, at say 2Ghz?
EricP <ThatWouldBeTelling@thevillage.com> posted:
Michael S wrote:
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
My 66000 does not have a TSO memory system, but when one of these
things shows up, it goes sequential consistency, and when it is
done it flips back to causal consistency.
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
WOW, they wrote article of 7 pages without even one time mentioning
avoidance of RFO (read for ownership) which is an elephant in the
room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to
overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
With weaker MOM the core has option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like Neoverse
N1, are able to do it.
I can imagine heroic microarchitecture that achieves the same effect
with TSO, but it seems that so far nobody did it.
I don't see how a ReadForOwnership message can be avoided as it
transfers two things: the ownership state, and the current line data.
InvalidateForOwnership {i.e., CI followed by Allocate Cache line and
start writing}
Even if the core knows the whole cache line is being overwritten and
doesn't need the line data, it still needs the Owned state transfer.
Which it can get by telling everyone else to lose that cache line.
There would still be a request message, say TakeOwner TKO which
has a smaller reply GiveOwner GVO message and just moves the state.
So the reply is a few less flits.
As I understand it...
Independent of the ReadForOwnership message, the ARM weak coherence model
should allow stores to other cache lines to proceed, whereas TSO would
require younger stores to (appear to) wait until the older store completes.
Which is why TSO is cycle wasteful.
The weak coherence model allows the cache to use hit-under-miss for
stores because it doesn't require the store order to different locations
be seen in program order. This allows it to overlap younger store cache
hits with the older ReadForOwnership message, not eliminate it.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
"All swans are white" has been "experimentally verified" by finding one >>>> white swan. The existence of black swans shows that such
"experimental verifications" are fallacies.
If you have any counter-examples, please feel free to cite them.
The fact that nobody had counterexamples for the theory "All swans are
white" for a long while did not make that theory true.
But I have actually seen black swans in Australia and elsewhere.
Here's a picture: <https://en.wikipedia.org/wiki/File:Black_Swan_in_Flight_Crop.jpg>
Your black swan analogy is a red herring. There is no theory of
evolution that says swans have to be white, and this is indeed a
very rare color for water birds.
One would argue that maybe prefixes are themselves wonky, but otherwise
one needs:
Instructions that can directly encode the presence of large immediate values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are effectively retro-causal, and for any fetch to be safe at the end of a
cache line one would need to prove the non-existence of a suffix; so
better to not go there).
For the most part, superscalar works the same either way, with similar efficiency. There is a slight efficiency boost if it would be possible
to dynamically reshuffle ops during fetch. But, this is not currently a thing in my case.
This latter case would apply if, say, a MEM op is followed by non-
dependent ALU ops, which under current superscalar handling they will
not co-execute, but it could be possible in theory to swap the ops and
allow them to co-execute.
...
- anton
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
Well, Mitch claims average 35 bits per instructions, that means about
90% utilization of decoders, so not bad.
His minimum instruction size is 32 bits, but I was going for 16 bits.
BTW, my understanding of Mitch's design is that this is related to
instruction complexity: if you support 16-bit instructions, it means you
support instructions which presumably don't do very much work because
it's hard to express a lot of "work to do" in 16 bits.
A bit of statistics on that.
The % associations you measured above might just be coincidence.
Using a primitive Perl script to catch occurrences, on a recent
My 66000 compiler, of the shape
[op] Ra,Ra,Rb
[op] Ra,Rb,Ra
[op] Ra,#n,Ra
[op] Ra,Ra,#n
[op] Ra,Rb
where |n| < 32, which could be a reasonable approximation of a
compressed instruction set, yields 14.9% (Perl), 16.6% (gnuplot)
and 23.9% (GSL) of such instructions. Potential space savings
would be a bit less than half that.
Better compression schemes are certainly possible, but I think the
disadvantages of having more complex encodings outweigh any
potential savings in instruction size.
I have assumed for a compiler to choose between two instruction formats,
a 2-register Rsd1 = Rsd1 OP Rs2, and a 3-register Rd1 = Rs2 OP Rs3,
that the register allocator would check if either operand was alive after
the OP, and if not then that source register can be reused as the dest.
For some ISA that may allow a shorter instruction format to be used.
Compilers will try to re-use registers as much as possible, in
other words, to avoid dead registers. If the compiler determines
that, for the pseudo registers V1, V2 and V3,
V1 = V2 - V3;
V2 is no longer live after that statement, it will assign
the same hard register to V1 and V2 (unless there are other
considerations such as function return values) which will then
either be translated into
add r1,r1,-r2
for a three-register instruction, or, for example, into
subq %rsi, %rax
Hmm... thinking of the statistics above, maybe I should have
included the minus signs.
Your stats above assume the compiler is performing this optimization
but since My 66000 does not have short format instructions the compiler
would have no reason to do so. Or the compiler might be doing this
optimization anyways for other ISA such as x86/x64 which do have
shorter formats.
So the % numbers you measured might just be coincidence and could be low.
An ISA with both short 2- and long 3-register formats like RV, where there
is an incentive to do this optimization, might provide stats confirmation.
RISC-V compressed mode also uses three-bit register numbers for
popular registers, all of which complicates decoding and causes
other problems which Mitch has explained previously.
So yes, a My 66000-like instruction set with compression might be
possible, but would almost certainly not be realized.
On Tue, 30 Dec 2025 17:27:22 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I did not write anything about the clue of Apple. I don't know much
about the CPUs by Nvidia and Fujitsu. But if there was significant
performance to be had by adding a weakly-ordered mode, wouldn't
especially Fujitsu with its supercomputer target have done it?
Fujitsu had a very strong reason to implement TSO on A64FX - source-level
compatibility with SPARC64 VIIIfx and XIfx.
I wouldn't be surprised if apart from that they have SPARC->ARM Rosetta
for some customers, but that's relatively minor factor. Supercomputer
users are considered willing to recompile their code. But much less
willing to re-write it.
Besides, as I mentioned in my other post, the A64FX memory subsystem is slow
(latency-wise; throughput-wise it is very good).
I don't know what
influence that fact has, but I can hand-wave that it shifts the balance
of cost toward TSO.
Also, cache lines are unusually wide (256B), so it
is possible that RFO shortcuts allowed by weaker MOM are less feasible.
Anton Ertl [2025-12-30 17:15:59] wrote:
Any sequence of stores without intervening loads can be turned into
one store under sequential consistency, and therefore also under the
weaker TSO. Doing that for a sequence that stores into one cache line
does not appear particularly heroic to me. The question is how much
benefit one gets from this optimization.
But the stores may be interleaved with loads from other locations!
It's quite common to have a situation where a sequence of stores
initializes a new object and thus overwrites a complete cache line, but
that initialization sequence needs to read from memory (e.g. from the
stack).
Maybe compilers can be taught to group such writes to try and avoid
the problem?
E.g., collect the loads into locals and then emit the stores back-to-back
(x being the object under initialization):
    a = y->a;
    ...
    b = y->b;
    x->a = a;
    x->b = b;
scott@slp53.sl.home (Scott Lurndal) writes:
Do you have any results for
Neoverse-V2, at say 2Ghz?
No. You can find other results at
https://www.complang.tuwien.ac.at/franz/latex-bench
- anton
EricP <ThatWouldBeTelling@thevillage.com> posted:
John Savard wrote:
On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
Not a typo--the part of the pipeline which is <dynamically> narrowest is
the part that limits performance. I suggest strongly that you should not
make/allow the decoder to play that part.
I agree - and strongly, too - that the decoder ought not to be the part
that limits performance.
But what I quoted says that the execution unit ought not to be the part
that limits performance, with the implication that it's OK if the decoder
does instead. That's why I said it must be a typo.
So I think you need to look a second time at what you wrote; it's natural
for people to see what they expect to see, and so I think you looked at
it, and didn't see the typo that was there.
John Savard
There are two kinds of stalls:
stalls in the serial front end I-cache, Fetch or Decode stages because
of *too little work* (starvation due to input latency),
and stalls in the back end Execute or Writeback stages because
of *too much work* (resource exhaustion).
DECODE latency increases when:
a) there is no instruction(s) to decode
b) there is no address from which to fetch
c) when there is no translation of the fetch address
a) is a cache miss
b) is an indirect control transfer
c) is a TLB miss
And there may be additional cases of instruction buffer hiccups.
The front end stalls inject bubbles into the pipeline,
whereas back end stalls can allow younger bubbles to be compressed out.
How In-Order your thinking is. GBOoO machines do not inject bubbles.
If I have to stall, I want it in the back end.
If I have to stall I want it based on "realized" latency.
It has to do with catching up after a stall.
Which is why you do not inject bubbles...
If a core stalls for 3 clocks, then in order to average 1 IPC
it must retire 2 instructions per clock for the next 3 clocks.
And it can only do that if it has a backlog of work ready to execute.
The packing algorithm for RV or similar is more complicated because
it uses different register set sizes, RV is 31 or 7 with a zero reg.
Looking at the "18 RISC-V Compressed ISA V1.9" specification
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
Any OoO machine is also likely to have a lot of RAM and a decent sized
I$, so much of any benefit is likely to go away in this case.
s/go away/greatly ameliorated/
------------------------
ILP is a property of a program. I assume that what you mean is that
the IPC benefits of more width have quickly diminishing returns on in-order machines.
The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is affected by the processor implementation.
I agree that ILP is more aligned with code than with program.
{see above example where 1 instruction does the work of 5}
Michael S <already5chosen@yahoo.com> writes:
On Tue, 30 Dec 2025 17:27:22 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I did not write anything about the clue of Apple. I don't know
much about the CPUs by Nvidia and Fujitsu. But if there was
significant performance to be had by adding a weakly-ordered mode,
wouldn't especially Fujitsu with its supercomputer target have
done it?
Fujitsu had very strong reason to implement TSO on A64FX -
source-level compatibility with SPARC64 VIIIfx and XIfx.
Apple also had good reason to implement TSO on M1: AMD64->ARM A64
binary translation (Rosetta). They chose to add a slower TSO mode to
their weak memory system, which is not surprising given that they had
a working weak memory system, and it is relatively easy to implement
TSO on that (with a performance penalty).
I wouldn't be surprised if apart from that they have SPARC->ARM
Rosetta for some customers, but that's a relatively minor factor.
>Supercomputer users are considered willing to recompile their code.
But much less willing to re-write it.
While supercomputer users may not be particularly willing to rewrite
their code, they are much more willing than anyone else, because in supercomputing, hardware cost still is higher than software cost.
If there was an easy way to offer "5-10% more performance" to those
users willing to write or use software written for weak memory models
by adding a weak memory mode to A64FX, I would be very surprised if
they would have passed. So I conclude that it's not easy to turn
their memory model into a weak one and gain performance.
Concerning their SPARC implementations: The SPARC architecture
specifies both TSO and a weak memory model. Does your comment about
SPARC64 VIIIfx and XIfx mean that Fujitsu only implemented TSO on
those CPUs and that when you asked for the weak mode on those CPUs,
you still got TSO? That would be the counterexample that Thomas
Koenig asked for.
Besides, as I mentioned in my other post, A64fx memory subsystem is
slow (latency-wise, throughput wise it is very good).
Sounds to me like it is designed for a supercomputer.
I don't know what
influence that fact has, but I can hand-wave that it shifts the
balance of cost toward TSO.
Can you elaborate on that?
Also, cache lines are unusually wide (256B), so it
is possible that RFO shortcuts allowed by weaker MOM are less
feasible.
Why should that be?
A particular aspect here is that RFO is rare in applications with good temporal locality. Supercomputer applications tend to have relatively
bad temporal locality and will see RFO more often.
- anton
Stefan Monnier <monnier@iro.umontreal.ca> writes:
Anton Ertl [2025-12-30 17:15:59] wrote:
Any sequence of stores without intervening loads can be turned into
one store under sequential consistency, and therefore also under the
weaker TSO. Doing that for a sequence that stores into one cache line
does not appear particularly heroic to me. The question is how much
benefit one gets from this optimization.
>But the stores may be interleaved with loads from other locations!
>It's quite common to have a situation where a sequence of stores
>initializes a new object and thus overwrites a complete cache line, but
>that initialization sequence needs to read from memory (e.g. from the
>stack).
>Maybe compilers can be taught to group such writes to try and avoid
>the problem?
What compilers can do depends on the programming language. But this
"read from stack" idea is curious. We have had good register
allocators for several decades, so local variables tend to reside in registers, not on some stack. Parameters also tend to reside in
registers. So if you have a C initializing function
void init_foo(foo_t *foo, long a, long b, /* ... */)
{
  size_t i;
  foo->a = a;
  foo->b = b;
  for (i=0; i<FOO_C_ELEMS; i++)
    foo->c[i] = 0;
}
it is unlikely that there will be loads between the stores.
In other cases you can reorder the loads and stores by hand. Instead of

  x->a = y->a;
  ...
  x->b = y->b;

you can do

  long a = y->a;
  long b = y->b;
  ...
  x->a = a;
  ...
  x->b = b;
This kind of reordering only needs to be performed where it eliminates
many RFOs, and is much easier to get correct than not-too-slow code
for weak memory models that has to be done everywhere where shared
written-to memory is accessed and has to be correct everywhere (and not-too-slow everywhere that is executed frequently).
- anton
MitchAlsup wrote:
EricP <ThatWouldBeTelling@thevillage.com> posted:
John Savard wrote:
On Sun, 21 Dec 2025 20:32:44 +0000, MitchAlsup wrote:
>>> I agree - and strongly, too - that the decoder ought not to be the part
>>> that limits performance.
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
>>>> Not a typo--the part of the pipeline which is <dynamically> narrowest is
>>>> the part that limits performance. I suggest strongly that you should not
>>>> make/allow the decoder to play that part.
>>>>>> Or in other words, if you can decode K-instructions per cycle, you'd
>>>>>> better be able to execute K-instructions per cycle--or you have a
>>>>>> serious blockage in your pipeline.
>>> But what I quoted says that the execution unit ought not to be the part
>>> that limits performance, with the implication that it's OK if the decoder
>>> does instead. That's why I said it must be a typo.
>>> So I think you need to look a second time at what you wrote; it's natural
>>> for people to see what they expect to see, and so I think you looked at
>>> it, and didn't see the typo that was there.
>>> John Savard
There are two kinds of stalls:
stalls in the serial front end I-cache, Fetch or Decode stages because
of *too little work* (starvation due to input latency),
and stalls in the back end Execute or Writeback stages because
of *too much work* (resource exhaustion).
DECODE latency increases when:
a) there is no instruction(s) to decode
b) there is no address from which to fetch
c) when there is no translation of the fetch address
a) is a cache miss
b) is an indirect control transfer
c) is a TLB miss
And there may be additional cases of instruction buffer hiccups.
Yes. Also Decode generated stalls - pipeline drain.
Rename stall for new dest register pool exhaustion.
The front end stalls inject bubbles into the pipeline,
whereas back end stalls can allow younger bubbles to be compressed out.
How In-Order your thinking is. GBOoO machines do not inject bubbles.
You get bubbles if you overload their resources no matter how GB it is.
For example, if all the reservation stations for a FU are in use then Dispatch has to stall, which stalls the whole front end.
A compacting pipeline in the front end can compress out those bubbles
but it eventually stalls too.
Dependency stalls - all the uOps in reservation stations are waiting
on other results. Serialization stalls.
If a design is doing dynamic register file read port assignment and
runs out of read ports. Resource exhaustion stalls.
Multiple uOps are ready but only one can launch. Scheduling stalls.
If I have to stall, I want it in the back end.
If I have to stall I want it based on "realized" latency.
It has to do with catching up after a stall.
Which is why you do not inject bubbles...
It's not me doing it. I blame the speed of light.
If a core stalls for 3 clocks, then in order to average 1 IPC
it must retire 2 instructions per clock for the next 3 clocks.
And it can only do that if it has a backlog of work ready to execute.
Michael S <already5chosen@yahoo.com> writes:
On Tue, 30 Dec 2025 17:27:22 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I did not write anything about the clue of Apple. I don't know
much about the CPUs by Nvidia and Fujitsu. But if there was
significant performance to be had by adding a weakly-ordered mode,
wouldn't especially Fujitsu with its supercomputer target have
done it?
Fujitsu had very strong reason to implement TSO on A64FX -
source-level compatibility with SPARC64 VIIIfx and XIfx.
BTW, can you find a proof link for A64FX being TSO.
My understanding is that you learned it from Jonathan Corbet who in turn
learned it from Hector Martin.
But what is the source of Hector Martin?
I certainly don't see it in Fujitsu's "A64FX Microarchitecture Manual"
or in the Datasheet.
Michael S <already5chosen@yahoo.com> writes:
BTW, can you find a proof link for A64FX being TSO.
My understanding is that you learned it from Jonathan Corbet who in
turn learned it from Hector Martin.
Correct.
But what is the source of Hector Martin?
I certainly don't see it in Fujitsu's "A64FX Microarchitecture
Manual" or in the Datasheet.
In that case, I would ask Hector Martin, if I wanted proof.
- anton
I don't know Hector Martin. Is he on Usenet?
On 12/30/2025 4:58 PM, Chris M. Thomasson wrote:
On 12/30/2025 11:10 AM, BGB wrote:
[...]
But, then again, weak model is cheaper to implement and generally
faster, although explicit synchronization is annoying and such a
model is incompatible with "lock free data structures" (which tend to
implicitly assume that memory accesses occur in the same order as
written and that any memory stores are immediately visible across
threads).
Fwiw, a weak memory model is totally compatible with lock-free data
structures. A weak model tends to have the necessary memory barriers
to make them work. Have you ever used a SPARC in RMO mode? Acquire
membar ala std::memory_order_acquire is basically a MEMBAR #LoadStore
| #LoadLoad. A release is MEMBAR #LoadStore | #StoreStore. Those can
be used for the implementation of a mutex. Notice how acquire and
release never need #StoreLoad ordering?
The point is that once we have this flexibility, a lock/wait free algo
can use the right membars for the job. Ideally, the weakest membars
they can use to ensure they are correct in their logic.
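As a concrete sketch of that acquire/release point (not from the thread; a minimal C++ illustration using std::atomic, with hypothetical class and function names), a test-and-set spinlock needs nothing stronger than acquire on the way in and release on the way out:

  #include <atomic>

  // Minimal test-and-set spinlock: acquire ordering to take the lock,
  // release ordering to drop it; no #StoreLoad-style full fence anywhere.
  class spinlock {
      std::atomic<bool> locked{false};
  public:
      void lock() {
          // acquire: loads/stores in the critical section cannot be
          // hoisted above the point where the lock is observed as taken
          while (locked.exchange(true, std::memory_order_acquire)) {
              // spin; a real implementation would back off or yield here
          }
      }
      void unlock() {
          // release: loads/stores in the critical section cannot sink
          // below the store that frees the lock
          locked.store(false, std::memory_order_release);
      }
  };

That lines up with the MEMBAR description above: the acquire side is #LoadStore | #LoadLoad, the release side is #LoadStore | #StoreStore, and #StoreLoad is never required on this path.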
Usually IME the people writing lock-free code don't use memory barriers
or similar though. A lot of times IME, it is people just using volatile
or similar and trying to write things in a way that it (hopefully) won't
go terribly wrong if two threads hit the same data at the same time.
Like, the sort of code that works on a PC running Windows or similar,
but try to port it to Linux on an ARM machine, and it explodes.
Where, say, using volatile isn't sufficient for multiple cores with a
weak model. One would need either to use barriers (though, in my case, barriers will also be slow), non-cached memory accesses, or explicit cache-line flushing.
In this case, this leaves it often preferable to use bulk mostly read-
only data sharing. Or, passing along data via buffers or messages (with
some level of basic flow control).
So, not so much "let's have two threads share a doubly-linked list and
hope it doesn't all turn into a train wreck", and more "will copy
messages onto the end of a circular buffer and advance the roving
pointers; manually flushing the lines corresponding to the parts of the
buffer that have been updated in the process".
Say, for example:
  void _flushbuffer(void *data, size_t sz)
  {
    char *ct, *cte;
    ct=data; cte=ct+sz;
    while(ct<cte)
      { __mem_flushaddr(ct); ct+=LINESIZE; }
  }
  void _memcpyout_flush(void *dst, void *src, size_t sz)
  {
    memcpy(dst, src, sz);
    _flushbuffer(dst, sz);
  }
  void _memcpyin_flush(void *dst, void *src, size_t sz)
  {
    _flushbuffer(src, sz);
    memcpy(dst, src, sz);
  }
  void _memcpy_flush(void *dst, void *src, size_t sz)
  {
    _flushbuffer(src, sz);
    memcpy(dst, src, sz);
    _flushbuffer(dst, sz);
  }
Where, in this case, normal memcpy + flushing is likely to be faster in
many cases than using non-cached memory.
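To make the "copy messages onto a circular buffer, flush, then advance the pointer" idea above concrete, here is a hypothetical sketch. The ring layout, names, and sizes (msg_ring, ring_send, MSG_SIZE, RING_SLOTS) are mine, not part of the code above; it reuses the _memcpyout_flush/_flushbuffer helpers from the snippet, and it assumes one producer, one consumer, and that flushing a line (in program order) is what publishes it to the other core on this sort of weak-model machine:

  #include <stddef.h>

  /* from the snippet above */
  void _flushbuffer(void *data, size_t sz);
  void _memcpyout_flush(void *dst, void *src, size_t sz);

  #define MSG_SIZE   64
  #define RING_SLOTS 256

  typedef struct {
      char data[RING_SLOTS][MSG_SIZE];
      volatile unsigned head;   /* written by the producer */
      volatile unsigned tail;   /* written by the consumer */
  } msg_ring;

  /* returns 0 if the ring is full, 1 if the message was queued */
  int ring_send(msg_ring *r, void *msg)
  {
      unsigned h = r->head, t = r->tail;
      if (((h + 1) % RING_SLOTS) == t)
          return 0;
      /* copy the payload and flush the lines it touches */
      _memcpyout_flush(r->data[h], msg, MSG_SIZE);
      r->head = (h + 1) % RING_SLOTS;
      /* flush the line holding the index only after the payload lines,
         so (assuming flushes go out in program order) the consumer never
         sees the new head before the new data */
      _flushbuffer((void *)&r->head, sizeof(r->head));
      return 1;
  }

The consumer side would mirror this with _memcpyin_flush (flush/invalidate before reading), per the helpers above.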
On 12/30/2025 12:08 PM, MitchAlsup wrote:
EricP <ThatWouldBeTelling@thevillage.com> posted:
Michael S wrote:
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
>>>>>>>> My 66000 does not have a TSO memory system, but when one of these
>>>>>>>> things shows up, it goes sequential consistency, and when it is
>>>>>>>> done it flips back to causal consistency.
>>>>>>>> TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
>>>>>> WOW, they wrote article of 7 pages without even one time mentioning
>>>>>> avoidance of RFO (read for ownership) which is an elephant in the
>>>>>> room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially
did not own. Under TSO rules, like x86, the only [non heroic] ways to
overwrite the line without reading its previous content (which could
easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
With weaker MOM the core has option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like
Neoverse
N1, are able to do it.
I can imagine heroic microarchitecture that achieves the same effect
with TSO, but it seems that so far nobody did it.
I don't see how a ReadForOwnership message can be avoided as it
transfers two things: the ownership state, and the current line data.
InvalidateForOwnership {i.e., CI followed by Allocate Cache line and
start writing}
Even if the core knows the whole cache line is being overwritten and
doesn't need the line data, it still needs the Owned state transfer.
Which it can get by telling everyone else to lose that cache line.
There would still be a request message, say TakeOwner TKO which
has a smaller reply GiveOwner GVO message and just moves the state.
So the reply is a few less flits.
As I understand it...
Independent of the ReadForOwnership message, the ARM weak coherence
model
should allow stores to other cache lines to proceed, whereas TSO would
require younger stores to (appear to) wait until the older store
completes.
Which is why TSO is cycle wasteful.
Wrt to TSO aka x86/x64..., even that is NOT strong enough to get
#StoreLoad, aka ordering a store followed by a load to another location.
You need a LOCK'ed RMW or the MFENCE instruction.
The weak coherence model allows the cache to use hit-under-miss for
stores because it doesn't require the store order to different locations
be seen in program order. This allows it to overlap younger store cache
hits with the older ReadForOwnership message, not eliminate it.
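For a concrete picture of the whole-line-overwrite case Michael S describes above (purely illustrative; the names are mine), a cache-line-sized record that a core rewrites completely, never reading the old contents, is exactly where fetching the stale line via RFO is pure overhead, and where a weaker model gives the core the most freedom to merge the narrow stores into one full-line write:

  #include <cstdint>

  // 64-byte, cache-line-aligned record, fully overwritten by 8 narrow stores
  struct alignas(64) record {
      std::int64_t f[8];
  };

  void rewrite(record *r, std::int64_t v)
  {
      for (int i = 0; i < 8; i++)
          r->f[i] = v + i;   // covers all 64 bytes; *r is never loaded
  }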
On 12/30/2025 3:51 PM, Chris M. Thomasson wrote:
On 12/30/2025 12:08 PM, MitchAlsup wrote:
EricP <ThatWouldBeTelling@thevillage.com> posted:
Michael S wrote:
On Tue, 30 Dec 2025 10:44:10 -0500
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Michael S wrote:
On Tue, 30 Dec 2025 08:30:08 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
>>>>>>>>> My 66000 does not have a TSO memory system, but when one of these
>>>>>>>>> things shows up, it goes sequential consistency, and when it is
>>>>>>>>> done it flips back to causal consistency.
>>>>>>>>> TSO is cycle-wasteful.
This has been experimentally verified for Apple Silicon:
https://www.sra.uni-hannover.de/Publications/2024/wrenger_24_jsa.pdf
>>>>>>> WOW, they wrote article of 7 pages without even one time mentioning
>>>>>>> avoidance of RFO (read for ownership) which is an elephant in the
>>>>>>> room of discussion of advantages of Arm MOM/MCM over TSO.
What is this "avoidance of RFO"?
I can find no mention of it anywhere.
Imagine code that overwrites the whole cache line that core initially >>>>> did not own. Under TSO rules, like x86, the only [non heroic] ways to >>>>> overwrite the line without reading its previous content (which could >>>>> easily mean reading from DRAM) are
- aligned AVX512 store
- rep movs/rep stos
With weaker MOM the core has option of delayed merging of multiple
narrow stores. I think that even relatively old ARM cores, like
Neoverse
N1, are able to do it.
I can imagine heroic microarchitecture that achieves the same effect >>>>> with TSO, but it seems that so far nobody did it.
I don't see how a ReadForOwnership message can be avoided as it
transfers two things: the ownership state, and the current line data.
InvalidateForOwnership {i.e., CI followed by Allocate Cache line and
start writing}
Even if the core knows the whole cache line is being overwritten and
doesn't need the line data, it still needs the Owned state transfer.
Which it can get by telling everyone else to lose that cache line.
There would still be a request message, say TakeOwner TKO which
has a smaller reply GiveOwner GVO message and just moves the state.
So the reply is a few less flits.
As I understand it...
Independent of the ReadForOwnership message, the ARM weak coherence
model
should allow stores to other cache lines to proceed, whereas TSO would >>>> require younger stores to (appear to) wait until the older store
completes.
Which is why TSO is cycle wasteful.
Wrt to TSO aka x86/x64..., even that is NOT strong enough to get
#StoreLoad, aka ordering a store followed by a load to another
location. You need a LOCK'ed RMW or the MFENCE instruction.
The weak coherence model allows the cache to use hit-under-miss for
stores because it doesn't require the store order to different
locations
be seen in program order. This allows it to overlap younger store cache >>>> hits with the older ReadForOwnership message, not eliminate it.
Think of something simple. You want to publish a pointer to other threads.
foo* global = nullptr;
producer:
  foo* p = create();
  membar_release();
  atomic_store(&global, p);
consumers:
  foo* p = atomic_load(&global);
  if (p)
  {
    membar_acquire();
    p->bar();
  }
A simple pattern...
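Spelled out in standard C++ (a sketch; foo/create/bar are taken from the pseudo-code above, the rest is just std::atomic, and the release store folds the separate membar_release + atomic_store pair into one operation):

  #include <atomic>

  struct foo { void bar(); };
  foo *create();

  std::atomic<foo *> global{nullptr};

  void producer()
  {
      foo *p = create();
      // release: everything that built *p is visible before the pointer is
      global.store(p, std::memory_order_release);
  }

  void consumer()
  {
      foo *p = global.load(std::memory_order_acquire);
      if (p)
          p->bar();   // the acquire load orders this after seeing p
  }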
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then >>>>> reduces ILP due to register conflicts. So, smaller code at the expense >>>>> of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a
compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO
implementations. On the contrary, write-after-read may be beneficial
by releasing the old physical register for the register name. And
designing a compressed CPU instruction set for in-order processing is
not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
It is only 2 words
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
This should be::
AUPIC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,Ri
LDD R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum
Any OoO machine is also likely to have a lot of RAM and a decent sized
I$, so much of any benefit is likely to go away in this case.
s/go away/greatly ameliorated/
------------------------
ILP is a property of a program. I assume that what you mean is that
the IPC benefits of more width have quickly diminishing returns on
in-order machines.
The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is affected by the processor implementation.
I agree that ILP is more aligned with code than with program.
{see above example where 1 instruction does the work of 5}
On 12/30/2025 9:21 PM, BGB wrote:
On 12/30/2025 4:58 PM, Chris M. Thomasson wrote:
On 12/30/2025 11:10 AM, BGB wrote:
[...]
But, then again, weak model is cheaper to implement and generally
faster, although explicit synchronization is annoying and such a
model is incompatible with "lock free data structures" (which tend
to implicitly assume that memory accesses occur in the same order as
written and that any memory stores are immediately visible across
threads).
Fwiw, a weak memory model is totally compatible with lock-free data
structures. A weak model tends to have the necessary memory barriers
to make them work. Have you ever used a SPARC in RMO mode? Acquire
membar ala std::memory_order_acquire is basically a MEMBAR #LoadStore
| #LoadLoad. A release is MEMBAR #LoadStore | #StoreStore. Those can
be used for the implementation of a mutex. Notice how acquire and
release never need #StoreLoad ordering?
The point is that once we have this flexibility, a lock/wait free
algo can use the right membars for the job. Ideally, the weakest
membars they can use to ensure they are correct in their logic.
Usually IME the people writing lock-free code don't use memory
barriers or similar though. A lot of times IME, it is people just
using volatile or similar and trying to write things in a way that it
(hopefully) won't go terribly wrong if two threads hit the same data at
the same time.
Like, the sort of code that works on a PC running Windows or similar,
but try to port it to Linux on an ARM machine, and it explodes.
Where, say, using volatile isn't sufficient for multiple cores with a
weak model. One would need either to use barriers (though, in my case,
barriers will also be slow), non-cached memory accesses, or explicit
cache-line flushing.
In this case, this leaves it often preferable to use bulk mostly read-
only data sharing. Or, passing along data via buffers or messages
(with some level of basic flow control).
So, not so much "let's have two threads share a doubly-linked list and
hope it doesn't all turn into a train wreck", and more "will copy
messages onto the end of a circular buffer and advance the roving
pointers; manually flushing the lines corresponding to the parts of
the buffer that have been updated in the process".
Say, for example:
  void _flushbuffer(void *data, size_t sz)
  {
    char *ct, *cte;
    ct=data; cte=ct+sz;
    while(ct<cte)
      { __mem_flushaddr(ct); ct+=LINESIZE; }
  }
  void _memcpyout_flush(void *dst, void *src, size_t sz)
  {
    memcpy(dst, src, sz);
    _flushbuffer(dst, sz);
  }
  void _memcpyin_flush(void *dst, void *src, size_t sz)
  {
    _flushbuffer(src, sz);
    memcpy(dst, src, sz);
  }
  void _memcpy_flush(void *dst, void *src, size_t sz)
  {
    _flushbuffer(src, sz);
    memcpy(dst, src, sz);
    _flushbuffer(dst, sz);
  }
Where, in this case, normal memcpy + flushing is likely to be faster
in many cases than using non-cached memory.
Huh? Humm... This is way out of bounds. Yikes. I am talking about
knowing when to use the right membars in the right places. Are you at
least familiar with std::memory_order_* in C++?
On 2025-12-31 12:12 p.m., MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller
register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense >>>>> of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to
use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that
occur in instructions where such a register allocation may lead to a >>>> compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO >>>> implementations. On the contrary, write-after-read may be beneficial >>>> by releasing the old physical register for the register name. And
designing a compressed CPU instruction set for in-order processing is >>>> not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
It is only 2 words
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
This should be::
AUPIC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,Ri
LDD R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum
An even fatter ISA (Qupls4) in theory:
LOAD r7, disp56(ip+r3*8)
1 instruction + 1 postfix = 2 words (96 bits) 1 cycle + cache hit minimum
The ISA is becoming a bit more stable now; the latest change was for constant postfix instructions. Qupls used to have a somewhat convoluted means of addressing constants on the cache-line. Now it's just
postfixes. The constant routing information is in the postfix now which
uses four bits. Two to select a register override, two to select
constant quadrant. So, postfixes extend constants in the instruction (or previous postfix) by 36 bits.
Qupls can do
ADD r7, r8, $64_bit_constant
Using only two words (96 bits) and just a single cycle.
I prefer to use multiply '*' rather than shift in scaled indexed addressing as a couple of CPUs had multiply by five and ten in addition
to 1,2,4,8. What if one wants to scale by 3?
It is also possible to encode 128-bit constants, but the current implementation does not support them.
Managed to get to some early synthesis trials and found the instruction dispatch to be on the critical timing path. I am a bit stumped as to how
to improve it as it is very simple already. It just copies from one set
of pipeline registers to another headed towards the reservation
stations. Tools report timing good to 37 MHz, I was shooting for at
least 40.
Found a couple of spots where the code was simple but too slow. One in dynamic register selection. The code was packing the register selections
to a minimum. But that was way too many logic levels.
It is quite an art to get something working in minimum clock cycles and
fast clock frequency.
Any OoO machine is also likely to have a lot of RAM and a decent sized >>> I$, so much of any benefit is likely to go away in this case.
s/go away/greatly ameliorated/
------------------------
ILP is a property of a program. I assume that what you mean is that >>>> the IPC benefits of more width have quickly diminishing returns on
in-order machines.
The ILP is a property of the code, yes, but how much exists, and how
much of it is actually usable, is effected by the processor implementation.
I agree that ILP is more aligned with code than with program.
{see above example where 1 instruction does the work of 5}
Robert Finch <robfi680@gmail.com> posted:
On 2025-12-31 12:12 p.m., MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller >>>>>>> register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense >>>>>>> of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to >>>>>> use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that >>>>>> occur in instructions where such a register allocation may lead to a >>>>>> compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO >>>>>> implementations. On the contrary, write-after-read may be beneficial >>>>>> by releasing the old physical register for the register name. And >>>>>> designing a compressed CPU instruction set for in-order processing is >>>>>> not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to
bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
It is only 2 words
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
This should be::
AUPIC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,Ri
LDD R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum
An even fatter ISA (Qupls4) in theory:
LOAD r7, disp56(ip+r3*8)
I could have shown the DISP64 version--3-words
1 instruction + 1 postfix = 2 words (96 bits) 1 cycle + cache hit minimum
The ISA is becoming a bit more stable now; the latest change was for
constant postfix instructions. Qupls used to have a somewhat convoluted
means of addressing constants on the cache-line. Now it's just
postfixes. The constant routing information is in the postfix now which
uses four bits. Two to select a register override, two to select
constant quadrant. So, postfixes extend constants in the instruction (or
previous postfix) by 36 bits.
Qupls can do
ADD r7, r8, $64_bit_constant
Using only two words (96 bits) and just a single cycle.
So can My 66000, but everyone and his brother thinks 96-bits is 3 words.
I prefer to use multiply '*' rather than shift in scaled indexed
addressing as a couple of CPUs had multiply by five and ten in addition
to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
It is also possible to encode 128-bit constants, but the current
implementation does not support them.
Managed to get to some early synthesis trials and found the instruction
dispatch to be on the critical timing path. I am a bit stumped as to how
to improve it as it is very simple already. It just copies from one set
of pipeline registers to another headed towards the reservation
stations. Tools report timing good to 37 MHz, I was shooting for at
least 40.
Found a couple of spots where the code was simple but too slow. One in
dynamic register selection. The code was packing the register selections
to a minimum. But that was way too many logic levels.
Those are some of the driving inputs to "An architecture is as much about what gets left out as what gets put in."
It is quite an art to get something working in minimum clock cycles and
fast clock frequency.
Any OoO machine is also likely to have a lot of RAM and a decent sized >>>>> I$, so much of any benefit is likely to go away in this case.
s/go away/greatly ameliorated/
------------------------
ILP is a property of a program. I assume that what you mean is that >>>>>> the IPC benefits of more width have quickly diminishing returns on >>>>>> in-order machines.
The ILP is a property of the code, yes, but how much exists, and how >>>>> much of it is actually usable, is effected by the processor implementation.
I agree that ILP is more aligned with code than with program.
{see above example where 1 instruction does the work of 5}
On 1/1/2026 12:13 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2025-12-31 12:12 p.m., MitchAlsup wrote:
An even fatter ISA (Qupls4) in theory:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
It is only 2 words
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller >>>>>>>> register space, one needs to reuse registers more frequently, >>>>>>>> which then
reduces ILP due to register conflicts. So, smaller code at the >>>>>>>> expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to >>>>>>> use the uncompressed instruction.-a So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that >>>>>>> occur in instructions where such a register allocation may lead to a >>>>>>> compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of >>>>>>> OoO
implementations.-a On the contrary, write-after-read may be
beneficial
by releasing the old physical register for the register name.-a And >>>>>>> designing a compressed CPU instruction set for in-order
processing is
not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to >>>>>> bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
      LDD    R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against
      AUPIC  Rt,lo(DISP32)
      SLL    Ri,R3,#3
      ADD    Rt,Rt,hi(DISP32)
      ADD    Rt,Rt,Ri
      LDD    R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit
minimum.
This should be::
      AUPIC  Rt,hi(DISP32)
      SLL    Ri,R3,#3
      ADD    Rt,Rt,Ri
      LDD    R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum >>>
LOAD r7, disp56(ip+r3*8)
I could have shown the DISP64 version--3-words
At 64-bits, displacements cease to make sense as a displacement.
Seems to make more sense to interpret these as [Abs64+Rb] rather than [Rb+Disp64].
Except, then I have to debate what exactly I would do if I decide to
allow this case in XG2/XG3.
As noted:
  [Rb+Disp10]: Usually scaled (excluding some special-cases);
  [Rb+Disp33]: Either scaled or unscaled.
    BGBCC is typically using unscaled displacements in this case.
      Unscaled range, +/- 4GB
      DW: +/- 16GB, QW: +/- 32GB
    XG2 and XG3 effectively left 1 bit extra, which indicates scale.
      0: Scaled by element size;
      1: Unscaled.
  [Rb+Disp64]: Would be understood as unscaled.
    TBD: Scale register (more likely to be useful, breaks symmetry);
    Unscaled register, preserves symmetry, but less likely useful.
      Would be consistent with the handling of RISC-V,
        which is always unscaled in this case.
    May be moot, as plain Abs64 would be the dominant case here.
1 instruction + 1 postfix = 2 words (96 bits) 1 cycle + cache hit
minimum
The ISA is becoming a bit more stable now; the latest change was for
constant postfix instructions. Qupls used to have a somewhat convoluted
means of addressing constants on the cache-line. Now it's just
postfixes. The constant routing information is in the postfix now which
uses four bits. Two to select a register override, two to select
constant quadrant. So, postfixes extend constants in the instruction (or >>> previous postfix) by 36 bits.
Qupls can do
ADD r7, r8, $64_bit_constant
Using only two words (96 bits) and just a single cycle.
So can My 66000, but everyone and his brother thinks 96-bits is 3 words.
So can XG2 and XG3.
  And, now, can add RV+JX to this category.
Though, I am likely to still consider 96-bit ops as an extension of JX
(as supporting them would be a much bigger burden on a 2-wide machine
with a 64-bit instruction fetch; would require a 2-wide machine to still support 96-bit fetch).
Well, and then there is another issue:
RV64GC + 96-bit encodings, reintroduces another potential problem that existed in XG1:
At certain alignments, the 96-bit fetch can cross a boundary of 2 half-
line fetches with a 16B line size.
Say, one letter per 16-bit word:
  AAAA-BBBB  //Line A
  CCCC-DDDD  //Line B
Then (low 4b of PC):
  0: AAAABB
  2: AAABBB
  4: AABBBB
  6: ABBBBC //Violates two half-lines
  8: BBBBCC
  A: BBBCCC
  C: BBCCCC
  E: BCCCCD //Violates two half-lines
Granted, the partial workaround is to fetch 144 bits internally (16 bits
past the end of the half-line), which does technically "fix" the problem
as far as architecturally-visible behavior is concerned.
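The two offending rows in the table can be expressed as a one-line check (a hypothetical helper, assuming 16-byte lines fetched as two 8-byte halves and a 12-byte, i.e. 96-bit, instruction):

  // true when a 96-bit instruction starting at 'pc' spills past the two
  // 8-byte fetch units it starts in, i.e. the 0x6 and 0xE rows above
  static inline bool spans_extra_halfline(unsigned long pc)
  {
      return ((pc & 7) + 12) > 16;
  }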
Or, just use the same "small brain" trick that BGBCC had used:
If free-form variable length instructions, insert a NOP pad if we step
on this turd;
Or, for code sequences where this turd would be unavoidable (running
through the WEXifier): Realign to 32 bits before entering WEX encoding (scenario can't happen if 32-bit aligned).
Arguably, the latter scenario wouldn't have applied to RISC-V (and my JX
encodings), except that (very recently) I did end up expanding BGBCC's
WEXifier mechanism to cover RISC-V and XG3 (even if its role is slightly
different in this case), which technically reintroduces the issue when
targeting RV64GC.
Though, currently, it is only enabled for RV64 if using RV64G and speed optimization.
In this case, since RV64G and XG3 don't use explicit bundling, its role
is instead to shuffle instructions to try to optimize how they fit in
the pipeline.
I prefer to use multiply '*' rather than shift in scaled indexed
addressing as a couple of CPUs had multiply by five and ten in addition
to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
Higher resource cost and latency is a concern...
It is also possible to encode 128-bit constants, but the current
implementation does not support them.
Managed to get to some early synthesis trials and found the instruction
dispatch to be on the critical timing path. I am a bit stumped as to how >>> to improve it as it is very simple already. It just copies from one set
of pipeline registers to another headed towards the reservation
stations. Tools report timing good to 37 MHz, I was shooting for at
least 40.
Found a couple of spots where the code was simple but too slow. One in
dynamic register selection. The code was packing the register selections >>> to a minimum. But that was way too many logic levels.
Those are some of the driving inputs to "An architecture is as much about
what gets left out as what gets put in."
Some amount of stuff I had added has ended up getting pruned again.
In my case, the FPGA doesn't get bigger, nor faster.
Adding a feature in one place may mean needing to drop some other lesser-used feature to free up resources or improve timing.
Sometimes, the features don't get entirely removed from the ISA, but
instead become a sort of lesser used "secondary feature":
  May be supported in hardware;
  Or, may be supported with trap-and-emulate.
Sometimes the features dropped are subtle edge cases, say:
  For JAL and JALR in RISC-V:
    Is Xd ever anything other than X0 or X1?
    In theory, yes.
    In practice: Not so much.
Starts to make sense to hard-wire the hardware support to only allow X0
and X1 and to treat other cases as trap-and-emulate.
Or, some other stuff within RV64G:
In general, makes sense, but even within RV64G there is stuff that is debatable whether it makes sense to try to support natively in hardware.
Like, the ideal version of the RV64 ISA in HW might look like:
  RV64I:
    Limit Xd for JAL and JALR to X0 and X1;
      If not X0/X1, trap-emulate.
    Mostly, feature-set of 'I' is "mostly sane".
      Well, excluding the encoding space inefficiency of JAL/LUI/AUIPC.
    Cheaper impl, makes sense to hard-wire Bcc's Rs2 to X0;
      Absent compiler support, very bad with trap-emulate.
  M:
    MULW makes sense;
      Kinda need a DMULW/DMULWU (32-bit widening multiply)
    DIVW: Used enough to be justifiable.
    MUL/DIV/REM: 64 bit forms not used often enough to justify.
      But, also not quite rare enough for trap-and-emulate.
  A:
    Better relegated to trap-and-emulate.
  F:
    Kinda unavoidable
    Would have preferred a Zfmin+D style approach as minimal case.
  D:
    It is what it is;
    FDIV.D and FSQRT.D and similar can be turned into traps.
  Zicsr:
    Most cases can trap (except what HW actually needs to support);
  Zifence:
    Trap-and-emulate.
  ...
Almost making sense to have a sort of system-level trap-vector-table:
-a TVTB: Trap Vector Table Base;
-a JTVT: Jump to Trap-Vector-Table.
Table would primarily be used for trap-and-emulate, where it could
become desirable to have a 32-bit encoding here. These would be assumed illegal in the user level ISA.
One possibility for encoding could, ironically, be to overload JAL or similar:
  JAL  Disp20, X2  => JTVT Disp20
Branching to TVTB+Disp20 rather than to PC+Disp20 (and then stomping the
stack pointer with the link value; can safely assume that this case is
otherwise invalid...). Implicit assumption that one has under 1MB of
handler thunks.
But, dunno.
...
It is quite an art to get something working in minimum clock cycles and
fast clock frequency.
Any OoO machine is also likely to have a lot of RAM and a decent
sized
I$, so much of any benefit is likely to go away in this case.
s/go away/greatly ameliorated/
------------------------
ILP is a property of a program.-a I assume that what you mean is that >>>>>>> the IPC benefits of more width have quickly diminishing returns on >>>>>>> in-order machines.
The ILP is a property of the code, yes, but how much exists, and how >>>>>> much of it is actually usable, is effected by the processor
implementation.
I agree that ILP is more aligned with code than with program.
{see above example where 1 instruction does the work of 5}
On 1/1/2026 12:13 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2025-12-31 12:12 p.m., MitchAlsup wrote:
An even fatter ISA (Qupls4) in theory:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
It is only 2 words
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller >>>>>>> register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to >>>>>> use the uncompressed instruction. So you may tune your RISC-V
compiler to prefer registers r8-r15 for those pseudo-registers that >>>>>> occur in instructions where such a register allocation may lead to a >>>>>> compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO >>>>>> implementations. On the contrary, write-after-read may be beneficial >>>>>> by releasing the old physical register for the register name. And >>>>>> designing a compressed CPU instruction set for in-order processing is >>>>>> not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to >>>>> bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against >>>
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
This should be::
AUPIC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,Ri
LDD R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum >>
LOAD r7, disp56(ip+r3*8)
I could have shown the DISP64 version--3-words
At 64-bits, displacements cease to make sense as a displacement.
Seems to make more sense to interpret these as [Abs64+Rb] rather than [Rb+Disp64].
Except, then I have to debate what exactly I would do if I decide to
allow this case in XG2/XG3.
------------------
I prefer to use multiply '*' rather than shift in scaled indexed
addressing as a couple of CPUs had multiply by five and ten in addition
to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
Higher resource cost and latency is a concern...
Sometimes the features dropped are subtle edge cases, say:
For JAL and JALR in RISC-V:
Is Xd ever anything other than X0 or X1?
In theory, yes.
In practice: Not so much.
On 2026-01-01 6:17 p.m., BGB wrote:
-----------
On 1/1/2026 12:13 PM, MitchAlsup wrote:
There is not much in the AGEN so the latency for a small multiplier is probably okay. I was thinking of supporting multiply by 3, handy for
RGB888 values, and multiply by six which is the size of an instruction. Maybe multiply by any value from 1 to 8 using three-bit encoding. It
would probably be okay to use another bit for scaling.
BGB <cr88192@gmail.com> posted:
On 1/1/2026 12:13 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
On 2025-12-31 12:12 p.m., MitchAlsup wrote:
An even fatter ISA (Qupls4) in theory:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
It is only 2 words
BGB <cr88192@gmail.com> posted:
On 12/30/2025 1:36 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 12/29/2025 12:35 PM, Anton Ertl wrote:[...]
One usual downside is that to utilize a 16-bit ISA with a smaller >>>>>>>>> register space, one needs to reuse registers more frequently, which then
reduces ILP due to register conflicts. So, smaller code at the expense
of worse performance.
For designs like RISC-V C and Thumb2, there is always the option to >>>>>>>> use the uncompressed instruction. So you may tune your RISC-V >>>>>>>> compiler to prefer registers r8-r15 for those pseudo-registers that >>>>>>>> occur in instructions where such a register allocation may lead to a >>>>>>>> compressed encoding.
Write-after-read and write-after-write does not reduce the IPC of OoO >>>>>>>> implementations. On the contrary, write-after-read may be beneficial >>>>>>>> by releasing the old physical register for the register name. And >>>>>>>> designing a compressed CPU instruction set for in-order processing is >>>>>>>> not a good idea for general-purpose computing.
Though, the main places where compressed instructions are likely to >>>>>>> bring meaningful benefit, is on small in-order machines.
Coincidentally; this is exactly where a fatter-ISA wins big::
compare::
LDD R7,[IP,R3<<3,DISP32]
1 instruction, 3 words, 0 wasted registers, cache-hit minimum--against >>>>>
AUPIC Rt,lo(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,hi(DISP32)
ADD Rt,Rt,Ri
LDD R7,0(Rt)
5 instructions, 4 words, 2-wasted registers, 4-cycles+cache hit minimum.
This should be::
AUPIC Rt,hi(DISP32)
SLL Ri,R3,#3
ADD Rt,Rt,Ri
LDD R7,lo(DISP32)(Rt)
4 instructions, 3 words, 2-wasted registers, 3-cycles+cache hit minimum >>>>
LOAD r7, disp56(ip+r3*8)
I could have shown the DISP64 version--3-words
At 64-bits, displacements cease to make sense as a displacement.
Seems to make more sense to interpret these as [Abs64+Rb] rather than
[Rb+Disp64].
I have heard arguments in both directions::
a) DISP64 only contains 33-bits of actual information
b) If DISP64 is absolute do you still need Rbase ??
when you have Rindex<<scale
c) how can the HW KNOW ?!?
Except, then I have to debate what exactly I would do if I decide to
allow this case in XG2/XG3.
------------------
I prefer to use multiply '*' rather than shift in scaled indexed
addressing as a couple of CPUs had multiply by five and ten in addition >>>> to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
Higher resource cost and latency is a concern...
Yes, your design is living on the edge.
-------------------------
Sometimes the features dropped are subtle edge cases, say:
For JAL and JALR in RISC-V:
Is Xd ever anything other than X0 or X1?
In theory, yes.
In practice: Not so much.
For reasons like this, I only have
CALL DISP26<<2 // call through DECODE
and
CALX [*address] // call through table
and
CALA [address] // call through AGEN
which prevents compiler and assembler abuse.
On 1/2/2026 12:48 PM, MitchAlsup wrote:
-----merciful snip----------
I have heard arguments in both directions::
a) DISP64 only contains 33-bits of actual information
b) If DISP64 is absolute do you still need Rbase ??
when you have Rindex<<scale
c) how can the HW KNOW ?!?
------------------
I prefer to use multiply '*' rather than shift in scaled indexed >>>> addressing as a couple of CPUs had multiply by five and ten in addition >>>> to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
Higher resource cost and latency is a concern...
Yes, your design is living on the edge.
I am not sure how it would be pulled off for larger displacements or
more general scales.
Say:
void _mem_cpy16bytes(void *dst, void *src)
{
byte *cs, *ct;
cs=src; ct=dst;
ct[ 0]=cs[ 0]; ct[ 1]=cs[ 1]; ct[ 2]=cs[ 2]; ct[ 3]=cs[ 3];
ct[ 4]=cs[ 4]; ct[ 5]=cs[ 5]; ct[ 6]=cs[ 6]; ct[ 7]=cs[ 7];
ct[ 8]=cs[ 8]; ct[ 9]=cs[ 9]; ct[10]=cs[10]; ct[11]=cs[11];
ct[12]=cs[12]; ct[13]=cs[13]; ct[14]=cs[14]; ct[15]=cs[15];
}
Is, slow...
The store-to-load forwarding penalty arises because LZ4 decompression
often involves copying memory on top of itself, and the possible
workarounds for this issue only offer competitive performance for blocks
that are much longer than the typical copy (in the common case of a
match under 20 bytes, it is often faster to just copy the bytes and eat
the cost).
if(dist>=16)
{
if(len>=20)
{ more generalized/faster copy }
else
{ just copy 20 bytes. }
}else
{
if(len>=20)
{ generate pattern and fill with stride }
else
{ copy 20 bytes over itself. }
}
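As a hedged sketch of that strategy in plain C (the 16/20 cutoffs follow the outline above; the function name is mine; the "generate pattern and fill with stride" branch is omitted and handled by the byte loop here):

  #include <string.h>
  #include <stddef.h>

  static void lz4_copy_match(unsigned char *dst, size_t dist, size_t len)
  {
      unsigned char *src = dst - dist;
      if (dist >= 16 && len >= 20) {
          /* wide path: 16 bytes at a time; with dist >= 16 no chunk
             overlaps itself, and overshooting the end is fine as long
             as the output buffer has slack */
          for (size_t i = 0; i < len; i += 16)
              memcpy(dst + i, src + i, 16);
      } else {
          /* short or self-overlapping match: byte-by-byte keeps the
             "copy the buffer over itself" semantics and, per the point
             above, is usually cheapest at these lengths anyway */
          for (size_t i = 0; i < len; i++)
              dst[i] = src[i];
      }
  }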
For reasons like this, I only have
CALL DISP26<<2 // call through DECODE
and
CALX [*address] // call through table
and
CALA [address] // call through AGEN
which prevents compiler and assembler abuse.
They went and defined that you can use any register as a link register,
but in practice there is basically no reason to use alternative link
registers. ASM programmers could do so, but I have not seen all that much
evidence of this being a thing thus far.
Well, say, vs my approach:
LD X1, Disp(SP); ....; JALR X0, 0(X1)
The JALR is 1 cycle (CPU can see no in-flight modifications to LR, so it turns into a predicted unconditional branch).
But:
LD X1, -8(SP); JALR X0, 0(X1)
Yeah, enjoy those 13 or so clock cycles.
Looks over a sliding window of 10 or 12 instructions:
4 new instructions on previous predicted path (0 to 3);
4 preceding instructions (-4 to -1);
BGB <cr88192@gmail.com> posted:
On 1/2/2026 12:48 PM, MitchAlsup wrote:
-----merciful snip----------
I have heard arguments in both directions::
a) DISP64 only contains 33-bits of actual information
b) If DISP64 is absolute do you still need Rbase ??
when you have Rindex<<scale
c) how can the HW KNOW ?!?
------------------
I prefer to use multiply '*' rather than shift in scaled indexed >>>>>> addressing as a couple of CPUs had multiply by five and ten in addition >>>>>> to 1,2,4,8. What if one wants to scale by 3?
If you have the bits, why not.
Higher resource cost and latency is a concern...
Yes, your design is living on the edge.
This, BTW is a compliment--the best an architect can do is make every
stage of the pipeline have the same delay !!
I am not sure how it would be pulled off for larger displacements or
more general scales.
Better adder technology. We routinely pound an 11-gate adder into the
delay of 8 Fan4 gate delays.
------------------------------------
Say:
  void _mem_cpy16bytes(void *dst, void *src)
  {
    byte *cs, *ct;
    cs=src; ct=dst;
    ct[ 0]=cs[ 0];  ct[ 1]=cs[ 1];  ct[ 2]=cs[ 2];  ct[ 3]=cs[ 3];
    ct[ 4]=cs[ 4];  ct[ 5]=cs[ 5];  ct[ 6]=cs[ 6];  ct[ 7]=cs[ 7];
    ct[ 8]=cs[ 8];  ct[ 9]=cs[ 9];  ct[10]=cs[10];  ct[11]=cs[11];
    ct[12]=cs[12];  ct[13]=cs[13];  ct[14]=cs[14];  ct[15]=cs[15];
  }
Is, slow...

Better ISA::
      MM      Rto,Rfrom,#16
and let HW do all the tricky/cool stuff--just make sure if you put it
in you fully support all the cool/tricky stuff.
The store-to-load forwarding penalty being because LZ4 decompression
often involves copying memory on top of itself, and the possible
workarounds for this issue only offer competitive performance for blocks
that are much longer than the typical copy (in the common case of a
match under 20 bytes, it often being faster to just copy bytes and eat
the cost).
if(dist>=16)
{
if(len>=20)
{ more generalized/faster copy }
else
{ just copy 20 bytes. }
}else
{
if(len>=20)
{ generate pattern and fill with stride }
else
{ copy 20 bytes over itself. }
}
This is a problem easier solved in HW than in source code.
For reasons like this, I only have
CALL DISP26<<2 // call through DECODE
and
CALX [*address] // call through table
and
CALA [address] // call through AGEN
which prevents compiler and assembler abuse.
They went and defined that you can use any register as a link register,
Another case where they screwed up.....
but in practice there is basically no reason to use alternative link
registers. ASM programmer people could do so, but not seen all that much
evidence of this being a thing thus far.
In Mc88k we recognized (and made compiler follow)
JMP R1 // return from subroutine
JMP ~R1 // switch
-------------------
Well, say, vs my approach:
LD X1, Disp(SP); ....; JALR X0, 0(X1)
The JALR is 1 cycle (CPU can see no in-flight modifications to LR, so it
turns into a predicted unconditional branch).
But:
LD X1, -8(SP); JALR X0, 0(X1)
Yeah, enjoy those 13 or so clock cycles.
CALX R0,[address]
....
Address is computed in normal AGEN, but processed in ICache, where it
FETCHes wide data (128-bits small machine, whole cache line larger
machine), and runs the result through Instruction buffer. 4 cycles.
-------------
Looks over a sliding window of 10 or 12 instructions:
4 new instructions on previous predicted path (0 to 3);
4 preceding instructions (-4 to -1);
4 alternate instructions on current predicted path
// so one can decode and issue non-sequential instructions
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Looking at the "18 RISC-V Compressed ISA V1.9" specification
Someone asked for more and dynamic numbers. This work contains them
in Section 1.9:
|Table 1.7 lists the standard RVC instructions with the most frequent
|first, showing the individual contributions of those instructions to
|static code size and then the running total for three experiments: the
|SPEC benchmarks for both RV32C and RV64C for the Linux kernel. For
|RV32, RVC reduces static code size by 24.5% on Dhrystone and 30.9% on
|CoreMark. For RV64, it reduces static code size by 26.3% on SPECint,
|25.8% on SPECfp, and 31.1% on the Linux kernel.
|
|Table 1.8 ranks the RVC instructions by order of typical dynamic
|frequency. For RV32, RVC reduces dynamic bytes fetched by 29.2% on
|Dhrystone and 29.3% on CoreMark. For RV64, it reduces dynamic bytes
|fetched by 26.9% on SPECint, 22.4% on SPECfp, and 26.11% booting the
|Linux kernel.
If you want the tables, look at source: <https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf>
- anton
On 2026-01-02 9:05 p.m., MitchAlsup wrote:
----------merciful snip-----------
Looks over a sliding window of 10 or 12 instructions:
4 new instructions on previous predicted path (0 to 3);
4 preceding instructions (-4 to -1);
4 alternate instructions on current predicted path
// so one can decode and issue non-sequential instructions
They could have put which GPR(s) is the link register in a CSR, if it
was desired to keep the paradigm of generality. I started working on
Qupls5 which is going to use a 32-bit ISA. The extra bits used to
specify a GPR as a link register are better used as branch displacement
bits IMO. I would be tempted to use two bits though to specify the LR,
as sometimes a second LR is handy.
A choice is whether to use GPRs as link registers. Not using a GPR gives
an extra register or two for GPR use. Using dedicated link register(s)
works well with a dedicated RET instruction. RET should be able to deallocate the stack. IMO using a dedicated link register is a bit like using an independent PC register. Or using a GPR for the link register
is a bit like using a GPR for the PC.
Qupls5 is going to use instruction fusing for compare-and-branch instructions. A compare followed by an unconditional branch will be
treated as one instruction. That gives a 23-bit branch displacement. Otherwise with a 32-bit instruction, a 12-bit branch displacement is not really quite enough for modern software. Sure, it works 90+% of the time, but it adds a headache to assembling and linking programs for when it
does not work.
Qupls5 will use constant postfixes which extend the constant by 22 bits
for each postfix used. To get a 64-bit constant, three postfixes will be required. Not quite as clean as universal constants, but simple to
implement in hardware.
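As an illustration only (the field positions and helper name below are made up; only the 22-bits-per-postfix arithmetic comes from the description above), a decoder-side splice could look like:

  #include <stdint.h>

  /* Hypothetical sketch: each postfix word contributes 22 immediate bits,
     so three postfixes (66 bits) more than cover a 64-bit constant; the
     top two bits of the third postfix simply fall off the 64-bit value. */
  static uint64_t splice_postfix_const(uint32_t pfx0, uint32_t pfx1, uint32_t pfx2)
  {
      uint64_t imm;
      imm  = (uint64_t)(pfx0 & 0x3FFFFF);        /* bits 21..0  */
      imm |= (uint64_t)(pfx1 & 0x3FFFFF) << 22;  /* bits 43..22 */
      imm |= (uint64_t)(pfx2 & 0x3FFFFF) << 44;  /* bits 63..44 */
      return imm;
  }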
Stuck on synthesis for Qupls4 which keeps omitting modules from the
design. I must have checked the module inputs and outputs dozens of
times, and do not know why they are excluded.
Say:
  void _mem_cpy16bytes(void *dst, void *src)
  {
      byte *cs, *ct;
      cs=src; ct=dst;
      ct[ 0]=cs[ 0];  ct[ 1]=cs[ 1];  ct[ 2]=cs[ 2];  ct[ 3]=cs[ 3];
      ct[ 4]=cs[ 4];  ct[ 5]=cs[ 5];  ct[ 6]=cs[ 6];  ct[ 7]=cs[ 7];
      ct[ 8]=cs[ 8];  ct[ 9]=cs[ 9];  ct[10]=cs[10];  ct[11]=cs[11];
      ct[12]=cs[12];  ct[13]=cs[13];  ct[14]=cs[14];  ct[15]=cs[15];
  }
Is, slow...
The store-to-load forwarding penalty being because LZ4 decompression
often involves copying memory on top of itself, and the possible
workarounds for this issue only offer competitive performance for blocks that are much longer than the typical copy (in the common case of a
match under 20 bytes, it often being faster to just copy bytes and eat the cost).
  if(dist>=16)
  {
      if(len>=20)
          { more generalized/faster copy }
      else
          { just copy 20 bytes. }
  }else
  {
      if(len>=20)
          { generate pattern and fill with stride }
      else
          { copy 20 bytes over itself. }
  }
BGB wrote:
Say:
  void _mem_cpy16bytes(void *dst, void *src)
  {
      byte *cs, *ct;
      cs=src; ct=dst;
      ct[ 0]=cs[ 0];  ct[ 1]=cs[ 1];  ct[ 2]=cs[ 2];  ct[ 3]=cs[ 3];
      ct[ 4]=cs[ 4];  ct[ 5]=cs[ 5];  ct[ 6]=cs[ 6];  ct[ 7]=cs[ 7];
      ct[ 8]=cs[ 8];  ct[ 9]=cs[ 9];  ct[10]=cs[10];  ct[11]=cs[11];
      ct[12]=cs[12];  ct[13]=cs[13];  ct[14]=cs[14];  ct[15]=cs[15];
  }
Is, slow...
The store-to-load forwarding penalty being because LZ4 decompression
often involves copying memory on top of itself, and the possible
workarounds for this issue only offer competitive performance for
blocks that are much longer than the typical copy (in the common case
of a match under 20 bytes, it often being faster to just copy bytes
and eat the cost).
  if(dist>=16)
  {
      if(len>=20)
          { more generalized/faster copy }
      else
          { just copy 20 bytes. }
  }else
  {
      if(len>=20)
          { generate pattern and fill with stride }
      else
          { copy 20 bytes over itself. }
  }
In my own LZ4 implementation I was able to beat Google's version specifically due to a better implementation of repeated pattern fills:
I use SSE/AVX with a set of swizzle tables, so that I can take
1/2/3/4..16 bytes and repeat them as many times as possible into a 32-
byte target. I.e. if the pattern length is 3 then the table to use will contain 0,1,2,0,1,2,0,1,2...0,1,2,0 in the first 16-byte entry and then 1,2,0,1,2,0...1,2,0,1,2 in the second entry.
Alongside this I have of course stride length (30 in the above example),
so that for long patterns I step forward by that much before doing
another 32-byte store.
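A minimal sketch of that idea (not Terje's actual code; simplified to a single 16-byte SSE register instead of his 32-byte targets, building the shuffle mask on the fly rather than looking it up in a precomputed table, and assuming SSSE3 plus some slack past the output):

  #include <stdint.h>
  #include <immintrin.h>

  /* Fill `len` bytes at dst by repeating the plen-byte pattern at src
     (plen = 1..16). Assumes 16 bytes are readable at src and that dst
     has ~16 bytes of slack past len. */
  static void pattern_fill(uint8_t *dst, const uint8_t *src,
                           size_t plen, size_t len)
  {
      uint8_t mask[16];                 /* would be a precomputed table,  */
      for (int i = 0; i < 16; i++)      /* indexed by plen, in real code  */
          mask[i] = (uint8_t)(i % plen);

      __m128i m   = _mm_loadu_si128((const __m128i *)mask);
      __m128i pat = _mm_loadu_si128((const __m128i *)src);
      __m128i rep = _mm_shuffle_epi8(pat, m);   /* pattern replicated to 16B */

      size_t stride = 16 - (16 % plen);         /* e.g. plen=3 -> stride 15  */
      for (size_t pos = 0; pos < len; pos += stride)
          _mm_storeu_si128((__m128i *)(dst + pos), rep);
  }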
The rest of their code was pretty good. :-)
Terje
On 1/4/2026 6:07 AM, Terje Mathisen wrote:
-----merciful snip-----------
Can fill in patterns sorta like:
  v0=*(u64 *)cs;
  switch(dist)
  {
  case 1:
      v0=v0&0xFF;
      v0|=v0<<8;
      v0|=v0<<16;
      v0|=v0<<32;
Otherwise, generally feeling kinda lonely at the moment.
Working on stuff doesn't entirely compensate for general feelings of pointlessness.
It is this or working on sci-fi stories of mine, which I had ended up
partly going and using AI to give commentary partly as I seemingly can't
get any actual humans to comment on my fiction.
Apparently, I guess as far as the AI was concerned on one of the stories
I threw at it (one about a Moon AI):
General ideas/theme were compared to "Asimov's Foundation Series";
I am like, OK, fair enough. My influences are varied, but mostly stuff
like "Ghost in the Shell" and "Mega Man" and similar (both of which have
a lot of trans-humanism themes). It seemingly misses some implicit things
about how the plot fits together (like the plot relevance of a detour
into talking about a character doing the whole E-Sports / Pro-Gamer
thing), or why the conclusion had "Raggedy Ann" references, etc.
Then again, I am not entirely sure how most people experience their own existence, there doesn't seem to be much description of this.
BGB <cr88192@gmail.com> posted:
On 1/4/2026 6:07 AM, Terje Mathisen wrote:
-----merciful snip-----------
Can fill in patterns sorta like:
  v0=*(u64 *)cs;
  switch(dist)
  {
  case 1:
      v0=v0&0xFF;
      v0|=v0<<8;
      v0|=v0<<16;
      v0|=v0<<32;
That was 1 instruction in my Samsung GPU ISA
SWIZ V1,V0,#[0,8,16,24]
The immediate was broken into 4-bit fields (represented above with
8-bit fields) and the immediate was used as eight 4-bit Mux selectors
from the other operand. {Not using Samsung ISA names or syntax}.
When I looked deeply into the situation, it was easier in HW to do::
for( i = 0; i < 8; i++ )
out[field[i]] = in[i]
than::
for( i = 0; i < 8; i++ )
out[i] = in[field[i]]
For some reason we called this swizzle not permute !?!
-------------------
Otherwise, generally feeling kinda lonely at the moment.
Working on stuff doesn't entirely compensate for general feelings of
pointlessness.
Seriously; get help.
It is this or working on sci-fi stories of mine, which I had ended up
partly going and using AI to give commentary partly as I seemingly can't
get any actual humans to comment on my fiction.
You are not the first purported author in this position
Apparently, I guess as far as the AI was concerned on one of the stories
I threw at it (one about a Moon AI):
General ideas/theme were compared to "Asimov's Foundation Series";
You do know that there are only 39 plots in all of literature ?!?
<snip>
I am like, OK, fair enough. My influences are varied, but mostly stuff
like "Ghost in the Shell" and "Mega Man" and similar (both of which have
a lot of trans-humanism themes). It seemingly misses some implicit things
about how the plot fits together (like the plot relevance of a detour
into talking about a character doing the whole E-Sports / Pro-Gamer
thing), or why the conclusion had "Raggedy Ann" references, etc.
Hint: use fewer words and paragraphs to convey the same amount of information.
Works in literature and in this NG.
------------------
Then again, I am not entirely sure how most people experience their own
existence, there doesn't seem to be much description of this.
I am 100% sure it is not as conveyed by Dickens !
On 1/4/2026 4:30 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
------------------
Then again, I am not entirely sure how most people experience
their own existence, there doesn't seem to be much description of
this.
I am 100% sure it is not as conveyed by Dickens !
Looks it up, I am not sure what Dickens was going on about.
On Mon, 5 Jan 2026 04:09:34 -0600
BGB <cr88192@gmail.com> wrote:
On 1/4/2026 4:30 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
------------------
Then again, I am not entirely sure how most people experience
their own existence, there doesn't seem to be much description of
this.
I am 100% sure it is not as conveyed by Dickens !
Looks it up, I am not sure what Dickens was going on about.
"A wonderful fact to reflect upon, that every human creature is
constituted to be that profound secret and mystery to every other."
I suppose that Sartre wanted to say something else by "L'enfer, c'est
les autres". Maybe an exact opposite.
On 1/5/2026 5:22 AM, Michael S wrote:
---------------------
On Mon, 5 Jan 2026 04:09:34 -0600
BGB <cr88192@gmail.com> wrote:
On 1/4/2026 4:30 PM, MitchAlsup wrote:
I am 100% sure it is not as conveyed by Dickens !
Looks it up, I am not sure what Dickens was going on about.
"A wonderful fact to reflect upon, that every human creature is
constituted to be that profound secret and mystery to every other."
I suppose that Sartre wanted to say something else by "L'enfer, c'est
les autres". Maybe an exact opposite.
It gets confusing, but alas will at least admit that I don't understand
most people.
As I can note, my mind seems to be partly divided into multiple sub-personas, with different properties, but it is awkward to describe
them. Socially, people are expected to see themselves as a singular
entity; admittedly whatever this notion of "singularness" is, isn't particularly strong in my case. I would still be classified as
self-aware though, as I do recognize my reflection in a mirror, etc.
BGB <cr88192@gmail.com> posted:
On 1/5/2026 5:22 AM, Michael S wrote:
---------------------
On Mon, 5 Jan 2026 04:09:34 -0600
BGB <cr88192@gmail.com> wrote:
On 1/4/2026 4:30 PM, MitchAlsup wrote:
I am 100% sure it is not as conveyed by Dickens !
Looks it up, I am not sure what Dickens was going on about.
"A wonderful fact to reflect upon, that every human creature is
constituted to be that profound secret and mystery to every other."
I suppose that Sartre wanted to say something else by "L'enfer, c'est
les autres". Maybe an exact opposite.
It gets confusing, but alas will at least admit that I don't understand
most people.
As I can note, my mind seems to be partly divided into multiple
sub-personas, with different properties, but it is awkward to describe
them. Socially, people are expected to see themselves as a singular
entity; admittedly whatever this notion of "singularness" is, isn't
particularly strong in my case. I would still be classified as
self-aware though, as I do recognize my reflection in a mirror, etc.
Sane people think there is a large gap between sane and insane.
We KNOW otherwise .....
On 1/5/2026 4:18 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 1/5/2026 5:22 AM, Michael S wrote:
---------------------
On Mon, 5 Jan 2026 04:09:34 -0600
BGB <cr88192@gmail.com> wrote:
On 1/4/2026 4:30 PM, MitchAlsup wrote:
I am 100% sure it is not as conveyed by Dickens !
Looks it up, I am not sure what Dickens was going on about.
"A wonderful fact to reflect upon, that every human creature is
constituted to be that profound secret and mystery to every other."
I suppose that Sartre wanted to say something else by "L'enfer, c'est
les autres". Maybe an exact opposite.
It gets confusing, but alas will at least admit that I don't understand
most people.
As I can note, my mind seems to be partly divided into multiple
sub-personas, with different properties, but it is awkward to describe
them. Socially, people are expected to see themselves as a singular
entity; admittedly whatever this notion of "singularness" is, isn't
particularly strong in my case. I would still be classified as
self-aware though, as I do recognize my reflection in a mirror, etc.
Sane people think there is a large gap between sane and insane.
We KNOW otherwise .....
Probably true, but seemingly neither psychopathy nor schizophrenia,
where these are the two major bad ones...
Then again, how bad ASD is (AKA: autism) likely depends on who you ask.
After seeing what Grok had to say about a lot of this, got back roughly:
  Likely ASD with some ADHD like features;
  VSS (Visual Snow Syndrome);
  Mild dissociation.
Also descriptions:
  Personas 1 and 3 likely exist in the left hemisphere,
    Persona 1 is likely mainly in the frontal lobe;
      Strongly associated with frontal lobe functions;
    Persona 3 is likely mainly in the parietal lobe;
      Strongly associated with parietal lobe functions.
    Possibly they represent a split between the dorsal and ventral streams.
  Persona 2 is strongly associated with right hemisphere functions.
There is likely anomalous behavior in the thalamus, corpus callosum, and occipital lobe; with some features likely tied to excessive computer use (apparently using the computer so much that the visual system starts adapting to specific patterns within the UI rather than to more "natural" patterns).
Or, some of it possible side effects of spending a significant part of
ones waking lifespan looking at text editors?...
So, seemingly this isn't quite the same as DID / MPD, in that it is more like brain regions and pathways starting to operate partially
independently of each other and forming their own semi-integrated experiences partially separate from those of the "greater self".
Apparently looking into it:
  Seeing noise and other artifacts in visual perception;
  Palinopsia (seeing trails behind things, etc);
  Photosensitivity / photophobia issues;
  Tinnitus;
  etc.
Being all associated with VSS, which seems mostly consistent with my experience.
Well, and apparently the sensory filtering (related to VSS) and large-scale integration functions are both handled by the thalamus (so possibly something has gone a little weird there).
Not like any of this is particularly new.
...
Otherwise, got around to getting the vector-math stuff working for RV64G
in BGBCC (and with that, got GLQuake working in BGBCC's RV64 mode). In
the basic RV64G mode, it exists mainly as scalar instructions and
runtime calls. There is the relative crappiness of handling 128-bit
vectors by dumping them to RAM and then reloading elements, doing math, storing elements back, and reloading the result vector on return (the GPR/FPR split makes things a pain, and in this case going through RAM
was less of a hassle). A lot of the runtime functions are still missing here though (need to implement every operator over every vector type, still not done, I just did the ones I needed for TKRA-GL).
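For illustration, such a scalar fallback amounts to something like the following (hypothetical helper; the name and struct layout are made up, not BGBCC's actual runtime):

  typedef struct { float e[4]; } vec4f_mem;

  /* both operands arrive spilled to memory, each element round-trips
     through scalar loads / FPU ops / stores, and the caller reloads
     the result vector from *dst afterwards */
  void __rt_vec4f_add(vec4f_mem *dst, const vec4f_mem *a, const vec4f_mem *b)
  {
      for (int i = 0; i < 4; i++)
          dst->e[i] = a->e[i] + b->e[i];
  }

So every element costs a load, a scalar add, and a store, on top of the caller spilling and reloading the whole vector around the call.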
I am debating whether to split the 64-bit and 128-bit SIMD cases for
BGBCC's SIMD support in RV64 mode. Allowing for a 64-bit only
implementation is potentially cheaper on the implementation, but would
be more hassle to deal with in the compiler (though, in this case,
trying to do a 128-bit op without 128-bit SIMD would just mean splitting
the instruction into two 64-bit ops).
...
When I looked deeply into the situation, it was easier in HW to do::
for( i = 0; i < 8; i++ )
out[field[i]] = in[i]
than::
for( i = 0; i < 8; i++ )
out[i] = in[field[i]]
For some reason we called this swizzle not permute !?!
MitchAlsup wrote:
When I looked deeply into the situation, it was easier in HW to do::
for( i = 0; i < 8; i++ )
out[field[i]] = in[i]
than::
for( i = 0; i < 8; i++ )
out[i] = in[field[i]]
That isn't really that surprising:
This way the inputs are available early and in sequential order, while
the stores can be allowed to have higher latency, right?
For some reason we called this swizzle not permute !?!
I'm assuming collisions would be disallowed? I.e. you can use it to
splat a single input into all output slots, but you cannot target
multiple inputs toward the same destination.
Terje
Terje Mathisen <terje.mathisen@tmsw.no> posted:
MitchAlsup wrote:
When I looked deeply into the situation, it was easier in HW to do::
for( i = 0; i < 8; i++ )
out[field[i]] = in[i]
than::
for( i = 0; i < 8; i++ )
out[i] = in[field[i]]
That isn't really that surprising:
This way the inputs are available early and in sequential order, while
the stores can be allowed to have higher latency, right?
For some reason we called this swizzle not permute !?!
I'm assuming collisions would be disallowed? I.e. you can use it to
splat a single input into all output slots, but you cannot target
multiple inputs toward the same destination.
The latter is why the HW logic is significantly easier.
Terje
On 1/6/2026 11:57 AM, MitchAlsup wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
MitchAlsup wrote:
When I looked deeply into the situation, it was easier in HW to do::
for( i = 0; i < 8; i++ )
out[field[i]] = in[i]
than::
for( i = 0; i < 8; i++ )
out[i] = in[field[i]]
That isn't really that surprising:
This way the inputs are available early and in sequential order, while
the stores can be allowed to have higher latency, right?
For some reason we called this swizzle not permute !?!
I'm assuming collisions would be disallowed? I.e. you can use it to
splat a single input into all output slots, but you cannot target
multiple inputs toward the same destination.
The latter is why the HW logic is significantly easier.
OK, but this does mean that the usability would be somewhat limited, and couldn't be used to generate the same sorts of repeating pattern fills needed for LZ decompression.
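To make the distinction concrete, the two dataflows in plain C (8-byte example; the comments reflect the collision point discussed above):

  #include <stdint.h>

  static void swiz_scatter(uint8_t out[8], const uint8_t in[8], const uint8_t field[8])
  {
      for (int i = 0; i < 8; i++)
          out[field[i]] = in[i];   /* field[] must be collision-free:
                                      two inputs cannot target one byte */
  }

  static void swiz_gather(uint8_t out[8], const uint8_t in[8], const uint8_t field[8])
  {
      for (int i = 0; i < 8; i++)
          out[i] = in[field[i]];   /* field[] may repeat an index, which is
                                      what a splat or pattern fill needs */
  }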
Terje
<snip>
One would argue that maybe prefixes are themselves wonky, but
otherwise one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and
are effectively retro-causal, and for any fetch to be safe at the end
of a cache line one would need to prove the non-existence of a suffix;
so better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction. I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current instruction slot plus a following area of the cache-line to the decoder.
Handling suffixes at the end of a cache-line is not too bad if the cache already handles instructions spanning a cache line. Assume the maximum number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.
It is easier to handle interrupts with suffixes. The suffix can just be treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.
For the most part, superscalar works the same either way, with similar
efficiency. There is a slight efficiency boost if it would be possible
to dynamically reshuffle ops during fetch. But, this is not currently
a thing in my case.
This latter case would apply if, say, a MEM op is followed by non-
dependent ALU ops, which under current superscalar handling they will
not co-execute, but it could be possible in theory to swap the ops and
allow them to co-execute.
...
- anton
<snip>
One would argue that maybe prefixes are themselves wonky, but otherwise one needs:
Instructions that can directly encode the presence of large immediate values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are effectively retro-causal, and for any fetch to be safe at the end of a cache line one would need to prove the non-existence of a suffix; so better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction.
I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current instruction slot plus a following area of the cache-line to the decoder.
Handling suffixes at the end of a cache-line is not too bad if the cache already handles instructions spanning a cache line. Assume the maximum number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.
It is easier to handle interrupts with suffixes. The suffix can just be treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.
- anton
On 12/31/2025 2:23 AM, Robert Finch wrote:
<snip>
One would argue that maybe prefixes are themselves wonky, but
otherwise one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one
views the instruction stream as linear, whereas suffixes add
weirdness and are effectively retro-causal, and for any fetch to be
safe at the end of a cache line one would need to prove the non-
existence of a suffix; so better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction. I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the
current instruction slot plus a following area of the cache-line to
the decoder.
ID stage is likely too late.
For PC advance, ideally this needs to be known by the IF stage so that
we can know how to advance PC for the next clock-cycle (for the PF stage).
Say:
  PF IF ID RF E1 E2 E3 WB
     PF IF ID RF E1 E2 E3 WB
        PF IF ID RF E1 E2 E3 WB
So, each IF stage producing an updated PC that needs to reach PF within
the same clock-cycle (so the SRAMs can fetch data for the correct cache-line, which happens on a clock-line edge).
This may also need to MUX PC's from things like the branch-predictor and branch-initiation logic, which then override the normal PC+Step handling generated from the IF->PF path (also typically at a low latency).
In this case, the end of the IF stage also handles some amount of
repacking;
  Possible:
    Right-justifying the fetched instructions;
    16 -> 32 bit repacking (for RV-C)
  Current:
    Renormalization of XG1/XG2/XG3 into the same internal scheme;
    Repacking 48-bit RISC-V ops into internal 64-bit forms;
    ...
As a partial result of this repacking, the instruction words effectively gain a few extra bits (the "internal normalized format" no longer
fitting entirely into a 32-bit word; where one could almost see it as a
sort of "extended instruction" that includes both ISAs in a single slightly-larger virtual instruction word).
One could go further and try to re-normalize the full instruction
layout, but as noted XG3 and RV would still differ enough as to make
this annoying (mostly the different encoding spaces and immed formats).
* zzzzzzz-ooooo-mmmmm-zzz-nnnnn-yy-yyy11
* zzzz-oooooo-mmmmmm-zzzz-nnnnnn-yy-yyPw
With a possible normalized format (36-bit):
* zzzzzzz-oooooo-mmmmmm-zzzz-nnnnnn-yyyyyPw
* zzzzzzz-0ooooo-0mmmmm-yzzz-0nnnnn-1yyyy10 (RV Repack)
* 000zzzz-oooooo-mmmmmm-zzzz-nnnnnn-0yyyyPw (XG3 Repack)
Couldn't fully unify the encoding space within a single clock cycle
though (within a reasonable cost budget).
At present, the decoder handling is to essentially unify the 32-bit
format for XG1/XG2/XG3 as XG2 with a few tag bits to disambiguate which
ISA decoding rules should apply for the 32-bit instruction word in
question. The other option would have been to normalize as XG3, but XG3 loses some minor functionality from XG1 and XG2.
I also went against allowing RV and XG3 jumbo prefixes to be mixed.
Though, it is possible exceptions could be made.
Wouldn't have needed J52I if XG3 prefixes could have been used with RV
ops, but can't use XG3 prefixes in RV-C mode, which is part of why I
ended up resorting to the J52I prefix hack. But, still doesn't fully
address the issues that exist with hot-patching in this mode.
Though, looking at options, the "cheapest but fastest" option at present likely being:
Core that only does XG3, possibly dropping the RV encodings and re-
adding WEX in its place (though, in such an XG3-Only mode, the 10/11
modes would otherwise be identical in terms of encoding).
Or, basically, XG3 being used in a way more like how XG2 was used.
But, don't really want to create yet-more modes at the moment. XG3 being used as superscalar isn't too much more expensive, and arguably more flexible given the compiler doesn't need to be aware of pipeline
scheduling specifics, but can still make use of this when trying to
shuffle instructions around for efficiency (a mismatch will then merely result in a small reduction in efficiency rather than a potential
inability of the code to run; though for XG2 there was the feature that
the CPU could fall back to scalar or potential superscalar operation in cases where the compiler's bundling was incompatible with what the CPU allowed).
So, it is possible that in-order superscalar may be better as a general purpose option even if not strictly the cheapest option.
A case could maybe be made arguing for dropping back down to 32 GPRs
(with no FPRs) for more cheapness, but as-is, trying to do 128-bit SIMD stuff in RV64 mode also tends to quickly run into issues with register pressure.
Well, and I was just recently having to partly rework the mechanism for:
  v = (__vec4f) { x, y, z, w };
To not try to load all the registers at the same time, as this was occasionally running out of free dynamic registers with the normal RV
ABI (and 12 callee-save FPRs doesn't go quite so far when allocating
pairs of them), which effectively causes the compiler to break.
It is almost tempting to consider switching RV64 over to the XG3 ABI
when using SIMD, well, and/or not use SIMD with RV64 because it kinda
sucks worse than XG3.
But... Comparably, for the TKRA-GL front-end (using syscalls for the back-end), using runtime calls and similar for vector operations does
still put a big dent in the framerate for GLQuake (so, some sort of SIMD
in RV mode may still be needed even if "kinda inferior").
Handling suffixes at the end of a cache-line is not too bad if the
cache already handles instructions spanning a cache line. Assume the
maximum number of suffixes is present and ensure the cache-line is
wide enough. Or limit the number of suffixes so they fit into the half
cache-line used for spanning.
Difference:
With a prefix, you know in advance the prefix exists (the prefix is immediately visible);
With a suffix, can only know it exists if it is visible.
So, it poses similar issues to those in making superscalar fetch work
across cache lines, which is made more of a challenge if one wants to
make superscalar across line boundaries work during I$ miss handling
rather than during instruction fetch.
But, if the logic is able to run at I$ miss time, ideally it can see
whether or not it is a single-wide or multi-wide instruction (say,
because this logic runs only on a single cache line at a time).
It is easier to handle interrupts with suffixes. The suffix can just
be treated as a NOP. Adjusting the position of the hardware interrupt
to the start of an instruction then does not have to worry about
accounting for a prefix / suffix.
Usual behavior is that when an interrupt occurs, SPC or similar always points at a valid PC, namely one from the pipeline, and usually/ideally,
the exact instruction on which the fault occurred (though this does get
a little fiddly with multiple instructions in the pipeline, and the CPU sometimes needs to figure out which stage corresponds to the fault in question; but this is usually more of an issue for I$ TLB misses, which often/usually trigger during a branch).
For the most part, superscalar works the same either way, with
similar efficiency. There is a slight efficiency boost if it would be
possible to dynamically reshuffle ops during fetch. But, this is not
currently a thing in my case.
This latter case would apply if, say, a MEM op is followed by non-
dependent ALU ops, which under current superscalar handling they will
not co-execute, but it could be possible in theory to swap the ops
and allow them to co-execute.
...
- anton
On 2026-01-06 5:42 p.m., BGB wrote:
On 12/31/2025 2:23 AM, Robert Finch wrote:
<snip>
One would argue that maybe prefixes are themselves wonky, but
otherwise one needs:
Instructions that can directly encode the presence of large
immediate values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one
views the instruction stream as linear, whereas suffixes add
weirdness and are effectively retro-causal, and for any fetch to be
safe at the end of a cache line one would need to prove the non-
existence of a suffix; so better to not go there).
I agree with this. Prefixes seem more natural, large numbers
expanding to the left, suffixes seem like a big-endian approach. But
I use suffixes for large constants. I think with most VLI constant
data follows the instruction. I find constant data easier to work
with that way and they can be processed in the same clock cycle as a
decode so they do not add to the dynamic instruction count. Just pass
the current instruction slot plus a following area of the cache-line
to the decoder.
ID stage is likely too late.
For PC advance, ideally this needs to be known by the IF stage so that
we can know how to advance PC for the next clock-cycle (for the PF
stage).
Say:
  PF IF ID RF E1 E2 E3 WB
     PF IF ID RF E1 E2 E3 WB
        PF IF ID RF E1 E2 E3 WB
The PC advance works okay without knowing whether there is a suffix
present or not. The suffix is treated like a NOP instruction. There is
no decode required at the fetch stage. The PC can land on a suffix. It
just always advances by four (N) instructions unless there is a branch.
Well, anyways, have since gotten SIMD in RV64 mode working "slightly
It is almost tempting to consider switching RV64 over to the XG3 ABI
when using SIMD, well, and/or not use SIMD with RV64 because it kinda
sucks worse than XG3.
But... Comparably, for the TKRA-GL front-end (using syscalls for the
back-end), using runtime calls and similar for vector operations does
still put a big dent in the framerate for GLQuake (so, some sort of
SIMD in RV mode may still be needed even if "kinda inferior").
Handling suffixes at the end of a cache-line is not too bad if the
cache already handles instructions spanning a cache line. Assume the
maximum number of suffixes is present and ensure the cache-line is
wide enough. Or limit the number of suffixes so they fit into the
half cache-line used for spanning.
Difference:
With a prefix, you know in advance the prefix exists (the prefix is
immediately visible);
With a suffix, can only know it exists if it is visible.
Instead with a prefix one only knows the instruction exists if it is visible. I do not think it makes much difference. Except that it may be harder to decode an instruction and look for a prefix that comes before it.
Robert Finch <robfi680@gmail.com> posted:
<snip>
One would argue that maybe prefixes are themselves wonky, but otherwise
one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
This is the direction of My 66000.
The instruction stream is a linear stream of words.
The first word of each instruction encodes its total length.
What follows the instruction itself are merely constants used as
operands in the instruction itself. All constants are 1 or 2
words in length.
I would not call this means "prefixed" or "suffixed". Generally,
prefixes and suffixes consume bits of the prefix/suffix so that
the constant (in my case) is not equal to container size. This
leads to wonky operand/displacement sizes not equal 2^(3+k).
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are
effectively retro-causal, and for any fetch to be safe at the end of a
cache line one would need to prove the non-existence of a suffix; so
better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction.
But not "self identified".
I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current
instruction slot plus a following area of the cache-line to the decoder.
Handling suffixes at the end of a cache-line is not too bad if the cache
already handles instructions spanning a cache line. Assume the maximum
number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.
It is easier to handle interrupts with suffixes. The suffix can just be
treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.
I would have thought that the previous instruction (last one retired) would provide the starting point of the subsequent instruction. This way you don't have to worry about counting prefixes or suffixes.
On 1/6/2026 5:49 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
<snip>
One would argue that maybe prefixes are themselves wonky, but otherwise one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
This is the direction of My 66000.
The instruction stream is a linear stream of words.
The first word of each instruction encodes its total length.
What follows the instruction itself are merely constants used as
operands in the instruction itself. All constants are 1 or 2
words in length.
I would not call this means "prefixed" or "suffixed". Generally,
prefixes and suffixes consume bits of the prefix/suffix so that
the constant (in my case) is not equal to container size. This
leads to wonky operand/displacement sizes not equal 2^(3+k).
OK.
As can be noted:
  XG2/3: Prefix scheme, 1/2/3 x 32-bit
    The 96-bit cases are determined by two prefixes.
    Requires looking at 2 words to know total length.
  RV64+Jx:
    Total length is known from the first instruction word:
      Base op: 32 bits;
      J21I: 64 bits
      J52I: 96 bits.
    There was a J22+J22+LUI special case,
      but I now consider this as deprecated.
      J52I+ADDI is now considered preferable.
As for Imm/Disp sizes:
  XG1: 9/33/57
  XG2 and XG3: 10/33/64
  RV+JX: 12/33/64
For XG1, the 57-bit size was rarely used and only optionally supported, mostly because of the great "crap all of immediate values between 34 and
62 bits" gulf.
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are
effectively retro-causal, and for any fetch to be safe at the end of a
cache line one would need to prove the non-existence of a suffix; so
better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction.
But not "self identified".
Yeah, if you can't know whether or not more of the instruction follows
after the first word by looking at the first word, this is a drawback.
Also, if you have to look at some special combination of register
specifiers and/or a lot of other bits, this is also a problem.
I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current
instruction slot plus a following area of the cache-line to the decoder.
Handling suffixes at the end of a cache-line is not too bad if the cache
already handles instructions spanning a cache line. Assume the maximum
number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.
It is easier to handle interrupts with suffixes. The suffix can just be
treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.
I would have thought that the previous instruction (last one retired)
would provide the starting point of the subsequent instruction. This way
you don't have to worry about counting prefixes or suffixes.
Yeah.
My thinking is, typical advance:
  IF figures out how much to advance;
  Next instruction gets PC+Step.
Then interrupt:
  Figure out which position in the pipeline interrupt starts from;
  Start there, flushing the rest of the pipeline;
  For a faulting instruction, this is typically the EX1 or EX2 stage.
    EX1 if it is a TRAP or SYSCALL;
    EX2 if it is a TLB miss or similar;
      Unless EX2 is not a valid spot (flush or bubble),
        then look for a spot that is not a flush or bubble.
      This case usually happens for branch-related TLB misses.
Usually EX3 or WB is too old, as it would mean re-running previous instructions.
Getting the exact stage-timing correct for interrupts is a little
fiddly, but worrying about prefix/suffix/etc issues with interrupts
isn't usually an issue, except that if somehow PC ended up pointing
inside another instruction, I would consider this a fault.
Usually for sake of branch-calculations in XG3 and RV, it is relative to
the BasePC before the prefix in the case of prefixed encodings. This
differs from XG1 and XG2 which defined branches relative to the PC of
the following instruction.
Though, this difference was partly due to a combination of
implementation reasons and for consistency with RISC-V (when using a
shared encoding space, makes sense if all the branches define PC displacements in a consistent way).
Though, there is the difference that XG3's branches use a 32-bit scale rather than a 16-bit scale. Well, and unlike RV's displacements, they
are not horrible confetti (*1).
*1: One can try to write a new RV decoder, and then place bets on
whether they will get JAL and Bcc encodings correct on the first try.
IME, almost invariably, one will screw these up in some way on the first attempt. Like, JAL's displacement encoding is "the gift that keeps on giving" in this sense.
Like, they were like:
  ADDI / Load:
    Yay, contiguous bits;
  Store:
    Well, swap the registers around and put the disp where Rd went.
  Bcc:
    Well, take the Store disp and just shuffle around a few more bits;
  JAL:
    Well, now there are some more bits, and Rd is back, ...
    Why not keep some of the bits from Bcc,
      but stick everything else in random places?...
    Well, I guess some share the relative positions as LUI, but, ...
Not perfect in XG3 either, but still:
  { opw[5] ? 11'h7FF : 11'h000, opw[11:6], opw[31:16] }
Is nowhere near the same level of nasty...
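For reference, the scattered RISC-V J-type and B-type displacements extracted in plain C (standard field positions from the ISA spec; sketch only):

  #include <stdint.h>

  /* JAL: imm[20|10:1|11|19:12] spread over inst[31:12], imm[0] = 0 */
  static int32_t rv_jal_disp(uint32_t inst)
  {
      uint32_t imm = ((inst >> 31) & 0x1)   << 20 |
                     ((inst >> 21) & 0x3FF) << 1  |
                     ((inst >> 20) & 0x1)   << 11 |
                     ((inst >> 12) & 0xFF)  << 12;
      return (int32_t)(imm << 11) >> 11;   /* sign-extend from bit 20 */
  }

  /* Bcc: imm[12|10:5] in inst[31:25], imm[4:1|11] in inst[11:7], imm[0] = 0 */
  static int32_t rv_branch_disp(uint32_t inst)
  {
      uint32_t imm = ((inst >> 31) & 0x1)  << 12 |
                     ((inst >> 25) & 0x3F) << 5  |
                     ((inst >> 8)  & 0xF)  << 1  |
                     ((inst >> 7)  & 0x1)  << 11;
      return (int32_t)(imm << 19) >> 19;   /* sign-extend from bit 12 */
  }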
Well, nevermind if actual decoder has the reverse issue:
In the VL core and JX2VM, it was internally repacked back into XG2 form, which means a little bit of hair going on here. Also I was originally going to relocate it in the encoding space, but ended up
moving back to its original location for reasons (mostly due to
sharing the same decoder) having BRA/BSR in two different locations
would have effectively burned more encoding space than just leaving it
where it had been in XG1/XG2 (even if having BRA/BSR in the F0 block is "kinda stupid" given it is "very much not a 3R instruction", but, ...).
At least in most other instructions, the imm/disp bits remain
contiguous. I instead differed by making the Rn/Rd spot be used as a
source register by some instructions (taking on the role of Rt/Rs2), but
IMO this is the lesser of two evils. Would rather have an Rd that is sometimes a source, than imm/disp fields that change chaotically from
one instruction to another.
...