<snip>
One would argue that maybe prefixes are themselves wonky, but
otherwise one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and
are effectively retro-causal, and for any fetch to be safe at the end
of a cache line one would need to prove the non-existence of a suffix;
so better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction. I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current instruction slot plus a following area of the cache-line to the decoder.
Handling suffixes at the end of a cache-line is not too bad if the cache already handles instructions spanning a cache line. Assume the maximum number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.
It is easier to handle interrupts with suffixes. The suffix can just be treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.
For the most part, superscalar works the same either way, with similar
efficiency. There is a slight efficiency boost if it would be possible
to dynamically reshuffle ops during fetch. But, this is not currently
a thing in my case.
This latter case would apply if, say, a MEM op is followed by non-
dependent ALU ops, which under current superscalar handling they will
not co-execute, but it could be possible in theory to swap the ops and
allow them to co-execute.
...
- anton
On 12/31/2025 2:23 AM, Robert Finch wrote:
<snip>
One would argue that maybe prefixes are themselves wonky, but
otherwise one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one
views the instruction stream as linear, whereas suffixes add
weirdness and are effectively retro-causal, and for any fetch to be
safe at the end of a cache line one would need to prove the
non-existence of a suffix; so better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction. I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the
current instruction slot plus a following area of the cache-line to
the decoder.
ID stage is likely too late.
For PC advance, ideally this needs to be known by the IF stage so that
we can know how to advance PC for the next clock-cycle (for the PF stage).
Say:
 PF IF ID RF E1 E2 E3 WB
    PF IF ID RF E1 E2 E3 WB
       PF IF ID RF E1 E2 E3 WB
So, each IF stage produces an updated PC that needs to reach PF within
the same clock-cycle (so the SRAMs can fetch data for the correct
cache-line, which happens on a clock edge).
This may also need to MUX PC's from things like the branch-predictor and branch-initiation logic, which then override the normal PC+Step handling generated from the IF->PF path (also typically at a low latency).
In this case, the end of the IF stage also handles some amount of
repacking;
 Possible:
   Right-justifying the fetched instructions;
   16 -> 32 bit repacking (for RV-C).
 Current:
   Renormalization of XG1/XG2/XG3 into the same internal scheme;
   Repacking 48-bit RISC-V ops into internal 64-bit forms;
   ...
As a partial result of this repacking, the instruction words effectively
gain a few extra bits (the "internal normalized format" no longer
fitting entirely into a 32-bit word); one could almost see it as a
sort of "extended instruction" that includes both ISAs in a single
slightly-larger virtual instruction word.
One could go further and try to re-normalize the full instruction
layout, but as noted XG3 and RV would still differ enough as to make
this annoying (mostly the different encoding spaces and immed formats).
* zzzzzzz-ooooo-mmmmm-zzz-nnnnn-yy-yyy11
* zzzz-oooooo-mmmmmm-zzzz-nnnnnn-yy-yyPw
With a possible normalized format (36-bit):
* zzzzzzz-oooooo-mmmmmm-zzzz-nnnnnn-yyyyyPw
* zzzzzzz-0ooooo-0mmmmm-yzzz-0nnnnn-1yyyy10 (RV Repack)
* 000zzzz-oooooo-mmmmmm-zzzz-nnnnnn-0yyyyPw (XG3 Repack)
Couldn't fully unify the encoding space within a single clock cycle
though (within a reasonable cost budget).
At present, the decoder handling is to essentially unify the 32-bit
format for XG1/XG2/XG3 as XG2 with a few tag bits to disambiguate which
ISA decoding rules should apply for the 32-bit instruction word in
question. The other option would have been to normalize as XG3, but XG3 loses some minor functionality from XG1 and XG2.
I also decided against allowing RV and XG3 jumbo prefixes to be mixed.
Though, it is possible exceptions could be made.
Wouldn't have needed J52I if XG3 prefixes could have been used with RV
ops, but can't use XG3 prefixes in RV-C mode, which is part of why I
ended up resorting to the J52I prefix hack. But, still doesn't fully
address the issues that exist with hot-patching in this mode.
Though, looking at options, the "cheapest but fastest" option at present likely being:
Core that only does XG3, possibly dropping the RV encodings and re-
adding WEX in its place (though, in such an XG3-Only mode, the 10/11
modes would otherwise be identical in terms of encoding).
Or, basically, XG3 being used in a way more like how XG2 was used.
But, don't really want to create yet-more modes at the moment. XG3 being used as superscalar isn't too much more expensive, and arguably more flexible given the compiler doesn't need to be aware of pipeline
scheduling specifics, but can still make use of this when trying to
shuffle instructions around for efficiency (a mismatch will then merely result in a small reduction in efficiency rather than a potential
inability of the code to run; though for XG2 there was the feature that
the CPU could fall back to scalar or potential superscalar operation in cases where the compiler's bundling was incompatible with what the CPU allowed).
So, it is possible that in-order superscalar may be better as a general purpose option even if not strictly the cheapest option.
A case could maybe be made arguing for dropping back down to 32 GPRs
(with no FPRs) for more cheapness, but as-is, trying to do 128-bit SIMD stuff in RV64 mode also tends to quickly run into issues with register pressure.
Well, and I was just recently having to partly rework the mechanism for:
  v = (__vec4f) { x, y, z, w };
To not try to load all the registers at the same time, as this was occasionally running out of free dynamic registers with the normal RV
ABI (and 12 callee-save FPRs doesn't go quite so far when allocating
pairs of them), which effectively causes the compiler to break.
It is almost tempting to consider switching RV64 over to the XG3 ABI
when using SIMD, well, and/or not use SIMD with RV64 because it kinda
sucks worse than XG3.
But... Comparably, for the TKRA-GL front-end (using syscalls for the back-end), using runtime calls and similar for vector operations does
still put a big dent in the framerate for GLQuake (so, some sort of SIMD
in RV mode may still be needed even if "kinda inferior").
Handling suffixes at the end of a cache-line is not too bad if the
cache already handles instructions spanning a cache line. Assume the
maximum number of suffixes is present and ensure the cache-line is
wide enough. Or limit the number of suffixes so they fit into the half
cache-line used for spanning.
Difference:
With a prefix, you know in advance the prefix exists (the prefix is immediately visible);
With a suffix, one can only know it exists if it is visible.
So, it poses similar issues to those in making superscalar fetch work
across cache lines, which is made more of a challenge if one wants to
make superscalar across line boundaries work during I$ miss handling
rather than during instruction fetch.
But, if the logic is able to run at I$ miss time, ideally it can see
whether or not it is a single-wide or multi-wide instruction (say,
because this logic runs only on a single cache line at a time).
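The asymmetry can be sketched roughly in C. This is only an illustration of the argument above, not any real encoding: has_suffix() is a purely hypothetical predicate standing in for whatever opcode test a real ISA would use, and the line width and suffix limit are assumed values.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_WORDS 16          /* 64-byte cache line of 32-bit words */
#define MAX_SUFFIX  2          /* assumed worst-case suffix count */

/* Placeholder predicate: "does this op carry trailing constant words?" */
static bool has_suffix(uint32_t w) { return (w >> 30) == 3; }

/* May the op in slot 'i' be fully decoded without seeing the next line?
 * A prefix-style encoding would always answer yes from the word itself;
 * with suffixes, an op near the end of the line is only safe if all of
 * its (possible) suffix words still fall inside this line. */
static bool decode_safe(const uint32_t *line, int i)
{
    if (!has_suffix(line[i]))
        return true;                    /* plain op: always safe */
    return i + MAX_SUFFIX < LINE_WORDS; /* suffix words must be in-line */
}
```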
It is easier to handle interrupts with suffixes. The suffix can just
be treated as a NOP. Adjusting the position of the hardware interrupt
to the start of an instruction then does not have to worry about
accounting for a prefix / suffix.
Usual behavior is that when an interrupt occurs, SPC or similar always points at a valid PC, namely one from the pipeline, and usually/ideally,
the exact instruction on which the fault occurred (though this does get
a little fiddly with multiple instructions in the pipeline, and the CPU sometimes needs to figure out which stage corresponds to the fault in question; but this is usually more of an issue for I$ TLB misses, which often/usually trigger during a branch).
For the most part, superscalar works the same either way, with
similar efficiency. There is a slight efficiency boost if it would be
possible to dynamically reshuffle ops during fetch. But, this is not
currently a thing in my case.
This latter case would apply if, say, a MEM op is followed by non-
dependent ALU ops, which under current superscalar handling they will
not co-execute, but it could be possible in theory to swap the ops
and allow them to co-execute.
...
- anton
On 2026-01-06 5:42 p.m., BGB wrote:
On 12/31/2025 2:23 AM, Robert Finch wrote:
<snip>
One would argue that maybe prefixes are themselves wonky, but
otherwise one needs:
Instructions that can directly encode the presence of large
immediate values, etc;
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one
views the instruction stream as linear, whereas suffixes add
weirdness and are effectively retro-causal, and for any fetch to be
safe at the end of a cache line one would need to prove the
non-existence of a suffix; so better to not go there).
I agree with this. Prefixes seem more natural, large numbers
expanding to the left, suffixes seem like a big-endian approach. But
I use suffixes for large constants. I think with most VLI constant
data follows the instruction. I find constant data easier to work
with that way and they can be processed in the same clock cycle as a
decode so they do not add to the dynamic instruction count. Just pass
the current instruction slot plus a following area of the cache-line
to the decoder.
ID stage is likely too late.
For PC advance, ideally this needs to be known by the IF stage so that
we can know how to advance PC for the next clock-cycle (for the PF
stage).
Say:
  PF IF ID RF E1 E2 E3 WB
     PF IF ID RF E1 E2 E3 WB
        PF IF ID RF E1 E2 E3 WB
The PC advance works okay without knowing whether there is a suffix
present or not. The suffix is treated like a NOP instruction. There is
no decode required at the fetch stage. The PC can land on a suffix. It
just always advances by four (N) instructions unless there is a branch.
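The fixed-step scheme described above (advance PC by N slots no matter what; suffix slots simply decode as NOPs) can be sketched as below. This is a toy model: is_suffix() is a placeholder predicate, memory is indexed by slot rather than byte address, and FETCH_WIDTH is an assumed value.

```c
#include <stdint.h>
#include <stdbool.h>

#define FETCH_WIDTH 4  /* N instruction slots fetched per cycle */

/* Placeholder: "is this slot a constant suffix rather than an op?" */
static bool is_suffix(uint64_t slot) { return (slot >> 62) == 3; }

/* One fetch cycle: emit FETCH_WIDTH ops, turning suffix slots into
 * NOPs (0 here), and advance PC by a fixed step regardless of what was
 * fetched.  Only a branch would override the returned PC. */
static uint64_t fetch_cycle(const uint64_t *mem, uint64_t pc,
                            uint64_t *ops)
{
    for (int i = 0; i < FETCH_WIDTH; i++) {
        uint64_t slot = mem[pc + i];
        ops[i] = is_suffix(slot) ? 0 : slot;  /* suffix decodes as NOP */
    }
    return pc + FETCH_WIDTH;  /* fixed advance; no length decode at IF */
}
```

The point of the sketch is that fetch never inspects instruction lengths at all, which is what makes the scheme workable at the IF stage.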
Well, anyways, have since gotten SIMD in RV64 mode working "slightly
It is almost tempting to consider switching RV64 over to the XG3 ABI
when using SIMD, well, and/or not use SIMD with RV64 because it kinda
sucks worse than XG3.
But... Comparably, for the TKRA-GL front-end (using syscalls for the
back-end), using runtime calls and similar for vector operations does
still put a big dent in the framerate for GLQuake (so, some sort of
SIMD in RV mode may still be needed even if "kinda inferior").
Handling suffixes at the end of a cache-line is not too bad if the
cache already handles instructions spanning a cache line. Assume the
maximum number of suffixes is present and ensure the cache-line is
wide enough. Or limit the number of suffixes so they fit into the
half cache-line used for spanning.
Difference:
With a prefix, you know in advance the prefix exists (the prefix is
immediately visible);
With a suffix, one can only know it exists if it is visible.
Instead, with a prefix one only knows the instruction exists if it is visible. I do not think it makes much difference, except that it may be harder to decode an instruction and look for a prefix that comes before it.
Robert Finch <robfi680@gmail.com> posted:
<snip>
One would argue that maybe prefixes are themselves wonky, but otherwise
one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
This is the direction of My 66000.
The instruction stream is a linear stream of words.
The first word of each instruction encodes its total length.
What follows the instruction itself are merely constants used as
operands in the instruction itself. All constants are 1 or 2
words in length.
I would not call this mechanism "prefixed" or "suffixed". Generally,
prefixes and suffixes consume bits of the prefix/suffix so that
the constant (in my case) is not equal to the container size. This
leads to wonky operand/displacement sizes not equal to 2^(3+k).
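A toy model of this style of scheme, where the first word self-identifies the total length and the trailing words are whole constants: the field position and width used here (2 bits of trailing-word count in the top of the word) are purely illustrative, as the real encoding is not given in the thread.

```c
#include <stdint.h>

/* Total instruction extent in 32-bit words, judged from the first word
 * alone.  The "length field" placement is hypothetical; the idea being
 * modeled is only that length is self-identified and the constants are
 * full 32/64-bit containers, not bit-fields of a prefix or suffix. */
static int total_words(uint32_t first_word)
{
    int const_words = (first_word >> 30) & 3;  /* hypothetical field */
    return 1 + const_words;  /* 1 insn word + 0..3 constant words */
}
```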
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and are
effectively retro-causal, and for any fetch to be safe at the end of a
cache line one would need to prove the non-existence of a suffix; so
better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction.
But not "self identified".
I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current
instruction slot plus a following area of the cache-line to the decoder.
Handling suffixes at the end of a cache-line is not too bad if the cache
already handles instructions spanning a cache line. Assume the maximum
number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.
It is easier to handle interrupts with suffixes. The suffix can just be
treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.
I would have thought that the previous instruction (last one retired) would provide the starting point of the subsequent instruction. This way you don't have to worry about counting prefixes or suffixes.
On 1/6/2026 5:49 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
<snip>
One would argue that maybe prefixes are themselves wonky, but otherwise one needs:
Instructions that can directly encode the presence of large immediate
values, etc;
This is the direction of My 66000.
The instruction stream is a linear stream of words.
The first word of each instruction encodes its total length.
What follows the instruction itself are merely constants used as
operands in the instruction itself. All constants are 1 or 2
words in length.
I would not call this means "prefixed" or "suffixed". Generally,
prefixes and suffixes consume bits of the prefix/suffix so that
the constant (in my case) is not equal to container size. This
leads to wonky operand/displacement sizes not equal 2^(3+k).
OK.
As can be noted:
 XG2/3: Prefix scheme, 1/2/3 x 32-bit
   The 96-bit cases are determined by two prefixes.
   Requires looking at 2 words to know total length.
 RV64+Jx:
   Total length is known from the first instruction word:
     Base op: 32 bits;
     J21I: 64 bits;
     J52I: 96 bits.
   There was a J22+J22+LUI special case,
     but I now consider this as deprecated.
     J52I+ADDI is now considered preferable.
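The RV64+Jx case (total length judged from the first word alone) could be sketched roughly as follows. The JUMBO21_OP/JUMBO52_OP opcode values are placeholders only, since the actual J21I/J52I encodings are not given here; the low-2-bits compressed-op rule is the standard RISC-V one.

```c
#include <stdint.h>

#define JUMBO21_OP 0x1B  /* hypothetical major opcode for J21I */
#define JUMBO52_OP 0x3B  /* hypothetical major opcode for J52I */

/* Length in bytes of an instruction, from its first 32-bit word only.
 * (w & 3) != 3 marking a 16-bit op is real RV-C; the jumbo-prefix
 * opcodes are stand-ins for illustration. */
static int insn_len_bytes(uint32_t w)
{
    if ((w & 3) != 3)
        return 2;                  /* RV-C 16-bit op */
    switch (w & 0x7F) {
    case JUMBO21_OP: return 8;     /* J21I prefix + 32-bit op */
    case JUMBO52_OP: return 12;    /* J52I prefix + 32-bit op */
    default:         return 4;     /* plain 32-bit op */
    }
}
```

This is the property the XG2/3 prefix scheme lacks in its 96-bit case, where two words must be examined.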
As for Imm/Disp sizes:
 XG1: 9/33/57
 XG2 and XG3: 10/33/64
 RV+JX: 12/33/64
For XG1, the 57-bit size was rarely used and only optionally supported,
mostly because of the great gulf of immediate values between 34 and
62 bits.
Or, the use of suffix-encodings (which is IMHO worse than prefix
encodings; at least prefix encodings make intuitive sense if one views
the instruction stream as linear, whereas suffixes add weirdness and
are effectively retro-causal, and for any fetch to be safe at the end
of a cache line one would need to prove the non-existence of a suffix;
so better to not go there).
I agree with this. Prefixes seem more natural, large numbers expanding
to the left, suffixes seem like a big-endian approach. But I use
suffixes for large constants. I think with most VLI constant data
follows the instruction.
But not "self identified".
Yeah, if you can't know whether or not more instruction follows after
the first word by looking at the first word, this is a drawback.
Also, if you have to look at some special combination of register
specifiers and/or a lot of other bits, this is also a problem.
I find constant data easier to work with that
way and they can be processed in the same clock cycle as a decode so
they do not add to the dynamic instruction count. Just pass the current
instruction slot plus a following area of the cache-line to the decoder.
Handling suffixes at the end of a cache-line is not too bad if the cache
already handles instructions spanning a cache line. Assume the maximum
number of suffixes is present and ensure the cache-line is wide enough.
Or limit the number of suffixes so they fit into the half cache-line
used for spanning.
It is easier to handle interrupts with suffixes. The suffix can just be
treated as a NOP. Adjusting the position of the hardware interrupt to
the start of an instruction then does not have to worry about accounting
for a prefix / suffix.
I would have thought that the previous instruction (last one retired)
would provide the starting point of the subsequent instruction. This
way you don't have to worry about counting prefixes or suffixes.
Yeah.
My thinking is, typical advance:
 IF figures out how much to advance;
 Next instruction gets PC+Step.
Then interrupt:
 Figure out which position in the pipeline the interrupt starts from;
 Start there, flushing the rest of the pipeline;
 For a faulting instruction, this is typically the EX1 or EX2 stage.
   EX1 if it is a TRAP or SYSCALL;
   EX2 if it is a TLB miss or similar;
     Unless EX2 is not a valid spot (flush or bubble),
       then look for a spot that is not a flush or bubble.
     This case usually happens for branch-related TLB misses.
Usually EX3 or WB is too old, as it would mean re-running previous instructions.
Getting the exact stage-timing correct for interrupts is a little
fiddly, but worrying about prefix/suffix/etc issues with interrupts
isn't usually an issue, except that if somehow PC ended up pointing
inside another instruction, I would consider this a fault.
Usually for sake of branch-calculations in XG3 and RV, it is relative to
the BasePC before the prefix in the case of prefixed encodings. This
differs from XG1 and XG2 which defined branches relative to the PC of
the following instruction.
Though, this difference was partly due to a combination of
implementation reasons and for consistency with RISC-V (when using a
shared encoding space, makes sense if all the branches define PC displacements in a consistent way).
Though, there is the difference that XG3's branches use a 32-bit scale rather than a 16-bit scale. Well, and unlike RV's displacements, they
are not horrible confetti (*1).
*1: One can try to write a new RV decoder, and then place bets on
whether they will get JAL and Bcc encodings correct on the first try.
IME, almost invariably, one will screw these up in some way on the first attempt. Like, JAL's displacement encoding is "the gift that keeps on giving" in this sense.
Like, they were like:
 ADDI / Load:
   Yay, contiguous bits;
 Store:
   Well, swap the registers around and put the disp where Rd went.
 Bcc:
   Well, take the Store disp and just shuffle around a few more bits;
 JAL:
   Well, now there are some more bits, and Rd is back, ...
   Why not keep some of the bits from Bcc,
     but stick everything else in random places?...
   Well, I guess some bits share the same relative positions as in
     LUI, but, ...
Not perfect in XG3 either, but still:
  { opw[5] ? 11'h7FF : 11'h000, opw[11:6], opw[31:16] }
Is nowhere near the same level of nasty...
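For comparison, the JAL "confetti" can be made concrete. A minimal C sketch of reassembling the J-type immediate, straight from the bit layout in the RISC-V spec:

```c
#include <stdint.h>

/* Reassemble the RISC-V J-type (JAL) immediate from a 32-bit
 * instruction word.  Per the RV spec, the bits are scattered as:
 *   imm[20]    = inst[31]
 *   imm[10:1]  = inst[30:21]
 *   imm[11]    = inst[20]
 *   imm[19:12] = inst[19:12]
 *   imm[0]     = 0 (targets are always even)
 * The result is sign-extended from bit 20. */
static int32_t jal_imm(uint32_t inst)
{
    uint32_t imm =
        (((inst >> 31) & 0x1)   << 20) |  /* imm[20]    */
        (((inst >> 21) & 0x3FF) << 1)  |  /* imm[10:1]  */
        (((inst >> 20) & 0x1)   << 11) |  /* imm[11]    */
        (((inst >> 12) & 0xFF)  << 12);   /* imm[19:12] */
    return (int32_t)(imm << 11) >> 11;    /* sign-extend from bit 20 */
}
```

Four separate extract-and-place steps for one 21-bit field, which is the complaint in a nutshell.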
Well, never mind if the actual decoder has the reverse issue:
In the VL core and JX2VM, it was repacked back into XG2 form
internally, which means a little bit of hair going on here. Also, I
was originally going to relocate it in the encoding space, but ended up
moving back to its original location for various reasons (mostly due to
sharing the same decoder) having BRA/BSR in two different locations
would have effectively burned more encoding space than just leaving it
where it had been in XG1/XG2 (even if having BRA/BSR in the F0 block is "kinda stupid" given it is "very much not a 3R instruction", but, ...).
At least in most other instructions, the imm/disp bits remain
contiguous. I instead differed by making the Rn/Rd spot be used as a
source register by some instructions (taking on the role of Rt/Rs2), but
IMO this is the lesser of two evils. Would rather have an Rd that is sometimes a source, than imm/disp fields that change chaotically from
one instruction to another.
...
I attempted to construct an architecture with ordinary variable-length instructions, based on the Concertina II architecture, but with register banks of 16 registers instead of 32. However, it didn't work out; there wasn't enough opcode space for the 16-bit register-to-register
instructions. This surprised me; maybe I can figure something out.
John Savard
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
The 2-operand+displacement LD/STs have a lock bit in the instruction--that
John Savard <quadibloc@invalid.invalid> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side)  needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations                  needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the
constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no
different than any other calculation, except that no mangling of the
bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you would
know all about that, there must be a reason why it doesn't apply in these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, etc...
It is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and it's ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
impl. If we are on another system and that LOCK XADD is some sort of
LL/SC "style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size [address]
ADD Rd,--,#1
Will attempt an ATOMIC add to L1 cache. If line is writeable, ADD is
performed and line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state the add is performed and
the previous value returned. If [address] is not writeable the cache
line in invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is typical.}
When [address] reaches the Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under contention. So you DON'T DO IT LIKE THAT.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
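From the software side, the operation under discussion is just C11's atomic_fetch_add: one RMW with no retry loop in the source. Whether the hardware realizes it as LOCK XADD, an LL/SC loop, or a remote atomic-add-to-memory as described above is invisible at this level, which is exactly the "SW considers it wait free" point:

```c
#include <stdatomic.h>

/* A shared counter bumped atomically.  On x86-64 a compiler emits
 * LOCK XADD for this; on a My 66000-style machine it would map to the
 * atomic-add-to-memory form, with the add performed wherever the line
 * is found in the hierarchy. */
static atomic_long counter = 0;

/* Returns the previous value, like XADD. */
static long bump(void)
{
    return atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst);
}
```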
On 12/28/25 6:53 PM, MitchAlsup wrote:
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "bus" traffic under contention. So you DON'T DO IT LIKE THAT.
LL-op-SC could be recognized as an idiom and avoid bringing data
to the core.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
I wonder if there is an issue of communicating intention to the
computer. Using atomic-to-memory may be intended to communicate
that the operation is expected to be under contention or that
moderating the impact under high contention is more important
than having a fast "happy path".
This seems to be similar to branch hints and predication in that
urging the computer to handle the task in a specific way may not
be optimal for the goal of the user/programmer.
A programmer
might use predication to avoid a branch that is expected to be
poorly predicted or to have more consistent execution time. The
former could be inappropriate for the computer to obey if the
branch predictor became effective for that branch. If prediction
is accurate, predicate prediction could improve performance but
would break execution time consistency. Even reducing execution
time when the predicate is known early might go against the
programmer's intent by leaking information.
If the requesting core has the cache block in exclusive or
modified state, remote execution might be less efficient. Yet it
may also be possible that the block is expected to be moved to
another core such that pushing the data manipulation to a
"centralized" location would improve performance as was the
programmer intent (rather than avoiding contention overhead). (I
suppose in My 66000, a programmer would use a push/"prefetch"
instruction to move the cache block to a more central location,
but even that might be sub-optimal if the hardware can predict
the next data consumer such that centrally located would be
slower than shifting the data closer to the consumer.)
If the contention is from false sharing (having multiple atomic
data in a cache block seems to be considered bad programming
practice, so this should not be common unless cache block size
grows), hardware could theoretically provide special word caches
(or "lossy" block compression where part of the block is dropped)
for moderating the impact of false sharing. This would change the optimization preferences for the program (more compact data might
be preferred if false sharing is less of a problem).
I do not know what the best interface would be, but it seems that
some care should be taken to account for differing intent when a
programmer suggests a specific mechanism. This also gets into the distinction/spectrum/space between a hint and a directive. Both
hints and directives can have unexpected performance changes
under different microarchitectures or different usage.
Clever microarchitecture can make some optimizations sub-optimal
as well as cause programmers to ask "why didn't you do it the way
I told you to do it?!"
(I think I may not be communicating well. I am kind of tired
right now and this topic is complicated.)
Paul Clayton <paaronclayton@gmail.com> posted:
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "bus" traffic under contention. So you DON'T DO IT LIKE THAT.
LL-op-SC could be recognized as an idiom and avoid bringing data
to the core.
Can recognize:
LDL Rd,[address]
ADD Rd,Rd,#whatever
STC Rd,[address]
Cannot recognize:
LDA R1,[address]
CALL LoadLocked
ADD R2,R2,#whatever
CALL StoreConditional
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
I wonder if there is an issue of communicating intention to the
computer. Using atomic-to-memory may be intended to communicate
that the operation is expected to be under contention or that
moderating the impact under high contention is more important
than having a fast "happy path".
There is a speed of light problem here. Communicating across a
computer <system> is a microsecond time problem, whereas executing instructions is a nanosecond time problem.
And this is exactly where Add-to-Memory gains over Interferable
ATOMIC events--you only pay the latency once, now while the latency
is higher than possible with LL-SC, it is WAY LOWER than worst case
with LL-SC under serious contention.
This seems to be similar to branch hints and predication in that
urging the computer to handle the task in a specific way may not
be optimal for the goal of the user/programmer.
Explain
A programmer
might use predication to avoid a branch that is expected to be
poorly predicted or to have more consistent execution time. The
My 66000 predication can avoid 2 branches--it operates under the
notion that if FETCH reaches the join point before condition is
known, then predication is always faster than branching.
former could be inappropriate for the computer to obey if the
branch predictor became effective for that branch. If prediction
is accurate, predicate prediction could improve performance but
would break execution time consistency. Even reducing execution
time when the predicate is known early might go against the
programmer's intent by leaking information.
There is no reason not to predict My 66000-style predication,
nor is there any great desire/need TO predict them, either.
If the requesting core has the cache block in exclusive or
modified state, remote execution might be less efficient. Yet it
may also be possible that the block is expected to be moved to
another core such that pushing the data manipulation to a "centralized" location would improve performance as was the
programmer intent (rather than avoiding contention overhead). (I
suppose in My 66000, a programmer would use a push/"prefetch"
instruction to move the cache block to a more central location,
but even that might be sub-optimal if the hardware can predict
the next data consumer such that centrally located would be
slower than shifting the data closer to the consumer.)
I have been betting that (in the general case) software will
remain incapable of doing such predictions, for quite some time.
If the contention is from false sharing (having multiple atomic
data in a cache block seems to be considered bad programming
practice, so this should not be common unless cache block size
grows), hardware could theoretically provide special word caches
(or "lossy" block compression where part of the block is dropped)
for moderating the impact of false sharing. This would change the optimization preferences for the program (more compact data might
be preferred if false sharing is less of a problem).
I do not know what the best interface would be, but it seems that
some care should be taken to account for differing intent when a programmer suggests a specific mechanism. This also gets into the distinction/spectrum/space between a hint and a directive. Both
hints and directives can have unexpected performance changes
under different microarchitectures or different usage.
Clever microarchitecture can make some optimizations sub-optimal
as well as cause programmers to ask "why didn't you do it the way
I told you to do it?!"
Instruction scheduling is in effect an optimization for one implementation that may not hold for other implementations. In-order pipelines need instruction scheduling, OoO pipelines do not, and GBOoO generally performs better
if/when instructions are less (or not) scheduled.
Loop unrolling is a case where if your machine has vVM the code
runs faster when not {unrolled and executed with branch prediction}
than when {}.
(I think I may not be communicating well. I am kind of tired
right now and this topic is complicated.)
John Savard <quadibloc@invalid.invalid> writes:
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
No.
If you flipped "decode" and "execute" in that sentence above, I would 100% agree. And maybe this _is_ just a typo.
But if you actually did mean that sentence exactly as written, I would
disagree. This is why: I regard executing instructions as 'doing the
actual work' and decoding instructions as... some unfortunate trivial
overhead that can't be avoided.
It does not matter what "the actual work" is and what isn't. What
matters is how expensive it is to make a particular part wider, and
how paying that cost benefits the IPC. At every step you add width to
the part with the best benefit/cost ratio.
And looking at recent cores, we see that, e.g., Skymont can decode
3x3=9 instructions per cycle, rename 8 per cycle, has 26 ports to
functional units (i.e., can execute 26 uops in one cycle); I don't
know how many instructions it can retire per cycle, but I expect that
it is more than 8 per cycle.
So the renamer is the bottleneck, and that's also the idea behind
top-down microarchitecture analysis (TMA) for determining how software interacts with the microarchitecture. That idea is coming out of
Intel, but if Intel is finding it hard to make wider renamers rather
than wider other parts, I expect that the rest of the industry also
finds that hard (especially for architectures where decoding is
cheaper), and (looking at ARM A64) where instructions with more
demands on the renamer exist.
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
Paul Clayton <paaronclayton@gmail.com> posted:
LL-op-SC could be recognized as an idiom and avoid bringing data
to the core.
Can recognize:
LDL Rd,[address]
ADD Rd,Rd,#whatever
STC Rd,[address]
Cannot recognize:
LDA R1,[address]
CALL LoadLocked
ADD R2,R2,#whatever
CALL StoreConditional
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not
Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
I wonder if there is an issue of communicating intention to the
computer. Using atomic-to-memory may be intended to communicate
that the operation is expected to be under contention or that
moderating the impact under high contention is more important
than having a fast "happy path".
There is a speed of light problem here. Communicating across a
computer <system> is a microsecond time problem, whereas executing
instructions is a nanosecond time problem.
And this is exactly where Add-to-Memory gains over Interferable
ATOMIC events--you only pay the latency once, now while the latency
is higher than possible with LL-SC, it is WAY LOWER than worst case
with LL-SC under serious contention.
This seems to be similar to branch hints and predication in that
urging the computer to handle the task in a specific way may not
be optimal for the goal of the user/programmer.
Explain
There is no reason not to predict My 66000-style predication,
nor is there any great desire/need TO predict them, either.
If the requesting core has the cache block in exclusive or
modified state, remote execution might be less efficient. Yet it
In My 66000 architecture, an atomic operation on memory will have
the property of being executed where the cache line in a modifiable
state happens to reside. Here you don't want to "bring data in"
nor do you want to "push data out" you want the memory hierarchy
to be perturbed as little as possible WHILE performing the ATOMIC
event.
The cost is that each cache will have an adder--with the sizes of
cache these days, said adder is a trifling in area and seldom used
so it is low power.
may also be possible that the block is expected to be moved to
another core such that pushing the data manipulation to a
"centralized" location would improve performance as was the
programmer intent (rather than avoiding contention overhead). (I
suppose in My 66000, a programmer would use a push/"prefetch"
instruction to move the cache block to a more central location,
but even that might be sub-optimal if the hardware can predict
the next data consumer such that centrally located would be
slower than shifting the data closer to the consumer.)
I have been betting that (in the general case) software will
remain incapable of doing such predictions, for quite some time.
Clever microarchitecture can make some optimizations sub-optimal
as well as cause programmers to ask "why didn't you do it the way
I told you to do it?!"
Instruction scheduling is in effect an optimization for one implementation
that may not hold for other implementations. In-order pipelines need
instruction scheduling, OoO pipelines do not, and GBOoO generally performs better
if/when instructions are less (or not) scheduled.
Loop unrolling is a case where if your machine has vVM the code
runs faster when not {unrolled and executed with branch prediction}
than when {}.
On 2/5/26 4:27 PM, MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
Paul Clayton <paaronclayton@gmail.com> posted:
[snip]
LL-op-SC could be recognized as an idiom and avoid bringing data
to the core.
Can recognize:
LDL Rd,[address]
ADD Rd,Rd,#whatever
STC Rd,[address]
Cannot recognize:
LDA R1,[address]
CALL LoadLocked
ADD R2,R2,#whatever
CALL StoreConditional
When would one want to decouple LL and SC into function calls
away from the computation? Perhaps for in-place software
instrumentation?
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not
Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
I wonder if there is an issue of communicating intention to the
computer. Using atomic-to-memory may be intended to communicate
that the operation is expected to be under contention or that
moderating the impact under high contention is more important
than having a fast "happy path".
There is a speed of light problem here. Communicating across a
computer <system> is a microsecond time problem, whereas executing
instructions is a nanosecond time problem.
And this is exactly where Add-to-Memory gains over Interferable
ATOMIC events--you only pay the latency once, now while the latency
is higher than possible with LL-SC, it is WAY LOWER than worst case
with LL-SC under serious contention.
Yes. Even a single adder would have higher throughput than ping-
ponging a cache block. One might even support a three-or-more
input, two-or-more result adder (or perhaps exploit usually
smaller addends) to increase throughput, though I suspect there
would practically never be a case where a simple adder would
have insufficient throughput.
This seems to be similar to branch hints and predication in that
urging the computer to handle the task in a specific way may not
be optimal for the goal of the user/programmer.
Explain
Branch hints can be intended to reduce branch predictor aliasing
(i.e., assume the static hint is used instead of a dynamic
predictor), to provide agree information, to prefer one path
even if it is less likely, to provide an initialization of the
(per-address component only?) branch predictor, or for some
other motive. The interface/architecture might not be specific
about how such information will be used, especially if it is a
hint (and programmers might disagree about what the best
interface would be). If the interface is not very specific, a microarchitecture might violate a programmer's
desire/expectation by ignoring the hint or using it in a
different way.
Similarly predication can be motivated to avoid fetch
redirection (initial ARM and My 66000), to facilitate constant
time execution, to avoid the performance cost of branch
mispredictions, or perhaps for some reason that does not come to
mind. Predicate prediction would foil constant time execution
and might reduce performance (or merely introduce weird
performance variation). Even the fetch optimization might be
undone if the hardware discovers that the condition is extremely
biased and folds out the rarely used instructions; which would
be good for performance if the bias continues, but if the bias
changes just frequently enough it could hurt performance.
[snip]
There is no reason not to predict My 66000-style predication,
nor is there any great desire/need TO predict them, either.
Except that prediction could violate the time constancy assumed
by the programmer.
Paul Clayton <paaronclayton@gmail.com> posted:
On 2/5/26 4:27 PM, MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
Paul Clayton <paaronclayton@gmail.com> posted:
[snip]
LL-op-SC could be recognized as an idiom and avoid bringing data
to the core.
Can recognize:
LDL Rd,[address]
ADD Rd,Rd,#whatever
STC Rd,[address]
Cannot recognize:
LDA R1,[address]
CALL LoadLocked
ADD R2,R2,#whatever
CALL StoreConditional
When would one want to decouple LL and SC into function calls
away from the computation? Perhaps for in-place software
instrumentation?
Write, in pure K&R C, the functionality for LoadLocked and
StoreConditional.
[snip]
There is no reason not to predict My 66000-style predication,
nor is there any great desire/need TO predict them, either.
Except that prediction could violate the time constancy assumed
by the programmer.
Time constancy is provided by executing both the then clause and the else clause
and using CMOV to decide on flow.
On 2/9/26 2:28 PM, MitchAlsup wrote:
This means ordinary My 66000 predication cannot be used for
such. I have no idea whether writers of cryptographic software
would be upset that a mechanism which seems like it would
provide constant time is documented not to do so. (I would
_guess_ that most of the embedded strong guarantee real time
software might select hardware that never does predict
predication since caches and similar general purpose
optimizations tend to increase the difficulty of providing
strong timing guarantees.)
On 2/9/26 2:28 PM, MitchAlsup wrote:
Paul Clayton <paaronclayton@gmail.com> posted:
On 2/5/26 4:27 PM, MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
Paul Clayton <paaronclayton@gmail.com> posted:
[snip]
LL-op-SC could be recognized as an idiom and avoid bringing data
to the core.
Can recognize:
LDL Rd,[address]
ADD Rd,Rd,#whatever
STC Rd,[address]
Cannot recognize:
LDA R1,[address]
CALL LoadLocked
ADD R2,R2,#whatever
CALL StoreConditional
When would one want to decouple LL and SC into function calls
away from the computation? Perhaps for in-place software
instrumentation?
Write, in pure K&R C, the functionality for LoadLocked and StoreConditional.
Then why would the compiler not inline such? It seems reasonable
(to me) to blame poor scaling performance in that case on the
compiler which did not inline a one instruction function (or on
the developer who intentionally disabled such optimization).
[snip]
[snip]
There is no reason not to predict My 66000-style predication,
nor is there any great desire/need TO predict them, either.
Except that prediction could violate the time constancy assumed
by the programmer.
Time constancy is provided by executing both the then clause and the else clause and using CMOV to decide on flow.
This means ordinary My 66000 predication cannot be used for
such. I have no idea whether writers of cryptographic software
would be upset that a mechanism which seems like it would
provide constant time is documented not to do so.
(I would
_guess_ that most of the embedded strong guarantee real time
software might select hardware that never does predict
predication since caches and similar general purpose
optimizations tend to increase the difficulty of providing
strong timing guarantees.)
Is My 66000 going to have CMOV?
What about something like some
of the AArch64 simple conditional operations? (Yes, such gets
un-RISCy both from count of defined instructions and instruction
complexity, but I have read that the supported operations are
somewhat common for single instruction hammock branches and are
easier to fit microarchitecturally and to fit within a 32-bit
instruction.) The code bloat (8 bytes vs. 4 bytes) and indirect
encoding (i.e., vagaries of idiom recognition -- though an
encoded instruction can be implemented with performance that
breaks expectations) are disadvantages of just using PRED, but
PRED also removes artificial restrictions like only incrementing
by 1 (such that idiom recognition could be implemented to fast
path any conditional add_immediate).
One interesting case for predication is code like the following:
if (a < 25) {a += 8;}
The predicate is known to be available at least early enough to
nullify the addition executed in parallel with the compare
because the register operand is the same and the only other
operands are immediates. It is not clear if this presents a
useful optimization opportunity, but it seems an interesting
case.
I am also curious about what you view as the trade-offs of
conditional move compared with conditional select. Obviously
conditional select potentially includes a register copy and for
typical OoO implementations the mechanism is similar because a
conditional move renames the destination and reads the old value
and the alternative value. Encoding the condition source may be
harder for conditional select (e.g., test register A is/is not
zero, based on test place register B or register C in register
D, which requires four register names while conditional move
exploits that one of the sources is the same as the
destination); for My 66000 a generalized condition after a
comparison would, I think, require a 6-bit condition bit
specifier and a register source for the condition, which just
fits in a 32-bit instruction (6-bit opcode, 6-bit condition bit
specifier, four 5-bit register names).
Paul Clayton <paaronclayton@gmail.com> posted:
<snip>
On 2/9/26 2:28 PM, MitchAlsup wrote:
Write, in pure K&R C, the functionality for LoadLocked and
StoreConditional.
Then why would the compiler not inline such? It seems reasonable
(to me) to blame poor scaling performance in that case on the
compiler which did not inline a one instruction function (or on
the developer who intentionally disabled such optimization).
Because in pure K&R C there is no concept of atomic things, and
thus one has to resort to ASM--beyond this, there is no concept
of inlining.
I do believe that it would be possible in K&R C.
Create a static array of bytes with the appropriate
object code to perform the operation followed
by a RET instruction. Cast the address
of the array to a function pointer and call it.
Or take the address of a branch label in an unreachable
point in a function (e.g. after a return), overwrite
the text at that label with the necessary object code
then branch to it (text was generally RW in the K&R days,
eg on the PDP-11).
According to Scott Lurndal <slp53@pacbell.net>:
I do believe that it would be possible in K&R C.
Create a static array of bytes with the appropriate
object code to perform the operation followed
by a RET instruction. Cast the address
of the array to a function pointer and call it.
Or take the address of a branch label in an unreachable
point in a function (e.g. after a return), overwrite
the text at that label with the necessary object code
then branch to it (text was generally RW in the K&R days,
eg on the PDP-11).
Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
/44, user programs had separate instruction and data mappings. The
smaller ones, the 11/34, /40, and /60 didn't.
According to Scott Lurndal <slp53@pacbell.net>:
I do believe that it would be possible in K&R C.
Create a static array of bytes with the appropriate
object code to perform the operation followed
by a RET instruction. Cast the address
of the array to a function pointer and call it.
Or take the address of a branch label in an unreachable
point in a function (e.g. after a return), overwrite
the text at that label with the necessary object code
then branch to it (text was generally RW in the K&R days,
eg on the PDP-11).
Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
/44, user programs had separate instruction and data mappings. The
smaller ones, the 11/34, /40, and /60 didn't.
We knew it was possible to do runtime code generation, since it's an
ancient technique widely used by sort programs to speed up the inner comparison loop, but we had enough trouble getting our programs to
work the normal way.
John Levine wrote:
According to Scott Lurndal <slp53@pacbell.net>:
I do believe that it would be possible in K&R C.
Create a static array of bytes with the appropriate
object code to perform the operation followed
by a RET instruction. Cast the address
of the array to a function pointer and call it.
Or take the address of a branch label in an unreachable
point in a function (e.g. after a return), overwrite
the text at that label with the necessary object code
then branch to it (text was generally RW in the K&R days,
eg on the PDP-11).
Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
/44, user programs had separate instruction and data mappings. The
smaller ones, the 11/34, /40, and /60 didn't.
We knew it was possible to do runtime code generation, since it's an ancient technique widely used by sort programs to speed up the inner comparison loop, but we had enough trouble getting our programs to
work the normal way.
40 years ago my own sort functions/programs would instead either extract
the compound keys into a form that could be compared directly, or
re-write (with a reversible transform) the data in place, so that it was trivially comparable.
The final option was to extract a short prefix (8-bytes) of the key data
and accept that the full sort would only deliver a partially sorted
result, so that I had to do a final sort pass over each chunk of equal prefix.
This could well be faster than a complicated sort operator, even if it
was run-time generated, particularly since that runtime code would need
to be called.
Terje
According to Terje Mathisen <terje.mathisen@tmsw.no>:
We knew it was possible to do runtime code generation, since it's an
ancient technique widely used by sort programs to speed up the inner
comparison loop, but we had enough trouble getting our programs to
work the normal way.
40 years ago my own sort functions/programs would instead either extract
the compound keys into a form that could be compared directly, or
re-write (with a reversible transform) the data in place, so that it was
trivially comparable.
The final option was to extract a short prefix (8-bytes) of the key data
and accept that the full sort would only deliver a partially sorted
result, so that I had to do a final sort pass over each chunk of equal
prefix.
I'm pretty sure you'll find all of those in sort programs from the 1960s. Sorting used more computer time than anything else so they put a great
deal of work into making it fast.
See, for example:
https://bitsavers.org/pdf/ibm/360/os/R01-08/C28-6543-3_Sort_Merge_Feb67.pdf
According to Terje Mathisen <terje.mathisen@tmsw.no>:
We knew it was possible to do runtime code generation, since it's an
ancient technique widely used by sort programs to speed up the inner
comparison loop, but we had enough trouble getting our programs to
work the normal way.
40 years ago my own sort functions/programs would instead either extract
the compound keys into a form that could be compared directly, or
re-write (with a reversible transform) the data in place, so that it was trivially comparable.
The final option was to extract a short prefix (8-bytes) of the key data and accept that the full sort would only deliver a partially sorted result, so that I had to do a final sort pass over each chunk of equal prefix.
I'm pretty sure you'll find all of those in sort programs from the 1960s. Sorting used more computer time than anything else so they put a great
deal of work into making it fast.
I'm pretty sure you'll find all of those in sort programs from the 1960s. Sorting used more computer time than anything else so they put a great
deal of work into making it fast.
Indeed, the sort subsystem on the Burroughs systems was, perhaps,
even more important that the MCP itself to a lot of customers.
Although large sorts in those days were done using magnetic tapes.
(watching a large sort-merge running on a dozen 9-track drives
was pretty impressive).
According to Scott Lurndal <slp53@pacbell.net>:
I'm pretty sure you'll find all of those in sort programs from the 1960s. Sorting used more computer time than anything else so they put a great deal of work into making it fast.
Indeed, the sort subsystem on the Burroughs systems was, perhaps,
even more important that the MCP itself to a lot of customers.
Although large sorts in those days were done using magnetic tapes.
(watching a large sort-merge running on a dozen 9-track drives
was pretty impressive).
Yup. I suspect that the read backward feature on tape drives was invented to speed up sorting.
My "Algorithms and Data Structures" professor told us that IBM had
patented a zero-time sort chip, which they then promptly buried since
their internal stats said that over all IBM mainframes, 60% of the CPU
hours were spent in various forms of sorting.
Terje
PS. The chip was a ladder of comparators, so that you could use a
channel to stream data into it, then when you were done (or the chip was full) you would run the channel transfer in the opposite direction to retrieve the sorted data.
In article <10ml9ef$264n$1@gal.iecc.com>, johnl@taugh.com (John Levine) wrote:
I suspect that the read backward feature on tape drives was
invented to speed up sorting, when someone realized that if
you'd written a set of tapes, you could read them backward
and merge into the next set of tapes in reverse order, repeat
until done.
Emerson W Pugh's _IBM's 360 & Early 370 Systems_ says that was the reason. <https://www.amazon.co.uk/dp/0262517205>
It appears that Terje Mathisen <terje.mathisen@tmsw.no> said:
My "Algorithms and Data Structures" professor told us that IBM had >patented a zero-time sort chip, which they then promptly buried since >their internal stats said that over all IBM mainframes, 60% of the CPU >hours were spent in various forms of sorting.
Terje
PS. The chip was a ladder of comparators, so that you could use a
channel to stream data into it, then when you were done (or the chip was full) you would run the channel transfer in the opposite direction to retrieve the sorted data.
Sounds like an urban legend to me. For one thing, on machines of that
era no useful sort fit in core so it spent most of its time doing disk
and tape I/O anyway.
For another, if it really did make a difference, they would have
called it the Improved Sort Performance Facility and sold it as an
extra cost option.
These days z/Series has the optional Enhanced-Sort Facility, added in September 2020, which adds a complex SORT LISTS instruction that does in-memory sorts or merges, presumably in microcode so it can run at full
memory speed.
According to Terje Mathisen <terje.mathisen@tmsw.no>:
We knew it was possible to do runtime code generation, since it's an
ancient technique widely used by sort programs to speed up the inner
comparison loop, but we had enough trouble getting our programs to
work the normal way.
40 years ago my own sort functions/programs would instead either extract
the compound keys into a form that could be compared directly, or
re-write (with a reversible transform) the data in place, so that it was
trivially comparable.
The final option was to extract a short prefix (8-bytes) of the key data
and accept that the full sort would only deliver a partially sorted
result, so that I had to do a final sort pass over each chunk of equal
prefix.
I'm pretty sure you'll find all of those in sort programs from the 1960s.
Sorting used more computer time than anything else so they put a great
deal of work into making it fast.
See, for example:
https://bitsavers.org/pdf/ibm/360/os/R01-08/C28-6543-3_Sort_Merge_Feb67.pdf
Terje Mathisen <terje.mathisen@tmsw.no> posted:
John Levine wrote:
According to Scott Lurndal <slp53@pacbell.net>:
I do believe that it would be possible in K&R C.
Create a static array of bytes with the appropriate
object code to perform the operation followed
by a RET instruction. Cast the address
of the array to a function pointer and call it.
Or take the address of a branch label in an unreachable
point in a function (e.g. after a return), overwrite
the text at that label with the necessary object code
then branch to it (text was generally RW in the K&R days,
eg on the PDP-11).
Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
/44, user programs had separate instruction and data mappings. The
smaller ones, the 11/34, /40, and /60 didn't.
We knew it was possible to do runtime code generation, since it's an
ancient technique widely used by sort programs to speed up the inner
comparison loop, but we had enough trouble getting our programs to
work the normal way.
40 years ago my own sort functions/programs would instead either extract
the compound keys into a form that could be compared directly, or
re-write (with a reversible transform) the data in place, so that it was
trivially comparable.
40 years ago, Duff's device was considered a reasonable way to improve
performance. It is now considered a means to take over applications
and machines (i.e., a virus).
John Levine <johnl@taugh.com> posted:
It appears that Terje Mathisen <terje.mathisen@tmsw.no> said:
My "Algorithms and Data Structures" professor told us that IBM had
patented a zero-time sort chip, which they then promptly buried since
their internal stats said that over all IBM mainframes, 60% of the CPU
hours were spent in various forms of sorting.
Terje
PS. The chip was a ladder of comparators, so that you could use a
channel to stream data into it, then when you were done (or the chip was
full) you would run the channel transfer in the opposite direction to
retrieve the sorted data.
Sounds like an urban legend to me. For one thing, on machines of that
era no useful sort fit in core so it spent most of its time doing disk
and tape I/O anyway.
I remember hearing about the chip (circa 1986-8).
I agree that the number of entries would not have been enough to justify it.
I also agree that interrupt response time after the I/O would have
prevented any useful performance advantage.
By the mid-1990s the entry count could have been large enough to justify it.
For another, if it really did make a difference, they would have
called it the Improved Sort Performance Facility and sold it as an
extra cost option.
These days z/Series has the optional Enhanced-Sort Facility added in
September 2020 which adds a complex SORT LISTS instruction that does
in-memory sorts or merges presumably in microcode so it can run at full
memory speed.
40 years ago, Duff's device was considered a reasonable way to improve
performance. It is now considered a means to take over applications
Was it ever that?
Yes, it was a very compact way to express the same logic you would use
in asm to jump into the correct spot of an unrolled loop, but did it
actually save any time compared to a plain unroll followed by a final fixup?
Duff would never compile into anything strange, but as a way to
obfuscate it could work I guess? Absolutely no need to flag it as
malware/virus!
On Wed, 11 Feb 2026 22:11:09 GMT, MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
40 years ago, Duff's device was considered a reasonable way to improve
performance. It is now considered a means to take over applications
and machines (i.e., a virus).
I don't know that it ever really improved performance, but it still is
useful. I use the switch() form for backing out of situations having
complicated setup where processing isn't possible unless all the setup
steps succeed.
I keep track of how many setup steps successfully complete, and then,
whatever the exit situation, I jump into a Duff switch that tears it
all down in reverse order of the setup.
George Neuner <gneuner2@comcast.net> writes:
On Wed, 11 Feb 2026 22:11:09 GMT, MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
40 years ago, Duff's device was considered a reasonable way to improve
performance. It is now considered a means to take over applications
and machines (i.e., a virus).
I don't know that it ever really improved performance, but it still is
useful. I use the switch() form for backing out of situations having
complicated setup where processing isn't possible unless all the setup
steps succeed.
I keep track of how many setup steps successfully complete, and then,
whatever the exit situation, I jump into a Duff switch that tears it
all down in reverse order of the setup.
This seems to miss an important aspect of the Duff switch: the
possibility of putting case labels inside a control structure. Duff's
device uses a while starting before the first case label, but you can do
other strange things, like jumping into an if:
switch(i) {
case 1:
    if (j == 3) {
case 2:
        printf("i == 2 || (i == 1 && j == 3)\n");
    }
}
or is your tear-down complex enough to use such things?
Yours,
John Levine <johnl@taugh.com> writes:
These days z/Series has the optional Enhanced-Sort Facility added in
September 2020 which adds a complex SORT LISTS instruction that does
in-memory sorts or merges presumably in microcode so it can run at full
memory speed.
If the microarchitecture is worth anything, microcode does not run any
faster than architectural code, when doing the same operations.
One possible reason for the additional instruction is that there is a
hardware feature that provides a speedup; in that case the hardware
feature, not the microcode, is the reason for the speedup. Is there
such a hardware feature for SORT LISTS, and if so, what is it?
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
John Levine <johnl@taugh.com> writes:
These days z/Series has the optional Enhanced-Sort Facility added in
September 2020 which adds a complex SORT LISTS instruction that does
in-memory sorts or merges presumably in microcode so it can run at full
memory speed.
If the microarchitecture is worth anything, microcode does not run any
faster than architectural code, when doing the same operations.
One possible reason for the additional instruction is that there is a
hardware feature that provides a speedup; in that case the hardware
feature, not the microcode, is the reason for the speedup. Is there
such a hardware feature for SORT LISTS, and if so, what is it?
It says here it's a combination of hardware and microcode, but the
references are all broken or paywalled. It also says that the instruction
can run for quite a while, so it might be that the microcode can run the
memory at full speed better than ordinary instructions can.
https://blog.share.org/Technology-Article/peeking-under-the-hood-of-sort-acceleration-on-z15
This isn't their first sort speedup. A while ago they added instructions
that do the inner loop of heapsort. Not sure whether there's hardware
for that.
One possible reason for the additional instruction is that there is a
hardware feature that provides a speedup; in that case the hardware
feature, not the microcode, is the reason for the speedup. Is there
such a hardware feature for SORT LISTS, and if so, what is it?
I suspect that the read backward feature on tape drives was
invented to speed up sorting, when someone realized that if
you'd written a set of tapes, you could read them backward
and merge into the next set of tapes in reverse order, repeat
until done.