Forum: Too Lazy BBS

Concertina III Once Again

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 14 08:54:51 2026

From Newsgroup: comp.arch

After realizing that Mitch Alsup was right in that there was no real
benefit in speeding up instruction decode in the manner I was trying to achieve with the use of block headers, I had tried, by going from banks of
32 registers to banks of 16 registers, to move to variable-length instructions.

For some reason, though, I couldn't make it work. It seemed like it
should, but I couldn't get the 16-bit instructions to fit.

Well, I've made another attempt. And it seems like going to banks of 16 registers is indeed sufficient (retaining, from Concertina II, the
artifice of only using seven registers as base registers and another seven
as index registers) to fit an instruction set as complete as the one I'm aiming for in the available opcode space.

Of course, this does give up VLIW functionality. But while VLIW may not be
a true failure, where it works is in small-scale embedded processors. So
I'm not going to worry about attempting to use VLIW as a more conventional alternative to Ivan Godard's more radical Mill design.

With sequential decode, I suppose I could site immediate values after the instruction proper, but I've found that I do not have to do that, I can
have them within the instruction body as normally indicated by its leading bits - with only one bit of awkwardness for the 64-bit immediates.

Concertina III is described at
http://www.quadibloc.com/arch/cy01int.htm

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 14 15:10:50 2026


On Thu, 14 May 2026 08:54:51 +0000, quadi wrote:

With sequential decode, I suppose I could site immediate values after
the instruction proper, but I've found that I do not have to do that, I
can have them within the instruction body as normally indicated by its leading bits - with only one bit of awkwardness for the 64-bit
immediates.

I was able to correct for that. The correction came at a price, but I
believe the price is acceptable: I can't add any more types of
instructions that are 80 bits long or longer; but in return, not only are
the doubleword immediates only 80 bits long, but the quadword immediates
are only 144 bits long; neither one needs to be padded out by an extra 16 bits.

Concertina III is described at http://www.quadibloc.com/arch/cy01int.htm

John Savard


From quadi@quadibloc@ca.invalid to comp.arch on Thu May 14 15:40:18 2026


I added the byte immediates to the diagram, and also I found I had opcode space for the supervisor call as a 16-bit instruction... with enough additional space to also restore the opcode space to be effectively
unbounded, since there is also enough 16-bit opcode space for a family of
256 16-bit instruction prefixes.

Concertina III is described at
http://www.quadibloc.com/arch/cy01int.htm

John Savard


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 14 21:41:59 2026


quadi <quadibloc@ca.invalid> posted:

After realizing that Mitch Alsup was right in that there was no real
benefit in speeding up instruction decode in the manner I was trying to achieve with the use of block headers, I had tried, by going from banks of 32 registers to banks of 16 registers, to move to variable-length instructions.

See Gould S.E.L 32/87 (or /67) for ideas to save a few bits here and there along the lines of base registers and register segmentation.

For some reason, though, I couldn't make it work. It seemed like it
should, but I couldn't get the 16-bit instructions to fit.

Well, I've made another attempt. And it seems like going to banks of 16 registers is indeed sufficient (retaining, from Concertina II, the
artifice of only using seven registers as base registers and another seven as index registers) to fit an instruction set as complete as the one I'm aiming for in the available opcode space.

Of course, this does give up VLIW functionality. But while VLIW may not be
a true failure, where it works is in small-scale embedded processors. So
I'm not going to worry about attempting to use VLIW as a more conventional alternative to Ivan Godard's more radical Mill design.

Is there any "real" or even "useful" advantage of VLIW ??? Given the number
of attempts and no real long-lasting results, history should be your guide.

With sequential decode, I suppose I could site immediate values after the instruction proper, but I've found that I do not have to do that, I can
have them within the instruction body as normally indicated by its leading bits - with only one bit of awkwardness for the 64-bit immediates.

In K9, we used a packet cache of 8 instructions per fetch, and used a
scheme called "vertical neighbor" to hold non-8-bit immediates.

In Mc88120 we just executed the SETHI and OP instructions to paste bits together.

The experience of both led me to My 66000, which simply appends constants
to the instructions (1 constant per 1 instruction). The VLI decoder is
6 gates and 2 gates of delay. This has worked out so well that I
encourage others to follow suit (or outright copy...)



From quadi@quadibloc@ca.invalid to comp.arch on Thu May 14 22:11:26 2026


On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

The experience of both led me to My 66000 that simply appends constants
to the instructions (1 constant per 1 instruction). The VLI decoder is 6 gates and 2 gates of delay.) this has worked out so well, that I
encourage others to follow suit (or outright copy...)

That is an approach which does have an important advantage. Right now, I
only have immediates for the basic integer and floating-point operations.
What about decimal floating-point immediates, for example? Appending them
to the instruction can be simple and orthogonal.

John Savard

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu May 14 22:13:24 2026


MitchAlsup <user5857@newsgrouper.org.invalid> writes:

quadi <quadibloc@ca.invalid> posted:

<snip>

With sequential decode, I suppose I could site immediate values after the
instruction proper, but I've found that I do not have to do that, I can
have them within the instruction body as normally indicated by its leading bits - with only one bit of awkwardness for the 64-bit immediates.

In K9, we used a packet cache of 8 instructions per fetch, and used a
scheme called "vertical neighbor" to hold non-8-bit immediates.

In Mc88120 we just executed the SETHI and OP instructions to paste bits together.

The experience of both led me to My 66000 that simply appends constants
to the instructions (1 constant per 1 instruction). The VLI decoder is
6 gates and 2 gates of delay.) this has worked out so well, that I
encourage others to follow suit (or outright copy...)

That was how the B3500 worked. The arithmetic instructions
were three-operand with two source operands and a destination
operand. (the DIV instruction produced both the quotient and
the remainder in the destination field).

The first operand could be a small constant (4 to 24 bits - 1 to 6 BCD digits) or
the address of the operand in memory (optionally indexed). The
remaining two operands were memory operands.


From BGB@cr88192@gmail.com to comp.arch on Fri May 15 02:28:27 2026


On 5/14/2026 4:41 PM, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

After realizing that Mitch Alsup was right in that there was no real
benefit in speeding up instruction decode in the manner I was trying to
achieve with the use of block headers, I had tried, by going from banks of 32 registers to banks of 16 registers, to move to variable-length
instructions.

See Gould S.E.L 32/87 (or /67) for ideas to save a few bits here and there along the lines of base registers and register segmentation.

For some reason, though, I couldn't make it work. It seemed like it
should, but I couldn't get the 16-bit instructions to fit.

Well, I've made another attempt. And it seems like going to banks of 16
registers is indeed sufficient (retaining, from Concertina II, the
artifice of only using seven registers as base registers and another seven as index registers) to fit an instruction set as complete as the one I'm
aiming for in the available opcode space.

Of course, this does give up VLIW functionality. But while VLIW may not be a true failure, where it works is in small-scale embedded processors. So
I'm not going to worry about attempting to use VLIW as a more conventional alternative to Ivan Godard's more radical Mill design.

Is there any "real" or even "useful" advantage of VLIW ??? Given the number of attempts and no real long-lasting results, history should be your guide.

I am left to concede here as well:
Both of my major ISA families (BJX1 and BJX2) had used VLIW;
For XG3, I ended up abandoning it in favor of superscalar.

The cost delta between VLIW and superscalar being not really enough to
justify the hassles that VLIW brings to the table.

There are still areas I have reservations:
Coherent caches vs weak caches;
Whether to have hardware partially take over for TLB management;
Rather than the current system of TLB Miss and ACL Miss traps.
...

Then, there are costs that I think are worth paying, but others disagree:
Supporting Indexed Load/Store addressing;
Supporting misaligned-safe memory access (at least for smaller types);
Ability to have encodings with large immediate/displacement fields;
...

But, more because each of these opens up a strong positive use-case while avoiding a semi-common adverse case:
1. Making performance in Doom and similar not suck;
2. Faster Huffman, Faster LZ, faster string functions, ...
3. Not taking a hit pretty much every time an Imm/Disp fails to hit.
...

Whereas, for cache coherence:
Some approaches to multi-CPU multi-threading work, but they are ones
that tend to "perform like hot garbage" even on clever chips, when they
do work (*1).

*1: Like, if people try to write code in a way that makes use of
coherent caches on x86-64 systems, performance kinda tanks. And, the way
to avoid performance tanking, is to write it like how it would work on incoherent caches.

Well, if you can get the OS to not just schedule all of the threads on
the same core that is; which is ironically, the same workaround one
would use for incoherent caches.

Goes and looks into it:
Apparently this was because the cache hierarchy works in such a way that
it was faster to schedule all of the threads on the same core than to
risk dealing with the performance penalties of coherence handling
between different cores. ... Yeah ...

Well, I guess I can count it lucky in one way:
At least the CPU I am running now has an integer divide that isn't
abysmally slow. Because apparently its direct predecessors had
implemented integer divide via microcode or something.

Well, and MS left me alone WRT the whole Win11 thing as they consider my
CPU to be "too old" (apparently not supporting anything much older than
Zen2 or similar).

Then again, recently another of the computers around that was still
running a Phenom II has stopped working. Was working well enough, until
it didn't work at all.

Causes vary, sometimes it seems like capacitors are prone to release
their goo and similar, etc...

For a short while, there were a bunch of cheap "Dell OptiPlex" computers around (got a few of these), but annoyingly they stopped being so cheap
(at this rate, should have probably ordered multiple of them last year
when the price was extra low, but, alas...).

Then again, seemingly even a lowly Core i3 in an OptiPlex could still
hold its own against a Phenom II or an Athlon X2. Then again, even being
cheap refurbs, they still somehow manage to have semi-new hardware (~ 2018..2020).

Does seem sometimes like Intel CPUs have somewhat different performance characteristics (and seemingly better Perf/MHz), but I don't know all of
the specifics as to why.

With sequential decode, I suppose I could site immediate values after the
instruction proper, but I've found that I do not have to do that, I can
have them within the instruction body as normally indicated by its leading bits - with only one bit of awkwardness for the 64-bit immediates.

In K9, we used a packet cache of 8 instructions per fetch, and used a
scheme called "vertical neighbor" to hold non-8-bit immediates.

In Mc88120 we just executed the SETHI and OP instructions to paste bits together.

The experience of both led me to My 66000 that simply appends constants
to the instructions (1 constant per 1 instruction). The VLI decoder is
6 gates and 2 gates of delay.) this has worked out so well, that I
encourage others to follow suit (or outright copy...)

In my case, I used a prefix scheme:
  Op: Decode Imm directly;
  Prefix+Op: Route Prefix bits into Op decoder, decode Op;
  2x Prefix + Op:
    Decode as Prefix+Op case;
    The 2x Prefix case produces a second output holding (63:32).

Had used a vaguely similar approach for the J21I/J52I prefixes I had
glued onto RISC-V.

Goal being that the part that decodes (63:32) shouldn't need to care
about what is going on in the final Op, and vice-versa (mostly).

Maybe could have been done better.
In XG3, I started working on replacing the J_OP+Imm16 => Imm32/Imm33 encodings, mostly because it could be preferable (for saving cost) to
maybe later allow these to be dropped.

Similarly, working towards phasing out / deprecating J_IMM+J_IMM+Imm16
for similar reasons, as it has separate decode-path logic from the J_IMM+J_IMM+Imm10 case, and it could be preferable to standardize on the
latter. Though, for XG3, both XG1/XG2 still need the original
encodings. The gradual direction, though, is to allow XG3 to be independent
of XG1 and XG2.

Many cases could be replaced by Imm33 synthesis cases, but a few cases
got wacky (and ended up adding stuff).

MOVLD Rm, Imm32u, Rn // ~ PACK in RV
MOVHD Rm, Imm32u, Rn // ~ PACKU in RV

Becoming effectively one of 4 patterns:
MOVLD Rm, Imm32u, Rn // { Rm[31: 0], Imm[31:0] }
MOVHD Rm, Imm32u, Rn // { Rm[63:32], Imm[31:0] }
MOVLD Imm32u, Rm, Rn // { Imm[31:0], Rm[31: 0] } (notation TBD)
MOVHD Imm32u, Rm, Rn // { Imm[31:0], Rm[63:32] }

This can replace both the "MOVHI32 Imm32, Rn" and "SHORI32 Imm32, Rn" encodings, and a few other cases.
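
A minimal sketch of the four patterns listed above, modelled on 64-bit values. It assumes the braces are Verilog-style concatenation with the left-hand element in the high 32 bits; the function names are made up for illustration and are not part of any XG3 tooling.

```python
MASK32 = 0xFFFFFFFF

def lo(x):
    """Bits 31:0 of a 64-bit value."""
    return x & MASK32

def hi(x):
    """Bits 63:32 of a 64-bit value."""
    return (x >> 32) & MASK32

def cat(high, low):
    """{ high, low }: assumed to mean high half first, as in Verilog."""
    return ((high & MASK32) << 32) | (low & MASK32)

# Hypothetical names, one per pattern in the list above.
def movld_rm_imm(rm, imm):   # MOVLD Rm, Imm32u, Rn -> { Rm[31:0],  Imm[31:0] }
    return cat(lo(rm), imm)

def movhd_rm_imm(rm, imm):   # MOVHD Rm, Imm32u, Rn -> { Rm[63:32], Imm[31:0] }
    return cat(hi(rm), imm)

def movld_imm_rm(imm, rm):   # MOVLD Imm32u, Rm, Rn -> { Imm[31:0], Rm[31:0] }
    return cat(imm, lo(rm))

def movhd_imm_rm(imm, rm):   # MOVHD Imm32u, Rm, Rn -> { Imm[31:0], Rm[63:32] }
    return cat(imm, hi(rm))
```

Under that reading, the "Imm32u, Rm" forms place the immediate in the upper half, which is presumably what lets them stand in for a load-high-immediate style encoding.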

These patterns didn't strictly emerge naturally from "stick a jumbo
prefix onto the existing instruction" case, as the decoder needs to
special case how the instructions are decoded if there is a prefix.

These expanding on the original MOVLD/MOVHD instructions.

This particular trick wouldn't work on RV+Jx though...

Things might be different though had all of this been done "entirely
clean" vs incremental mutation though.



From quadi@quadibloc@ca.invalid to comp.arch on Fri May 15 15:58:03 2026


On Thu, 14 May 2026 22:11:26 +0000, quadi wrote:

On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

The experience of both led me to My 66000 that simply appends constants
to the instructions (1 constant per 1 instruction). The VLI decoder is
6 gates and 2 gates of delay.) this has worked out so well, that I
encourage others to follow suit (or outright copy...)

That is an approach which does have an important advantage. Right now, I
only have immediates for the basic integer and floating-point
operations.
What about decimal floating-point immediates, for example? Appending
them to the instruction can be simple and orthogonal.

However, while it is simple and orthogonal, I didn't like not having a
unified scheme of decoding the lengths of instructions.

I was able to re-organize the opcode space for instructions longer than 32 bits so as to be able to have both minimal-length immediate instructions
for the basic operations and data types, and additional immediate
instructions which are 16 bits longer for additional operands and data
types, so I went that way.

Another issue is that in Concertina II, while an additional bit indicated
a pseudo-immediate, the register field did not go to waste; it was used to indicate the position of the pseudo-immediate in the block. So appending immediates would have been a temptation to decree that register 15 or
register 0 couldn't be the source for a register operand, which would be
bad.

John Savard

From quadi@quadibloc@ca.invalid to comp.arch on Sat May 16 03:57:06 2026


On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

Is there any "real" or even "useful" advantage of VLIW ??? Given the
number of attempts and no real long-lasting results, history should be
your guide.

As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
that a VLIW processor is highly successful as an embedded video processor.
Of course, though, he is hardly a disinterested source.

But the idea that putting bits in instructions to indicate that they can
be executed in parallel can enhance pipelining without the huge overhead
of out-of-order execution seems plausible to me. It's the same sort of argument that Ivan Godard made for his innovative Mill design. You've
noted, though, that unlike register hazards, cache misses, which are unpredictable by compilers, can be handled by a simpler form of OoO, the scoreboard of the 6600.

In a way, Concertina II is VLIW "perfected" - by putting the bits that indicate parallelism in a header at the start of the block, the price of indicating parallelism isn't a shorter instruction word, and hence having
to make do with fewer registers, or shorter displacement fields, all
things that do have an obvious negative impact on performance.

And by going from the block-oriented Concertina II design to the variable-length instruction Concertina III design, I've gone from banks of 32
registers to banks of 16 registers!

Did I have to do this?

In Concertina III, instructions longer than 32 bits take up 1/16 of the
opcode space. Adding a bit so as to use 32 registers instead of 16 would change that to 1/8.

In Concertina II, the 32-bit instructions take up about 3/4 of the opcode space.

So an ISA without block structure, with variable-length instructions
instead, with banks of 32 registers is possible! However, only 1/8 of the opcode space would be left for short instructions, and 16-bit instructions with only 13 bits available... would be largely useless. If having the
option of using 16-bit instructions is the primary benefit of having variable-length instructions, instead of every instruction being 32 bits long... then attempting to obtain the best of Concertina II and III in a single design through this artifice... which seems so very tempting... is
a mistake.

Of course, the 360 managed to get by quite well with only 1/4 of the opcode space used by 16-bit instructions. Could 14 bits be useful where 13 bits
are doomed to fail, and if so, what contrivance could I possibly use to squeeze out that extra opcode space... since I've tried, and abandoned as fatally flawed, a _lot_ of contrivances to squeeze out space in just that
way in the development of Concertina II?
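
The opcode-space arithmetic above can be checked with a short sketch. It assumes that a share of 1/2^k of the opcode space costs k leading bits of the first 16-bit halfword, and that the 3/4 figure for 32-bit instructions carries over unchanged from Concertina II; the helper name is made up for illustration.

```python
from fractions import Fraction

def prefix_bits(share: Fraction) -> int:
    """Leading bits consumed by a 1/2**k share of the opcode space."""
    d = share.denominator
    assert share.numerator == 1 and d & (d - 1) == 0, "share must be 1/2**k"
    return d.bit_length() - 1

# Instructions longer than 32 bits: 1/16 today, doubled by one extra register bit.
long_share = Fraction(1, 16) * 2
# 32-bit instructions, using the Concertina II figure quoted above.
word_share = Fraction(3, 4)
# Whatever is left over goes to 16-bit instructions.
short_share = 1 - long_share - word_share

short_free = 16 - prefix_bits(short_share)      # usable bits in a 16-bit instruction
s360_free = 16 - prefix_bits(Fraction(1, 4))    # the System/360 comparison

print(short_share, short_free, s360_free)       # prints: 1/8 13 14
```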

Block structure had the advantage of letting me pack more bits in instructions. That it let me offer VLIW, in the sense of controlling
parallel execution, as an option... was just gravy.

John Savard

From quadi@quadibloc@ca.invalid to comp.arch on Sat May 16 04:13:01 2026


On Sat, 16 May 2026 03:57:06 +0000, quadi wrote:

Could 14 bits be useful where
13 bits are doomed to fail,

Actually, though, I had worked out two ways where 16 bit short
instructions that all must start with 111 could perhaps do useful work.

The first one was:

111 + (seven bit opcode) + (3) + (3)

Just have operate instructions that only use the first eight registers.

And then I came up with an alternative:

111 + (seven bit opcode) + (1) + (5)
11111 + (seven bit opcode) + (3) + (1)

since I'm only using 96 opcodes, not 128.

In the primary format, only registers 0 and 1 are destination registers,
but all 32 registers are source registers.

The secondary format tries to balance that out by letting results in those
two accumulators participate in operations with the first eight registers
as destination registers. So those first two registers are still a
bottleneck, but the need to add extra operations to move results out of
those registers is, hopefully, reduced.

But as far as I know, nobody has tried to design an ISA this way, so
nobody has tried to figure out how to write a compiler to make effective
use of such a design.
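
To make the alternative pair of formats concrete, here is a rough decoding sketch. It assumes the fields are packed most-significant-bit first in the order written, that the 1-bit and 3-bit fields are the destination in the primary and secondary formats respectively, and that the 32 unused opcode values (those starting with 11) are what turn the 111 prefix into 11111. None of that is confirmed beyond what the post says.

```python
def decode_short(word: int):
    """Decode a 16-bit short instruction under the assumptions stated above."""
    assert 0 <= word < (1 << 16)
    if (word >> 13) != 0b111:
        return None  # not a short instruction in this scheme
    if (word >> 11) & 0b11 == 0b11:
        # Secondary: 11111 + opcode(7) + dest(3) + src(1)
        return {
            "format": "secondary",
            "opcode": (word >> 4) & 0x7F,
            "dest": (word >> 1) & 0x7,   # one of the first eight registers
            "src": word & 0x1,           # accumulator 0 or 1
        }
    # Primary: 111 + opcode(7) + dest(1) + src(5); opcode is 0..95 here
    return {
        "format": "primary",
        "opcode": (word >> 6) & 0x7F,
        "dest": (word >> 5) & 0x1,       # accumulator 0 or 1
        "src": word & 0x1F,              # any of 32 registers
    }

# Both layouts add up to 16 bits: 3+7+1+5 and 5+7+3+1.
primary = (0b111 << 13) | (5 << 6) | (1 << 5) | 17
secondary = (0b11111 << 11) | (5 << 4) | (6 << 1) | 1
print(decode_short(primary))
print(decode_short(secondary))
```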

John Savard

From quadi@quadibloc@ca.invalid to comp.arch on Sat May 16 13:13:59 2026


On Sat, 16 May 2026 04:13:01 +0000, quadi wrote:

On Sat, 16 May 2026 03:57:06 +0000, quadi wrote:

Could 14 bits be useful where 13 bits are doomed to fail,

Actually, though, I had worked out two ways where 16 bit short
instructions that all must start with 111 could perhaps do useful work.

The first one was:

111 + (seven bit opcode) + (3) + (3)

I have finally realized that there is a way to turn the impossible goal
that seemed so tantalizingly close to achievement into something possible.

Just add

11111 +
(break bit) +
(seven-bit opcode) +
(condition code bit) +
(five-bit destination register) +
(five-bit source register)

and there you have it. An operate instruction that has five-bit source and destination register fields, and is shorter than 32 bits.

What's that? It isn't sixteen bits long! No, it isn't. But if each
instruction indicates how long it is with its starting bits, and then one looks for the next instruction where that one ends... then instructions
can start anywhere.
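
The field list above sums to 5 + 1 + 7 + 1 + 5 + 5 = 24 bits. A small sketch of pulling the fields apart, assuming they are packed most-significant-bit first in the order given (the field names are just labels for this example):

```python
# (name, width) pairs in the order listed above; MSB-first packing is an assumption.
FIELDS = [
    ("prefix", 5),     # must be 11111
    ("break_bit", 1),
    ("opcode", 7),
    ("cc_bit", 1),
    ("dest", 5),
    ("src", 5),
]
TOTAL_BITS = sum(width for _, width in FIELDS)  # 24

def decode24(word: int) -> dict:
    """Split a 24-bit instruction into its fields."""
    assert 0 <= word < (1 << TOTAL_BITS)
    out = {}
    shift = TOTAL_BITS
    for name, width in FIELDS:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    if out["prefix"] != 0b11111:
        raise ValueError("not a 24-bit short instruction")
    return out

example = (0b11111 << 19) | (1 << 18) | (0x2A << 11) | (0 << 10) | (31 << 5) | 3
print(TOTAL_BITS, decode24(example))
```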

Well, at least the last bit in the displacement field of a jump
instruction is no longer going to waste.

I wanted to follow the illustrious example of the 68000 and the
System/360, instead of the disaster that is x86, but if 24-bit short instructions are the price of having register banks with 32 registers -
*and* continuing to have the option of 12-bit displacements and 20-bit displacements - then it has to be paid.

Yes. Concertina IV is coming. Be very afraid?

John Savard

From John Levine@johnl@taugh.com to comp.arch on Sat May 16 17:35:04 2026


According to quadi <quadibloc@ca.invalid>:

On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

Is there any "real" or even "useful" advantage of VLIW ??? Given the
number of attempts and no real long-lasting results, history should be
your guide.

As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
that a VLIW processor is highly successful as an embedded video processor.
Of course, though, he is hardly a disinterested source.

It works great in programs where the compiler can predict the sequence of memory
references at compile time, much less well when the sequence is data dependent.

I can believe that video processing falls into the first category.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From BGB@cr88192@gmail.com to comp.arch on Sat May 16 12:38:58 2026


On 5/15/2026 10:57 PM, quadi wrote:

On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

Is there any "real" or even "useful" advantage of VLIW ??? Given the
number of attempts and no real long-lasting results, history should be
your guide.

As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
that a VLIW processor is highly successful as an embedded video processor.
Of course, though, he is hardly a disinterested source.

But the idea that putting bits in instructions to indicate that they can
be executed in parallel can enhance pipelining without the huge overhead
of out-of-order execution seems plausible to me. It's the same sort of argument that Ivan Godard made for his innovative Mill design. You've
noted, though, that unlike register hazards, cache misses, which are unpredictable by compilers, can be handled by a simpler form of OoO, the scoreboard of the 6600.

It is more VLIW vs In-Order, and In-Order vs OoO.

VLIW vs In-Order:
  VLIW:
    + Slightly cheaper logic;
    - Binary depends more on processor specifics.
  In-Order:
    - Needs logic to handle register deps and lookup opcode flags.
    + Code does not depend on uArch.

In-Order vs Out-of-Order:
  In-Order:
    + Simpler hardware
    - Not as fast
  OoO:
    - Complex hardware (reorder buffer, scoreboard/renamer, ...)
    + Faster

Both VLIW and In-Order benefit from a large register file.
OoO mostly benefits ISA designs that would otherwise be slow.
Mostly absorbing the cost of a lot of the ISA level inefficiencies.

Theoretically, OoO can better absorb cache misses, however my own
testing implies that the delta vs "cache miss results in pipeline stall"
vs "delay instruction to hide miss" appears to be mostly negligible.

Also raw CPU speed doesn't matter as much when the computation is
primarily limited by RAM bandwidth or latency (seems to be a pretty
common scenario IME).

In my case, I had realized that In-Order could be handled nearly exactly
the same as my prior LIW handling (no real changes needed to the
pipeline, etc), with the primary change that the I$ can have logic to
detect which instructions can run in parallel during cache line fetch,
and doing this is in-effect cheap enough to be worthwhile (the in-order
not adding any significant resource cost over LIW).

So, in my case, 16 byte cache lines, in Op0..Op3:
Can Op0 co-execute with Op1?
Can Op1 co-execute with Op2?
Can Op2 co-execute with Op3?
Can Op0/1/2 co-execute?
Can Op1/2/3 co-execute?

Not too unreasonable.
Implementation currently can't deal with 16-bit ops, checks across cache lines, misaligned ops, mixed RV and XG3 sequences, ... But, still mostly
works reasonably OK. An implementation that dealt with all of these edge
cases (such as to not take a significant performance hit with RV-C)
would have added a little more cost though.

The co-mixed RV and XG3 scenario was mostly limited because checking for register aliases between the mismatched register fields (reg5/reg6) was
more expensive than ideal. So, cheaper to only check between ops of the
same type and assume that mismatched ops may potentially have a
register-alias (even when they don't).

So, say, superscalar logic was a lookup over opcode bits for flags like:
  Can this op run in Lane 2?
  Can this op run in Lane 3?
  Can this op run with another op in Lane 2?
  Can this op run with another op in Lane 3?
  Does this op use Rd as a source?
  Does this op use Rt as a source?
  ...
Then, say, checks between register fields:
  Rd0==Rs1, Rd0==Rt1
  Rd1==Rs0, Rd1==Rt0
  ...

Then, feed all of these bits through a few lookups, reducing it to the
"Can Op0/Op1 co-execute?" question.

Not free, but fast enough to be handled when a new cache line arrives
(cache line generally being mode-tagged, etc).
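
A behavioural sketch of the pairwise check described above, not the actual hardware logic (which would be flag lookups and comparators rather than Python sets). The `Op` record, its flag names, and the exact combination of checks are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Op:
    rd: int
    rs: int
    rt: int
    lane2_ok: bool = True        # "Can this op run in Lane 2?"
    rd_is_source: bool = False   # "Does this op use Rd as a source?"
    rt_is_source: bool = True    # "Does this op use Rt as a source?"

def sources(op: Op) -> set:
    """Registers the op reads, according to its flags."""
    regs = {op.rs}
    if op.rt_is_source:
        regs.add(op.rt)
    if op.rd_is_source:
        regs.add(op.rd)
    return regs

def can_coexecute(op0: Op, op1: Op) -> bool:
    """Can op1 issue alongside op0 (op0 in Lane 1, op1 in Lane 2)?"""
    if not op1.lane2_ok:
        return False
    if op0.rd in sources(op1):   # Rd0==Rs1, Rd0==Rt1
        return False
    if op1.rd in sources(op0):   # Rd1==Rs0, Rd1==Rt0
        return False
    return True

def pair_flags(line):
    """Op0/Op1, Op1/Op2, Op2/Op3 answers for one fetched cache line."""
    return [can_coexecute(a, b) for a, b in zip(line, line[1:])]

line = [Op(rd=3, rs=1, rt=2), Op(rd=4, rs=3, rt=5), Op(rd=6, rs=7, rt=8)]
print(pair_flags(line))  # prints: [False, True]
```

The three-wide questions (Op0/1/2, Op1/2/3) would be built from the same pairwise answers plus Lane 3 flags, which this sketch leaves out.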

...

In a way, Concertina II is VLIW "perfected" - by putting the bits that indicate parallelism in a header at the start of the block, the price of indicating parallelism isn't a shorter instruction word, and hence having
to make do with fewer registers, or shorter displacement fields, all
things that do have an obvious negative impact on performance.

And by going from the block-oriented Concertina II design to the variable-length instruction Concertina III design, I've gone from banks of 32 registers to banks of 16 registers!

Did I have to do this?

In Concertina III, instructions longer than 32 bits take up 1/16 of the opcode space. Adding a bit so as to use 32 registers instead of 16 would change that to 1/8.

In Concertina II, the 32-bit instructions take up about 3/4 of the opcode space.

So an ISA without block structure, with variable-length instructions
instead, with banks of 32 registers is possible! However, only 1/8 of the opcode space would be left for short instructions, and 16-bit instructions with only 13 bits available... would be largely useless. If having the
option of using 16-bit instructions is the primary benefit of having variable-length instructions, instead of every instruction being 32 bits long... then attempting to obtain the best of Concertina II and III in a single design through this artifice... which seems so very tempting... is
a mistake.

Of course, the 360 managed to get by quite well with only 1/4 of the opcode space used by 16-bit instructions. Could 14 bits be useful where 13 bits
are doomed to fail, and if so, what contrivance could I possibly use to squeeze out that extra opcode space... since I've tried, and abandoned as fatally flawed, a _lot_ of contrivances to squeeze out space in just that
way in the development of Concertina II?

Block structure had the advantage of letting me pack more bits in instructions. That it let me offer VLIW, in the sense of controlling
parallel execution, as an option... was just gravy.

Bits, elsewhere, are still bits.
And, things like the pigeonhole principle and similar still apply.

Now you have mostly just added the issue that either there is a spot
that can't be used and needs to be skipped over (per-block), the number
of instructions per block is NPOT, and/or the instruction size is NPOT.

One could try making the memory blocks NPOT, but this itself adds suck.



From BGB@cr88192@gmail.com to comp.arch on Sat May 16 12:58:28 2026


On 5/16/2026 12:35 PM, John Levine wrote:

According to quadi <quadibloc@ca.invalid>:

On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

Is there any "real" or even "useful" advantage of VLIW ??? Given the
number of attempts and no real long-lasting results, history should be
your guide.

As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
that a VLIW processor is highly successful as an embedded video processor.
Of course, though, he is hardly a disinterested source.

It works great in programs where the compiler can predict the sequence of memory
references at compile time, much less well when the sequence is data dependent.

I can believe that video processing falls into the first category.

I suspect personally (based on behaviors I have seen in existing consumer-grade processors):
It can either be predicted in advance;
Or, it can't reliably be predicted at all.

Seemingly, if comparing modern fancy CPUs with designs that have
competently designed ISAs (but are in-order):
I haven't usually seen all that strong of a divergence between in-order
and out-of-order results in various benchmarks (when excluding those
that are determined primarily by "How fast does the RAM go?").

Like, seemingly everything mostly scales fairly linearly with
clock-speed, and with OoO seemingly only gaining a fairly minor bump.

Well, at least if excluding things like:
Short/tight loop with stupidly complex arithmetic expression.
OoO does pretty well at these...

But, then, this is a coding style that is better off not used, because
it often performs poorly in-general. And, seemingly, when writing code
in ways that perform well in general, much of the advantages seemingly evaporate (well, and/or it becomes RAM speed bound, whichever happens
first).

Like, big fancy/expensive tool that mostly compensates for some
combination of poor ISA and poorly optimized code.

...

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 16 18:23:32 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> posted:

On 5/15/2026 10:57 PM, quadi wrote:

On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

Is there any "real" or even "useful" advantage of VLIW ??? Given the
number of attempts and no real long-lasting results, history should be
your guide.

As I've noted, "the guy who invented VLIW" claimed, in a YouTube video, that a VLIW processor is highly successful as an embedded video processor. Of course, though, he is hardly a disinterested source.

But the idea that putting bits in instructions to indicate that they can
be executed in parallel can enhance pipelining without the huge overhead
of out-of-order execution seems plausible to me. It's the same sort of argument that Ivan Godard made for his innovative Mill design. You've noted, though, that unlike register hazards, cache misses, which are unpredictable by compilers, can be handled by a simpler form of OoO, the scoreboard of the 6600.

It is more VLIW vs In-Order, and In-Order vs OoO.

VLIW vs In-Order:
VLIW:
+ Slightly cheaper logic;
- Binary depends more on processor specifics.
In-Order:
- Needs logic to handle register deps and lookup opcode flags.
+ Code does not depend on uArch.

In-Order vs Out-of-Order:
In-Order:
+ Simpler hardware
- Not as fast
OoO:
- Complex hardware (reorder buffer, scoreboard/renamer, ...)
+ Faster

S/Faster/Higher Performing/

Both VLIW and In-Order benefit from a large register file.
OoO mostly benefits ISA designs that would otherwise be slow.
Mostly absorbing the cost of a lot of the ISA level inefficiencies.

Theoretically, OoO can better absorb cache misses, however my own
testing implies that the delta vs "cache miss results in pipeline stall"
vs "delay instruction to hide miss" appears to be mostly negligible.

You do not have an execution pipeline with depth > L1 cache miss
latency. When you do, new effects become feasible--like beginning
the second next loop iteration before the first one has completed.
This is where you can now absorb the L1 cache miss latency.

Also raw CPU speed doesn't matter as much when the computation is
primarily limited by RAM bandwidth or latency (seems to be a pretty
common scenario IME).

In my case, I had realized that In-Order could be handled nearly exactly
the same as my prior LIW handling (no real changes needed to the
pipeline, etc), with the primary change that the I$ can have logic to
detect which instructions can run in parallel during cache line fetch,

When you do not have condition codes, and only 1 register file, you
can determine parallel-ness by simply looking at the registers.

and doing this is in-effect cheap enough to be worthwhile (the in-order
not adding any significant resource cost over LIW).

So, in my case, 16 byte cache lines, in Op0..Op3:
Can Op0 co-execute with Op1?

When Rd-1 ~= either{SRC1-2, or SRC2-2}

Can Op1 co-execute with Op2?

When Rd-1 ~= either{SRC1-3, or SRC2-3}

Can Op2 co-execute with Op3?

When Rd-2 ~= either{SRC1-3, or SRC2-3}

Can Op0/1/2 co-execute?
Can Op1/2/3 co-execute?
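
A minimal sketch of the register check described above, reading "Rd-1 ~= either{SRC1-2, or SRC2-2}" as "the destination of the earlier op differs from both sources of the later op". The names `Op`, `can_pair`, `pair_flags` and `can_group` are hypothetical, and the sketch assumes (as stated above) no condition codes and a single register file. It only looks at read-after-write; ops without a destination and write-after-write cases are not modeled.

```python
from collections import namedtuple

# One decoded instruction: destination register and two source registers.
Op = namedtuple("Op", "rd src1 src2")

def can_pair(first, second):
    """True if `second` does not read the register written by `first`."""
    return first.rd not in (second.src1, second.src2)

def pair_flags(line):
    """Adjacent-pair flags for one cache line: Op0/Op1, Op1/Op2, Op2/Op3."""
    return [can_pair(a, b) for a, b in zip(line, line[1:])]

def can_group(ops):
    """True if no later op in the group reads a result of any earlier op."""
    return all(can_pair(ops[i], ops[j])
               for i in range(len(ops))
               for j in range(i + 1, len(ops)))

# Four ops as they might sit in one 16-byte line (register numbers made up).
line = [Op(1, 2, 3), Op(4, 1, 5), Op(6, 7, 8), Op(9, 6, 6)]
flags = pair_flags(line)          # Op1 reads r1 from Op0, Op3 reads r6 from Op2
group_012 = can_group(line[0:3])  # blocked by the Op0 -> Op1 dependency
group_12 = can_group(line[1:3])   # Op1 and Op2 are independent
```

Since the flags depend only on the register fields, they could be computed once at cache line fetch and stored alongside the line, which is the point being made above.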
----------
So, say, superscalar logic was a lookup over opcode bits for flags like:
can this op run in Lane 2?

Depends on what is in Lane 2

Can this op run in Lane 3?

...

Can this op run with another op in Lane 2?
Can this op run with another op in Lane 3?
Does this op use Rd as a source?
Does this op use Rt as a source?

Given nomenclature like Mc88120 where {
Lanes = {MEM0, MEM1, MEM2, FADD, FMUL, Branch}
And MEM has an integer unit, and a shift unit
FADD has an integer unit
FMUL has an integer unit
Branch has an integer unit }
And each unit is buffered with its own reservation station;
You just let the RSs create a solution.

Given nomenclature like M5 with >10 FUs, the calculation is harder,
but you still just let the RSs create the solution.
--------------
--- Synchronet 3.22a-Linux NewsLink 1.2

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat May 16 20:48:45 2026

From Newsgroup: comp.arch

John Levine wrote:

According to quadi <quadibloc@ca.invalid>:

On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

Is there any "real" or even "useful" advantage of VLIW ??? Given the
number of attempts and no real long-lasting results, history should be
your guide.

As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
that a VLIW processor is highly successful as an embedded video processor.
Of course, though, he is hardly a disinterested source.

It works great in programs where the compiler can predict the sequence of memory
references at compile time, much less well when the sequence is data dependent.

I can believe that video processing falls into the first category.

Feel free to believe so, then read up on the CABAC h.264 encoding:

This is an arithmetic compression setup where you for every bit decoded
have to make a branch to separate code using a different context (that
is the context-adaptive binary arithmetic coding which gave the acronym).

What it means is that any sw decoder will have a 50% branch where you
cannot "simply" execute both parts in parallel, or use the same code
just with context-dependent table lookups.

It is fine for HW, pretty much pessimal for SW.

It works due to two factors: (a) Most/many videos use the much more SW-friendly alternative encoding which provides a few less percent
compression rate but at comparably lower encode/decode cost, and (b) cpu vendors like Intel license a chunk of VLSI intellectual property which
does major parts (or all?) in hardware, mostly because it also saves a
ton of power, allowing a cell phone or laptop to play video without
running out of battery power long before the film ends.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Sun May 17 15:24:20 2026

From Newsgroup: comp.arch

On Sat, 16 May 2026 13:13:59 +0000, quadi wrote:

On Sat, 16 May 2026 04:13:01 +0000, quadi wrote:

The first one was:

111 + (seven bit opcode) + (3) + (3)

I have finally realized that there is a way to turn the impossible goal
that seemed so tantalizingly close to achievement into something
possible.

Just add

11111 +
(break bit) +
(seven-bit opcode) +
(condition code bit) +
(five-bit destination register) +
(five-bit source register)
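
As a quick check, the fields listed above add up to 24 bits (5 + 1 + 7 + 1 + 5 + 5). The following is only a sketch of how such a word could be packed and unpacked; the field names, and the assumption that the fields are laid out most significant first in the order listed, are mine and not taken from the Concertina description.

```python
# Field widths in the order listed above. MSB-first packing is an assumption.
FIELDS = [
    ("prefix", 5),   # fixed 11111 leading bits
    ("brk", 1),      # break bit
    ("opcode", 7),   # seven-bit opcode
    ("cc", 1),       # condition code bit
    ("rd", 5),       # five-bit destination register
    ("rs", 5),       # five-bit source register
]
TOTAL_BITS = sum(width for _, width in FIELDS)

def encode(**values):
    """Pack the named fields into one integer, first field in the top bits."""
    word = 0
    for name, width in FIELDS:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name}={v} does not fit in {width} bits")
        word = (word << width) | v
    return word

def decode(word):
    """Split an encoded word back into its named fields."""
    out = {}
    shift = TOTAL_BITS
    for name, width in FIELDS:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out

sample = dict(prefix=0b11111, brk=0, opcode=0x2A, cc=1, rd=3, rs=17)
word = encode(**sample)
```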

The thing is, though, that in Concertina IV, I want to bring back some
things that Concertina III, with banks of 16 registers, had to give up.
20-bit long displacements, for one, and extended register banks of 128
registers for another.

So I need additional opcode space for 48-bit instructions.

I have come up with a place to find it.

Just drop the 16-bit short instructions entirely, as, being confined to
the first eight registers, they're not very useful (? actually, they could
be quite useful, unless code that isn't spread out in a large register
bank hits performance badly, which is likely to be the case in a typical
Concertina IV implementation, given that its design is closer to that of
Concertina II than III), keeping only the 24-bit short instructions. That
almost doubles the opcode space left for instructions larger than 32 bits.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Sun May 17 15:37:47 2026

From Newsgroup: comp.arch

On 5/16/2026 1:23 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 5/15/2026 10:57 PM, quadi wrote:

On Thu, 14 May 2026 21:41:59 +0000, MitchAlsup wrote:

Is there any "real" or even "useful" advantage of VLIW ??? Given the
number of attempts and no real long-lasting results, history should be
your guide.

As I've noted, "the guy who invented VLIW" claimed, in a YouTube video,
that a VLIW processor is highly successful as an embedded video processor.
Of course, though, he is hardly a disinterested source.

But the idea that putting bits in instructions to indicate that they can
be executed in parallel can enhance pipelining without the huge overhead
of out-of-order execution seems plausible to me. It's the same sort of
argument that Ivan Godard made for his innovative Mill design. You've
noted, though, that unlike register hazards, cache misses, which are
unpredictable by compilers, can be handled by a simpler form of OoO, the
scoreboard of the 6600.

It is more VLIW vs In-Order, and In-Order vs OoO.

VLIW vs In-Order:
VLIW:
+ Slightly cheaper logic;
- Binary depends more on processor specifics.
In-Order:
- Needs logic to handle register deps and lookup opcode flags.
+ Code does not depend on uArch.

In-Order vs Out-of-Order:
In-Order:
+ Simpler hardware
- Not as fast
OoO:
- Complex hardware (reorder buffer, scoreboard/renamer, ...)
+ Faster

S/Faster/Higher Performing/

Both VLIW and In-Order benefit from a large register file.
OoO mostly benefits ISA designs that would otherwise be slow.
Mostly absorbing the cost of a lot of the ISA level inefficiencies.

Theoretically, OoO can better absorb cache misses, however my own
testing implies that the delta vs "cache miss results in pipeline stall"
vs "delay instruction to hide miss" appears to be mostly negligible.

You do not have an execution pipeline with depth > L1 cache miss
latency. When you do, new effects become feasible--like beginning
the second next loop iteration before the first one has completed.
This is where you can now absorb the L1 cache miss latency.

OK. I am not sure what mainstream CPUs are doing here.

Mainly I had been running AMD chips, but had often failed to see much
that puts them clearly outside the realm of what could be expected from extrapolating the clock speeds and adding some minor fudge-factors
relative to modeling an in-order design (*1).

There is a difference for Intel CPUs though, which generally appear to
give better perf per clock in various cases, and have performance
behaviors which don't really match up to modeling them as an in-order
machine.

*1: Though not necessarily one running x86...

One needs to assume significantly more registers and that store+reload
to the same address behaves like a register MOV and similar. But, it is
like, if one replaces x86 with a RISC-style ISA, the overall performance behavior can match up pretty well.

Otherwise, one may also need to assume the ability to co-issue memory
loads.

Say, for example, if one were to pretend that a Piledriver sort of
looked like:
64 registers
3 ALUs
1 LD/ST per clock (with a 4 cycle load latency)
24 cycle branch mispredict
1 SIMD op per clock
...

And, Zen+ sorta like, say:
4 ALUs
2 LD / 1 ST per clock (4c load)
12 cycle branch mispredict
1 SIMD op per clock
...

But, as noted, this approach seems to fall apart with Intel CPUs, which
seem to diverge more noticeably from predictions one could make based on assuming an in-order model (and not as easily modeled in general).

If one assumes a 32K direct-mapped L1 + 4 way victim cache and 32-byte
cache lines, this also seems to match up reasonably well. Like, it isn't
quite as smooth as a 4-way associative cache, nor as poorly behaved as a
plain direct-mapped L1.

But, the DM L1 + VC approach was based on my own design efforts, but
does still appear curiously close to benchmarks run on PC class
hardware. In my case, I went with a 4-way VC mostly for cost
optimization (gains from going 8-way were small, but cost was steep).
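
To make the "direct-mapped L1 + victim cache" idea concrete, here is a rough hit/miss model. It is a sketch under assumptions, not the actual design: the class name `DMWithVictim` is made up, and the victim cache is assumed to be a small fully-associative buffer of 4 lines with LRU replacement, which may differ from how the 4-way VC is really organized.

```python
from collections import OrderedDict

class DMWithVictim:
    """Direct-mapped L1 backed by a small LRU victim buffer (assumed layout)."""

    def __init__(self, size_bytes=32 * 1024, line_bytes=32, victim_lines=4):
        self.line_bytes = line_bytes
        self.num_slots = size_bytes // line_bytes
        self.l1 = [None] * self.num_slots   # line address held in each slot
        self.victim = OrderedDict()         # line address -> None, oldest first
        self.victim_lines = victim_lines

    def access(self, addr):
        line = addr // self.line_bytes
        idx = line % self.num_slots
        if self.l1[idx] == line:
            return "l1_hit"
        evicted = self.l1[idx]
        self.l1[idx] = line
        if line in self.victim:
            # Line was recently pushed out of L1: swap it back in.
            del self.victim[line]
            result = "victim_hit"
        else:
            result = "miss"
        if evicted is not None:
            # The displaced L1 line goes to the victim buffer.
            self.victim[evicted] = None
            if len(self.victim) > self.victim_lines:
                self.victim.popitem(last=False)  # drop least recently inserted
        return result

# Two addresses that map to the same L1 slot (32K apart).
cache = DMWithVictim()
trace = [cache.access(a) for a in (0, 32 * 1024, 0, 32 * 1024, 4)]
```

With a plain direct-mapped L1 the two conflicting addresses would miss on every access; here the ping-pong is caught by the victim buffer after the first two cold misses.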

But, it is strange in a way, as AFAIK the x86 chips have native set-associative caches.

Do need to assume a high associativity for the L2 caches though (unlike
my designs which had used a direct-mapped L2 and the VC was more to
reduce "damage" caused to the L2 cache by the L1 conflict misses, which
were comparably more expensive in L2 land).

One difference being that for the PC, one needs to assume that
load/store ordering is preserved between cores, but to be modeled one
can add around an 80 cycle or so penalty for every time a line is
modified on one core and then read on another.

...

Also raw CPU speed doesn't matter as much when the computation is
primarily limited by RAM bandwidth or latency (seems to be a pretty
common scenario IME).

In my case, I had realized that In-Order could be handled nearly exactly
the same as my prior LIW handling (no real changes needed to the
pipeline, etc), with the primary change that the I$ can have logic to
detect which instructions can run in parallel during cache line fetch,

When you do not have condition codes, and only 1 register file, you
can determine parallel-ness by simply looking at the registers.

Yes.

There is a little more though depending on "how" the instructions may be
used in a core where not all lanes are the same.

and doing this is in-effect cheap enough to be worthwhile (the in-order
not adding any significant resource cost over LIW).

So, in my case, 16 byte cache lines, in Op0..Op3:
Can Op0 co-execute with Op1?

When Rd-1 ~= either{SRC1-2, or SRC2-2}

Can Op1 co-execute with Op2?

When Rd-1 ~= either{SRC1-3, or SRC2-3}

Can Op2 co-execute with Op3?

When Rd-2 ~= either{SRC1-3, or SRC2-3}

Can Op0/1/2 co-execute?
Can Op1/2/3 co-execute?
----------
So, say, superscalar logic was a lookup over opcode bits for flags like:
can this op run in Lane 2?

Depends on what is in Lane 2

My case:
Lane 1:
MOV, ALU, CONV1/2, SHAD, CMP, LEA/MEM, MUL,
BRANCH, FPU/SIMD (*), ...
Lane 2:
MOV, ALU, CONV1/2, SHAD, LEA, FPU/SIMD (*)
*: But, Lane 1/2 can't co-issue FPU or SIMD unless "compatible".
Lane 3:
MOV, ALU, CONV1

Where:
MOV: 1-cycle register MOV (includes constant load);
ALU: Basic ALU instructions
CONV1: Basic Converter ops (sign/zero extension, etc)
CONV2: Advanced converter ops (SIMD, etc)
SHAD: Integer Shift
CMP: Integer Compare
LEA: Address computation
MEM: Memory Load/Store
MUL: Integer Multiply
BRANCH: Obvious enough
FPU/SIMD: Obvious enough

Lane 3 originally had SHAD as well, but it was dropped mostly because
rarely used so harder to justify cost.

So, based on which lanes have the needed units, it determines which
flags for the lanes it is allowed to run in.
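
The lane table above can be written down as a lookup. This is only an illustration: the names `LANE_UNITS`, `lane_flags` and `fits_in_order` are hypothetical, the "..." entries in the Lane 1 list are left out, and the rule that Lanes 1 and 2 can only co-issue FPU/SIMD ops when "compatible" is not modeled.

```python
# Units reachable from each lane, taken from the list above.
LANE_UNITS = {
    1: {"MOV", "ALU", "CONV1", "CONV2", "SHAD", "CMP", "LEA", "MEM",
        "MUL", "BRANCH", "FPU_SIMD"},
    2: {"MOV", "ALU", "CONV1", "CONV2", "SHAD", "LEA", "FPU_SIMD"},
    3: {"MOV", "ALU", "CONV1"},
}

def lane_flags(unit):
    """Per-lane flags: may an op needing `unit` run in this lane?"""
    return {lane: unit in units for lane, units in LANE_UNITS.items()}

def fits_in_order(units):
    """True if op i can go in lane i+1, with no swapping of instructions."""
    if len(units) > len(LANE_UNITS):
        return False
    return all(unit in LANE_UNITS[i + 1] for i, unit in enumerate(units))

shad_flags = lane_flags("SHAD")                  # shifts: lanes 1 and 2 only
ok = fits_in_order(["MEM", "ALU", "MOV"])        # load first, then ALU, MOV
bad = fits_in_order(["ALU", "MEM"])              # MEM is not available in lane 2
```

The last case shows why instruction order matters here: the same two ops fit if the memory op comes first.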

There are some gains from having the compiler try to shuffle
instructions around and look for an ordering that fits well into the
pipeline.

Though, a more advanced IF stage would maybe have a mechanism to allow swapping instructions if they would be able to co-execute but would need
to do so in a different lane ordering.

This could maybe overlap with a TODO item of making fetches
align/justify the instructions with the correct lanes rather than have
the ID stage deal with this part (say, for example, so that Lane3 always
uses the same decoder and could allow for more corner-cutting, and less register-port routing logic).

Can this op run in Lane 3?

...

Can this op run with another op in Lane 2?
Can this op run with another op in Lane 3?
Does this op use Rd as a source?
Does this op use Rt as a source?

Given nomenclature like Mc88120 where {
Lanes = {MEM0, MEM1, MEM2, FADD, FMUL, Branch}
And MEM has an integer unit, and a shift unit
FADD has an integer unit
FMUL has an integer unit
Branch has an integer unit }
And each unit is buffered with its own reservation station;
You just let the RSs create a solution.

Given nomenclature like M5 with >10 FUs, the calculation is harder,
but you still just let the RSs create the solution.
--------------

I was using a convention where the pipeline is divided into 3 lanes.
So, everything plugs into a single unified pipeline, rather than a
separate sub-pipeline for each FU.

Each has fixed register ports and other resources:
Lane 1: Rs, Rt, Imm1 (33b)
Lane 2: Ru, Rv, Imm2 (33b)
Lane 3: Rx, Ry, Imm3 (17b/33b)

Each lane has a register write port, along with some flag bits for
whether the result is ready (for sake of register forwarding), ...

Can put an ALU op into each, or whichever instruction into a given lane
that the lane in question has access to a unit capable of handling it.

Say, for example:
If you tried putting a Shift or Multiply or SIMD op in Lane 3, it
wouldn't work, because Lane 3 lacks the logic to handle it.

As noted, each lane normally only has 2 register read ports and 1
immediate. If an instruction needs 3 inputs, or a 2nd immediate (17b
only for now), it eats Lane 3 (which can no longer hold an instruction).

Or, if a SIMD op happens, Lane 1 eats both 2 and 3, turning them
effectively into a single wider lane for that instruction.

--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Mon May 18 17:39:59 2026

From Newsgroup: comp.arch

On Sun, 17 May 2026 15:24:20 +0000, quadi wrote:

So I need additional opcode space for 48-bit instructions.

I managed to find enough space for the 48-bit instructions without taking
any from elsewhere.

However, I'm now encountering a problem with the 32-bit instructions.
Given how I'm handling other sizes of immediates, I want all 32 registers
to be possible destinations for the 16-bit immediates.

This leads to an opcode space shortage for 32-bit operate instructions.
There was a little slack in the existing 32-bit instructions that I could squeeze, but not enough.

The amount needed, though, is 1/3 the size of what the 16-bit short instructions take, or the same as what the 24-bit short instructions take.

Possible easy and obvious alternatives:

1) Drop the 24-bit short instructions, they're a weird length.
2) Go to 6-bit opcodes for the 16-bit short instructions, limiting them to
the most important data types.
3) Stick with only 8 (or even only 16) registers as the destination of a 16-bit immediate.

Maybe I can squeeze more and avoid having to do any of them; if I must
choose, (2) sounds like the most attractive, as a short instruction that
can only work on the first 8 registers is disfavored anyways.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 18 17:53:23 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

On Sun, 17 May 2026 15:24:20 +0000, quadi wrote:

So I need additional opcode space for 48-bit instructions.

I managed to find enough space for the 48-bit instructions without taking any from elsewhere.

However, I'm now encountering a problem with the 32-bit instructions.
Given how I'm handling other sizes of immediates, I want all 32 registers
to be possible destinations for the 16-bit immediates.

This leads to an opcode space shortage for 32-bit operate instructions. There was a little slack in the existing 32-bit instructions that I could squeeze, but not enough.

The amount needed, though, is 1/3 the size of what the 16-bit short instructions take, or the same as what the 24-bit short instructions take.

I think you should introduce Peter to Paul.

Possible easy and obvious alternatives:

1) Drop the 24-bit short instructions, they're a weird length.
2) Go to 6-bit opcodes for the 16-bit short instructions, limiting them to the most important data types.

And only for the most used OpCodes.

3) Stick with only 8 (or even only 16) registers as the destination of a 16-bit immediate.

Unlikely to be compiler friendly.

Maybe I can squeeze more and avoid having to do any of them; if I must choose, (2) sounds like the most attractive, as a short instruction that
can only work on the first 8 registers is disfavored anyways.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 18 10:57:29 2026

From Newsgroup: comp.arch

On 5/17/2026 1:37 PM, BGB wrote:

snip

It is more VLIW vs In-Order, and In-Order vs OoO.

VLIW vs In-Order:
    VLIW:
      + Slightly cheaper logic;
      - Binary depends more on processor specifics.
    In-Order:
      - Needs logic to handle register deps and lookup opcode flags.
      + Code does not depend on uArch.

Another disadvantage is less efficient memory utilization due to taking
up space for the template bits. It also causes a correspondingly less efficient memory bandwidth usage. This is particularly apparent in
EPIC, as they only get three instructions in 128 bits versus four in a traditional RISC (Although you could argue the longer instructions do
more, but this isn't proven.).
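
The arithmetic behind that comparison, as a small sketch. The 5-bit template and 41-bit slot sizes are the IA-64 bundle layout, not something stated in this thread, and the 32-bit figure assumes a fixed-length RISC encoding.

```python
BUNDLE_BITS = 128      # one EPIC (IA-64) bundle
TEMPLATE_BITS = 5      # template field in the IA-64 bundle format
SLOTS = 3              # instructions per bundle

slot_bits = (BUNDLE_BITS - TEMPLATE_BITS) // SLOTS   # bits per instruction slot
epic_bits_per_insn = BUNDLE_BITS / SLOTS             # fetch cost per instruction

RISC_INSN_BITS = 32    # assumed fixed-length RISC encoding
risc_insns_per_128 = BUNDLE_BITS // RISC_INSN_BITS

# Relative memory and fetch bandwidth per instruction, EPIC vs RISC.
overhead = epic_bits_per_insn / RISC_INSN_BITS
print(slot_bits, risc_insns_per_128, round(overhead, 2))
```

So for the same instruction count, the bundled form needs about a third more bytes, before considering whether each instruction does more work.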
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2
