So, recent features added to my core ISA: None.
Reason: Not a whole lot that brings much benefit.
Have ended up recently more working on the RISC-V side of things,
because there are still gains to be made there (stuff is still more
buggy, less complete, and slower than XG2).
On the RISC-V side, did experiment with Branch-compare-Immediate instructions, but unclear if I will carry them over:
Adds a non-zero cost to the decoder;
Cost primarily associated with dealing with a second immed.
Effect on performance is very small (< 1%).
In my case, I added them as jumbo-prefixed forms, so:
BEQI Imm17s, Rs, Disp12s
Also added Store-with-Immediate, with a similar mechanism:
MOV.L Imm17s, (Rm, Disp12s*1)
As such, it basically dropped out for free.
Also unclear if it will be carried over. Also gains little, as in most
of the store-with-immediate scenarios, the immediate is 0.
Instructions with a less than 1% gain and no compelling edge case are essentially clutter.
I can note that some of the niche ops I did add, like special-case
RGB555 to Index8 or RGBI, were because at least they had a significant
effect in one use-case (such as, speeding up how quickly the GUI can do redraw operations).
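For a sense of why such an op pays off: the plain-C fallback runs to a dozen-odd operations per pixel. A rough sketch (the exact RGB555-to-RGBI mapping here is a guess, just to show the op count):

  #include <stdint.h>

  /* Hypothetical per-pixel fallback; the mapping is illustrative. */
  static inline uint8_t rgb555_to_rgbi(uint16_t c)
  {
      int r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
      int i = (r + g + b) >= 48;  /* crude shared-intensity bit */
      return (uint8_t)(((r >> 4) << 3) | ((g >> 4) << 2) |
                       ((b >> 4) << 1) | i);
  }

That is roughly 12-15 ALU ops per pixel in a redraw loop, versus 1 with a dedicated instruction.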
My usual preference in these cases is to assign 64-bit encodings, as the instructions might only be used in a few edge cases, so it becomes a
waste to assign them spots in the more valuable 32-bit encoding space.
The more popular option was seemingly another person's proposal: to define
them as 32-bit encodings.
Their proposal was effectively:
Bcc Imm5, Rs1', Disp12
(IOW: a 3-bit register field, in a 32-bit instruction)
I don't like this; it is very off-balance.
Better IMO: Bcc Imm6s, Rs1, Disp9s (+/- 512B)
The 3-bit register field also makes it nearly useless with my compiler:
a 3-bit field (per the RVC convention implied by Rs1') can only reach
X8..X15, while my compiler (in its RV mode) primarily uses X18..X27 for
variables (IOW: the callee-save registers). But, maybe moot, as either
way it would still save less than 1%.
Also, as for any ops with 3-bit registers:
Would make superscalar harder and more expensive;
Would add ugly edge cases and cost to the instruction decoder;
...
I would prefer it if people did not go that route (and tried to keep
things at least mostly consistent, avoiding making a dog-chewed mess of
the 32-bit ISA).
If you really feel the need for 3-bit register fields... Maybe, go to a larger encoding?...
When I defined my own version of BccI (with a 64-bit encoding), how many
new instructions did I need to define in the 32-bit base ISA: Zero.
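One plausible reading of the mechanism (bit placement illustrative, not the actual encoding):

  // 32-bit base: BEQ Rt, Rs, Disp12 already exists.
  // 64-bit BEQI = Jumbo-Imm prefix + that same BEQ word:
  //   [ JUMBO: ~12 more immediate bits ][ BEQ (Rt field = imm low 5 bits), Rs, Disp12 ]
  // The prefix repurposes and widens the Rt field into Imm17s;
  // no new 32-bit opcode is consumed.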
But, my overall goal still being:
Try to make it not suck.
But, it still kinda sucks.
And, people don't want to admit that it kinda sucks;
Or, that going some directions will make things worse.
Seems like a mostly pointless uphill battle trying to convince anyone of things that (at least to me) seem kinda obvious.
On 1/30/2025 5:48 PM, MitchAlsup1 wrote:
> On Thu, 30 Jan 2025 20:00:22 +0000, BGB wrote:
>> So, recent features added to my core ISA: None.
>> Reason: Not a whole lot that brings much benefit.
>> Have ended up recently more working on the RISC-V side of things,
>> because there are still gains to be made there (stuff is still more
>> buggy, less complete, and slower than XG2).
>> On the RISC-V side, did experiment with Branch-compare-Immediate
>> instructions, but unclear if I will carry them over:
>> Adds a non-zero cost to the decoder;
>> Cost primarily associated with dealing with a second immed.
>> Effect on performance is very small (< 1%).
> I find this a little odd--My 66000 has a lot of CMP #immed-BC
> a) so I am sensitive as this is break-even wrt RISC-V
> b) But perhaps the small gains are due to something about
> .. how the pair runs down the pipe as opposed to how the
> .. single runs down the pipe.
Issue I had seen is mostly, "How often does it come up?":
Seemingly, around 100-150 or so instructions between each occurrence on average (excluding cases where the constant is zero; comparing with zero being more common).
What does it save:
Typically 1 cycle that might otherwise be spent loading the value into a register (if this instruction doesn't end up getting run in parallel
with another prior instruction).
In the BGBCC output, the main case it comes up is primarily in "for()"
loops (followed by the occasional if-statement), so one might expect
this would increase its probability of having more of an effect.
But, seemingly, not enough tight "for()" loops and similar in use for it
to have a more significant effect.
So, in the great "if()" ranking:
if(x COND 0) ... //first place
if(x COND y) ... //second place
if(x COND imm) ... //third place
However, a construct like:
for(i=0; i<10; i++)
{ ... }
Will emit two of them, so they are not *that* rare either.
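To make "two of them" concrete, plausible output for that loop (operand order as in the BEQI example above; registers and labels illustrative):

  MV    X18, X0           // i = 0
  BGEI  10, X18, .L_done  // entry guard: skip body if i >= 10
  .L_loop:
  ...                     // loop body
  ADDI  X18, X18, 1       // i++
  BLTI  10, X18, .L_loop  // back edge: repeat while i < 10
  .L_done: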
Still, a lot rarer in use than:
val=ptr[idx];
Though...
Have noted though that simple constant for-loops are a minority; far
more often they are something like:
for(i=0; i<n; i++)
{ ... }
Which doesn't use any.
Or:
while(i--)
{ ... }
Which uses a compare with zero (in RV, this can be encoded with the zero
register; in BJX2 it has its own dedicated instruction due to the lack
of a zero register; some of these were formally dropped in XG3, which does
have access to a zero register, where encoding the op using ZR instead is considered preferable).
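For reference, the compare-with-zero case in RV needs no immediate at all; X0 serves directly (registers illustrative):

  // while(i--) { ... }
  .L_top:
  MV    X11, X10          // save old i
  ADDI  X10, X10, -1      // i--
  BEQ   X11, X0, .L_end   // exit once the old value was zero
  ...                     // loop body
  J     .L_top
  .L_end: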
Huawei had a "less bad" encoding, but they burnt basically the entire
User-1 block on it, so that isn't going to fly.
Generally, around 95% of the function-local branches can hit in a Disp9,
vs 98% for Disp12. So, better to drop to Disp9.
I suggest a psychiatrist.
People are pointing to charts gathered by mining binaries and being
like: "X10 and X11 are the two most commonly used registers".
But, this is like pointing at x86 and being like:
"EAX and ECX are the top two registers, who needs such obscure registers
as ESI and EDI"?...
>> When I defined my own version of BccI (with a 64-bit encoding), how many
>> new instructions did I need to define in the 32-bit base ISA: Zero.
> How many 64-bit encodings did My 66000 need:: zero.
> {Hint: the words following the instruction specifier have no internal
> format.}
I consider the combination of Jumbo-Prefix and Suffix instruction to be
a 64-bit instruction.
However, have noted that XG3 does appear to be faster than the original Baseline/XG1 ISA.
Where, to recap:
XG1 (Baseline):
  16/32/64/96 bit encodings;
  16-bit ops can access R0..R15 with 4b registers;
    Only 2R or 2RI forms for 16-bit ops;
    16-bit ISA still fairly similar to SuperH.
  5-bit register fields by default;
    6-bit available for an ISA subset.
  Disp9u and Imm9u/n for most immediate-form instructions;
  32 or 64 GPRs, default 32;
  8 argument registers.
XG2:
  32/64/96 bit encodings;
    All 16-bit encodings dropped.
  6-bit register fields (via a wonky encoding);
  Same basic instruction format as XG1,
    but 3 new bits stored inverted in the HOB of instr words;
  Mostly Disp10s and Imm10u/n;
  64 GPRs native;
  16 argument registers.
XG3:
  Basically repacked XG2;
    Can exist in the same encoding space as RISC-V ops;
    Aims for ease of compatibility with RV64G.
  Encoding was made "aesthetically nicer":
    All the register bits are contiguous and non-inverted;
    Most immediate fields are also once again contiguous;
    ...
  Partly reworks branch instructions;
    Scale=4, usually relative to BasePC (like RV).
  Uses RV's register numbering space (and ABI);
    Eg: SP at R2 vs R15, ...
    (Partly carried over from XG2RV, which is now defunct.)
  64 GPRs, but fudged into RV ABI rules;
    Can't rebalance the ABI without breaking RV compatibility;
    Breaking RV compatibility would defeat its reason for existing.
  8 argument registers (because of the RV ABI);
    Could in theory expand to 16, but this would cause issues.
  Despite being based on XG2,
    BGBCC treats XG3 as an extension to RISC-V.
Then, RV:
  16/32; 48/64/96 (Ext)
  Has 16-bit ops:
    Which are horribly dog-chewed,
    and only manage a handful of instructions;
    Many of the ops can only access X8..X15;
    With GCC, enabling RVC saves around 20% off the ".text" size.
  Imm12s and Disp12s for most ops;
  Lots of dog-chew in the encodings (particularly the Disp fields);
    JAL is basically confetti (bit layout below).
  ...
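For reference, the J-type immediate layout behind the "confetti" remark, per the RV spec:

  // JAL rd, offset: instruction bits 31..12 hold the offset bits as
  //   imm[20] | imm[10:1] | imm[11] | imm[19:12]
  // i.e. four non-contiguous, reordered chunks, rather than one plain field.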
In its basic form, RV is the worst performing option here, but people actually care about RISC-V, so supporting it is value-added.
>> Seems like a mostly pointless uphill battle trying to convince anyone of
>> things that (at least to me) seem kinda obvious.
> Do not waste your time teaching pigs to put on lipstick. ...
Theoretically, people who are working on trying to improve performance
should also see the obvious things, namely that the primary issues
negatively affecting performance are:
The lack of Register-Indexed Load/Store;
Cases where immediate and displacement fields are not big enough;
Lack of Load/Store Pair.
If you can fix a few 10%+ issues, this will save a whole lot more than focusing on 1% issues.
Better to go to the 1% issues *after* addressing the 10% issues.
If 20-30% of the active memory accesses are for arrays, and one needs to
do, SLLI+ADD+Ld/St, this sucks.
If your Imm12 fails, and you need to do:
LUI+ADDI+Op
This also sucks.
If your Disp12 fails, and you do LUI+ADD+Ld/St, likewise.
They can argue, but with Zba, we can do:
SHnADD+Ld/St
But, this is still worse than a single Ld/St.
If these issues are addressed, there is around a 30% speedup, even with
a worse compiler.
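Concretely, the cases above in RV64 (registers illustrative; the register-indexed form is non-standard, its syntax hypothetical):

  // val = ptr[idx];  (4-byte elements)
  SLLI   X12, X11, 2         // idx*4
  ADD    X12, X10, X12       // ptr + idx*4
  LW     X13, 0(X12)         // 3 instructions
  // Same, with Zba:
  SH2ADD X12, X11, X10       // ptr + idx*4 in one op
  LW     X13, 0(X12)         // 2 instructions
  // x = y + 0x12345;  (constant exceeds Imm12)
  LUI    X12, 0x12           // upper 20 bits
  ADDI   X12, X12, 0x345     // low 12 bits
  ADD    X13, X14, X12       // 3 instructions vs 1 with a larger immediate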
On 1/31/2025 1:30 PM, MitchAlsup1 wrote:
>> Generally, around 95% of the function-local branches can hit in a Disp9,
>> vs 98% for Disp12. So, better to drop to Disp9.
> DISP16 reaches farther...
But...
Disp16 is not going to fit into such a 32-bit encoding...
But, say, 16+6+5+3 = 30 bits, which leaves only 2 bits of opcode.
Would have burned the entire 32-bit encoding space on BccI ...
In XG3's encoding scheme, a similar construct would give:
Bcc Imm17s, Rs, Disp10s
Or:
Bcc Rt, Rs, Disp33s
But, where Bcc can still encode R0..R63.
It is possible that a 96-bit encoding could be defined:
Bcc Imm26s, Rs, Disp33 //RV+Jx
Bcc Imm30s, Rs, Disp33 //XG3
Granted, I understand a prefix as being fetched and decoded at the same
time as the instruction it modifies.
Some people seem to imagine prefixes as executing independently and then setting up some sort of internal registers which carry state over to the following instruction.
Ironically though, the GCC and Clang people, and the RV people, are
seemingly also averse to scenarios that involve using implicit runtime calls.
Granted, looking at it, I suspect things like implicit runtime calls (or
call-threaded code) would be a potential "Achilles' heel" for GCC
performance, as its register-allocation strategy seems to prefer using
scratch registers and then spilling them around function calls (rather
than callee-save registers, which don't require a spill).
So, if one emits chunks of code that are basically end-to-end function
calls, they may perform more poorly than they might have otherwise.
On 1/31/2025 10:05 PM, MitchAlsup1 wrote:
Whereas, if performance is dominated by a piece of code that looks like,
say:
v0=dytf_int2fixnum(123);
v1=dytf_int2fixnum(456);
v2=dytf_mul(v0, v1);
v3=dytf_int2fixnum(789);
v4=dytf_add(v2, v3);
v5=dytf_wrapsymbol("x");
dytf_storeindex(obj, v5, v4);
...
With, say, N levels of call graph in each called function, but with this
sort of code still managing to dominate the total CPU time ("Self%" time).
This seems to be a situation where callee-save registers are a big win
for performance IME.
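A minimal sketch of the difference (RV64, registers illustrative):

  // ... v2 live across the next call ...
  // Scratch-register allocation: spill/reload around every call:
  SD    X12, 0(SP)           // spill v2 before the call
  CALL  dytf_int2fixnum
  LD    X12, 0(SP)           // reload v2 afterwards
  // Callee-save allocation: one save/restore pair in prologue/epilogue,
  // then v2 just sits in X19 across all the calls:
  MV    X19, X10             // no per-call memory traffic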