Had recently been working on getting BGBCC to target RV64G.
M66: 1 instruction
Array Load/Store:
XG2: 1 instruction
M66: 1 instruction (anywhere in 64-bit memory)
RV64: 3 instructions
Global Variable:
XG2: 1 instruction (if within 2K of GBR)
M66: 0 instructions
RV64: 1 or 4 instructions
Constant Load into register (not R5):
XG2: 1 instruction
M66: 1 instruction
RV64: ~ 1-6 instructions
Operator with 32-bit immediate:
BJX2: 1 instruction
M66: 1 instruction
RV64: 3 instructions.
Operator with 64-bit immediate:
BJX2: 1 instruction
RV64: 4-9 instructions.
Floating point is still a bit of a hack, as it is currently implemented
by shuffling values between GPRs and FPRs, but sorta works.
RV's selection of 3R compare ops is more limited:
RV: SLT, SLTU
BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
A lot of these cases require a multi-op sequence to implement with just
SLT and SLTU.
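For example, a rough sketch of synthesizing two of the missing compares in RV64 (register choices arbitrary):
XOR X5, X10, X11 //CMPEQ: XOR result is zero iff equal
SLTIU X5, X5, 1 //X5 = (X10 == X11)
SLT X5, X10, X11 //CMPGE (signed): compute less-than...
XORI X5, X5, 1 //... then flip the bit: X5 = (X10 >= X11)
So, 2 instructions each where BJX2 needs 1.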
....
On 9/27/2024 7:50 AM, Robert Finch wrote:
On 2024-09-27 5:46 a.m., BGB wrote:
---------
But, BJX2 does not spam the ADD instruction quite so hard, so it is
more forgiving of latency. In this case, an optimization that reduces
the common-case ADD to 1 cycle was being used (though it only works in
the CPU core if both operands are in signed 32-bit range and no
overflow occurs; IIRC, it optionally uses the sign-extended AGU output
as a stopgap ALU output before the result arrives from the main ALU the
next cycle).
Comparably, it appears BGBCC leans more heavily into ADD and SLLI than
GCC does, with a fair chunk of the total instructions executed being
these two (more cycles are spent adding and shifting than doing memory
load or store...).
That seems to be a bit off. Mem ops are usually around 1/4 of
instructions. Spending more than 25% on adds and shifts seems like a
lot. Is it address calcs? Register loads of immediates?
It is both...
In BJX2, the dominant instruction tends to be memory Load.
Typical output from BGBCC for Doom is (at runtime):
~ 70% fixed-displacement;
~ 30% register-indexed.
Static output differs slightly:
~ 84% fixed-displacement;
~ 16% register-indexed.
RV64G lacks register-indexed addressing, only having fixed displacement.
If you need to do a register-indexed load in RV64:
SLLI X5, Xo, 2 //scale index by element size (4 bytes here)
ADD X5, Xm, X5 //add base and scaled index
LW Xn, X5, 0 //do the load
This case is bad...
Also global variables outside the 2kB window:
LUI X5, DispHi
ADDI X5, X5, DispLo
ADD X5, GP, X5
LW Xn, X5, 0
Where, sorting global variables by usage priority gives:
~ 35%: in range
~ 65%: not in range
Comparably, XG2 has a 16K or 32K reach here (depending on immediate
size), which hits most of the global variables. The fallback Jumbo
encoding hits the rest.
Theoretically, could save 1 instruction here, but would need to add two
more reloc types to allow for:
LUI, ADD, Lx
LUI, ADD, Sx
Because annoyingly Load and Store have different displacement encodings;
and I still need the base form for other cases.
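Say (the new relocs being the hypothetical part), the 4-instruction sequence above could then drop the ADDI:
LUI X5, DispHi //hi part of the GP-relative offset
ADD X5, GP, X5
LW Xn, X5, DispLo //lo part folded into the load's displacement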
A more compact way to load/store global variables would be to use absolute 32-bit or PC-relative addressing:
LUI + Lx/Sx : Abs32
AUIPC + Lx/Sx : PC-Rel32
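Expanded (assuming the usual hi/lo split relocs):
LUI X5, AbsHi //Abs32
LW Xn, X5, AbsLo
AUIPC X5, DispHi //PC-Rel32
LW Xn, X5, DispLo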
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC but limited to 32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
from PBO (namely, using GP for a global section and then chaining the sections for each binary).
Main difference being that FDPIC uses fat
function pointers and does the GP reload on the caller, vs PBO where I
use narrow function pointers and do the reload on the callee (with
load-time fixups for the PBO Offset).
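As a sketch of the caller-side difference (not any established RV64 FDPIC ABI; the descriptor layout here is hypothetical), a call through an FDPIC-style fat pointer in Xp might look like:
LD X6, Xp, 0 //code address from the function descriptor
LD X3, Xp, 8 //callee's GP from the descriptor (GP is X3 in RV)
JALR X1, X6, 0 //call; the caller has done the GP reload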
The result of all this is a whole lot of unnecessary shifts and ADDs.
Seemingly, even more for BGBCC than for GCC, which already had a lot of
shifts and adds.
BGBCC basically entirely dethrones the Load and Store ops ...
Possibly more so than GCC, which tended to turn most constant loads
into memory loads: it would load the address of a table of constants
into a register and then pull constants from the table, rather than
composing them inline.
Say, something like:
AUIPC X18, DispHi
ADDI X18, X18, DispLo
(X18 now holds a table of constants, pointing into .rodata)
And, when it needs a constant:
LW Xn, X18, Disp //offset of the constant it wants.
Or:
LD Xn, X18, Disp //64-bit constant
Currently, BGBCC does not use this strategy.
Though, for 64-bit constants it could be more compact and faster.
But, better still would be having Jumbo prefixes or similar, or even a
SHORI instruction.
Say, 64-bit constant-load in SH-5 or similar:
xxxxyyyyzzzzwwww
MOV ImmX, Rn
SHORI ImmY, Rn
SHORI ImmZ, Rn
SHORI ImmW, Rn
Where, one loads the constant in 16-bit chunks; each SHORI shifts Rn
left by 16 bits and ORs in the next 16-bit immediate.
Don't you ever snip anything ??
On 9/27/2024 10:52 AM, MitchAlsup1 wrote:
My 66000 can do:: 1 < i && i <= MAX in 1 instruction
BJX2:
CMPQGT R4, 1, R16
CMPQLT R4, (MAX+1), R17 //*1
AND R16, R17, R5
So, more than 1 instruction, but less than faking it with SLT / SLTI ...
It is better for performance though to be able to flip the output bit in
the pipeline than to need to use an XOR instruction or similar.
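For reference, faking the same range check via SLTI in RV64 runs about 4 instructions (a sketch, assuming i is in X10 and MAX+1 fits in a 12-bit immediate):
SLTI X5, X10, 2 //X5 = (i < 2)
XORI X5, X5, 1 //flip: X5 = (i > 1), the XOR cost mentioned above
SLTI X6, X10, MAX+1 //X6 = (i <= MAX)
AND X5, X5, X6 //AND the two conditions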
On 9/27/2024 2:40 PM, MitchAlsup1 wrote:
On Fri, 27 Sep 2024 18:26:28 +0000, BGB wrote:
But, generally this does still impose limits:
Can't reorder instructions across a label;
Can't move instructions with an associated reloc;
Can't reorder memory instructions unless they can be proven to not alias (loads may be freely reordered, but the relative order of loads and
stores may not unless provably non-aliasing);
The effectiveness of this does depend on how the C code is written
though (works favorably with larger blocks of mostly-independent expressions).
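For example (a hypothetical pair), with no proof that Xa and Xb don't alias, the load below can't be hoisted above the store:
SW Xv, Xa, 0 //store through Xa
LW Xt, Xb, 0 //may read the just-stored value if Xb == Xa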
-----
Most agree it is closer to 30% than 25% {{unless you clutter up the ISA
such that your typical memref needs a support instruction}}.
Cough, RV64...
-----
Which makes that 16% (above) into 48% and renormalizing to::
~ 63% fixed-displacement;
~ 36% register-indexed and support instructions.
Yeah.
I think there are reasons here why I am generally getting lackluster performance out of RV64...
Comparably, XG2 has a 16K or 32K reach here (depending on immediate
size), which hits most of the global variables. The fallback Jumbo
encoding hits the rest.
I get ±32K with 16-bit displacements
Baseline has special case 32-bit ops:
MOV.L (GBR, Disp10u), Rn //4K
MOV.Q (GBR, Disp10u), Rn //8K
But, in XG2, it gains 2 bits:
MOV.L (GBR, Disp12u), Rn //16K
MOV.Q (GBR, Disp12u), Rn //32K
Jumbo can encode +/- 4GB here (64-bit encoding).
MOV.L (GBR, Disp33s), Rn //+/- 4GB
MOV.Q (GBR, Disp33s), Rn //+/- 4GB
Mostly because GBR displacements are unscaled.
Plan for XG3 is that all Disp33s encodings would be unscaled.
BJX2 can also do (PC, Disp33s) in a single logical instruction...
But, RISC-V can't...
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC but limited to 32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
from PBO (namely, using GP for a global section and then chaining the
sections for each binary).
How are you going to do dense PIC switch() {...} in RISC-V ??
Already implemented...
With pseudo-instructions:
SUB Rs, $(MIN), R10
MOV $(MAX-MIN), R11
BGTU R11, R10, Lbl_Dfl
MOV .L0, R6 //AUIPC+ADD
SHAD R10, 2, R10 //SLLI
ADD R6, R10, R6
JMP R6 //JALR X0, X6, 0
.L0:
BRA Lbl_Case0 //JAL X0, Lbl_Case0
BRA Lbl_Case1
...
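Expanded into plain RV64 instructions, roughly (a sketch; register numbers and the imm12 assumptions are illustrative):
ADDI X10, Xs, -MIN //assumes MIN fits in imm12
ADDI X11, X0, MAX-MIN //assumes MAX-MIN fits in imm12
BLTU X11, X10, Lbl_Dfl //default if (i-MIN) >u (MAX-MIN)
AUIPC X6, DispHi //address of .L0
ADDI X6, X6, DispLo
SLLI X10, X10, 2 //scale: each JAL entry is 4 bytes
ADD X6, X6, X10
JALR X0, X6, 0 //jump into the branch table
.L0:
JAL X0, Lbl_Case0
JAL X0, Lbl_Case1
...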
Currently, BGBCC does not use this strategy.
Though, for 64-bit constants it could be more compact and faster.
But, better still would be having Jumbo prefixes or similar, or even a
SHORI instruction.
Better Still Still is having 32-bit and 64-bit constants available
from the instruction stream and positioned in either operand position.
Granted...
Say, 64-bit constant-load in SH-5 or similar:
xxxxyyyyzzzzwwww
MOV ImmX, Rn
SHORI ImmY, Rn
SHORI ImmZ, Rn
SHORI ImmW, Rn
Where, one loads the constant in 16-bit chunks.
Yech
But, 4 is still less than 6.
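(For reference, a general 64-bit constant load in RV64 without a constant pool is a shift-and-add chain along these lines; the exact split, and the 4-9 instruction count, depends on the value:)
LUI Xn, Imm20 //top ~20 bits of the value
ADDIW Xn, Xn, Imm12 //next 12 bits (a sign-extended 32-bit chunk so far)
SLLI Xn, Xn, 12
ADDI Xn, Xn, Imm12 //next 12 bits
SLLI Xn, Xn, 12
ADDI Xn, Xn, Imm12 //low 12 bits; ~56 bits covered, the worst cases
//need another SLLI+ADDI pair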