• Re: Misc: BGBCC targeting RV64G, initial results...

    From MitchAlsup1@21:1/5 to BGB on Fri Sep 27 15:52:34 2024
    On Fri, 27 Sep 2024 9:46:01 +0000, BGB wrote:

    Had recently been working on getting BGBCC to target RV64G.

    Array Load/Store:
    M66: 1 instruction
    XG2: 1 instruction
    RV64: 3 instructions

    Global Variable:
    M66: 1 instruction (anywhere in 64-bit memory)
    XG2: 1 instruction (if within 2K of GBR)
    RV64: 1 or 4 instructions

    Constant Load into register (not R5):
    M66: 0 instructions
    XG2: 1 instruction
    RV64: ~ 1-6

    Operator with 32-bit immediate:
    M66: 1 instruction
    BJX2: 1 instruction;
    RV64: 3 instructions.

    Operator with 64-bit immediate:
    M66: 1 instruction
    BJX2: 1 instruction;
    RV64: 4-9 instructions.



    Floating point is still a bit of a hack, as it is currently implemented
    by shuffling values between GPRs and FPRs, but sorta works.

    My 66000 has a common register file.


    RV's selection of 3R compare ops is more limited:
    RV: SLT, SLTU
    BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
    A lot of these cases require a multi-op sequence to implement with just
    SLT and SLTU.

    My 66000 can do:: 1 < i && i <= MAX in 1 instruction


    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Sep 27 19:40:32 2024
    On Fri, 27 Sep 2024 18:26:28 +0000, BGB wrote:

    On 9/27/2024 7:50 AM, Robert Finch wrote:
    On 2024-09-27 5:46 a.m., BGB wrote:
    ---------

    But, BJX2 does not spam the ADD instruction quite so hard, so is more forgiving of latency. In this case, an optimization that reduces
    common-case ADD to 1 cycle was being used (it only works though in the
    CPU core if the operands are both in signed 32-bit range and no overflow occurs; IIRC optionally using a sign-extended AGU output as a stopgap
    ALU output before the output arrives from the main ALU the next cycle).

    RISC-V group opinion is that "we have done nothing to damage pipeline
    operating frequency". {{Except the moving of register specifier fields
    between 32-bit and 16-bit instructions; except for: AGEN-RAM-CMP-ALIGN
    in 2 cycles, and several others...}}


    Comparably, it appears BGBCC leans more heavily into ADD and SLLI than
    GCC does, with a fair chunk of the total instructions executed being
    these two (more cycles are spent adding and shifting than doing memory
    load or store...).

    That seems to be a bit off. Mem ops are usually around 1/4 of

    Most agree it is closer to 30% than 25% {{Unless you clutter up the ISA
    such that your typical memref needs a support instruction.}}

    instructions. Spending more than 25% on adds and shifts seems like a
    lot. Is it address calcs? Register loads of immediates?


    It is both...


    In BJX2, the dominant instruction tends to be memory Load.
    Typical output from BGBCC for Doom is (at runtime):
    ~ 70% fixed-displacement;
    ~ 30% register-indexed.
    Static output differs slightly:
    ~ 84% fixed-displacement;
    ~ 16% register-indexed.

    RV64G lacks register-indexed addressing, only having fixed displacement.

    If you need to do a register-indexed load in RV64:
    SLLI X5, Xo, 2 //shift by size of index
    ADD X5, Xm, X5 //add base and index
    LW Xn, X5, 0 //do the load

    This case is bad...

    Which makes that 16% (above) into 48% and renormalizing to::
    ~ 63% fixed-displacement;
    ~ 36% register-indexed and support instructions.
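The renormalization above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the static mix quoted earlier (84% fixed-displacement, 16% register-indexed) and that every register-indexed access expands to the three-instruction SLLI/ADD/LW sequence. The function name is made up for illustration.

```python
def renormalize(fixed_pct, indexed_pct, ops_per_indexed=3):
    """Return (fixed, indexed) shares after each indexed access expands."""
    expanded = indexed_pct * ops_per_indexed      # 16 * 3 = 48
    total = fixed_pct + expanded                  # 84 + 48 = 132
    return 100.0 * fixed_pct / total, 100.0 * expanded / total

fixed, indexed = renormalize(84, 16)
print(f"fixed-displacement: {fixed:.1f}%")
print(f"indexed + support:  {indexed:.1f}%")
```

The two shares come out at about 63.6% and 36.4%, in line with the ~63% / ~36% figures quoted.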


    Also global variables outside the 2kB window:
    LUI X5, DispHi
    ADDI X5, X5, DispLo
    ADD X5, GP, X5
    LW Xn, X5, 0

    Where, sorting global variables by usage priority gives:
    ~ 35%: in range
    ~ 65%: not in range

    Illustrating the fallacy of 12-bits of displacement.
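For reference, a minimal sketch of the arithmetic behind the LUI/ADDI pair in the out-of-range case. It assumes the usual RISC-V convention that ADDI sign-extends its 12-bit immediate, so the upper part has to be rounded to compensate. The function names are hypothetical and the instruction counts are simply the 1 and 4 from the sequences quoted above.

```python
def fits_simm12(x):
    """True if x fits the signed 12-bit displacement of an RV64 load/store."""
    return -2048 <= x <= 2047

def split_hi_lo(disp):
    """Split a 32-bit displacement into LUI (hi20) and ADDI (lo12) parts.

    ADDI sign-extends its immediate, so hi absorbs the correction
    whenever the low 12 bits read as a negative number.
    """
    lo = ((disp & 0xFFF) ^ 0x800) - 0x800   # sign-extend the low 12 bits
    hi = ((disp - lo) >> 12) & 0xFFFFF      # upper 20 bits after correction
    return hi, lo

def instructions_for_global(disp):
    """Instruction count for a GP-relative load in the two cases above."""
    return 1 if fits_simm12(disp) else 4    # LW  vs  LUI+ADDI+ADD+LW
```

For example, `split_hi_lo(2048)` gives a hi part of 1 and a lo part of -2048, since 4096 - 2048 = 2048.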

    Comparably, XG2 has a 16K or 32K reach here (depending on immediate
    size), which hits most of the global variables. The fallback Jumbo
    encoding hits the rest.

    I get ±32K with 16-bit displacements


    Theoretically, could save 1 instruction here, but would need to add two
    more reloc types to allow for:
    LUI, ADD, Lx
    LUI, ADD, Sx
    Because annoyingly Load and Store have different displacement encodings;
    and I still need the base form for other cases.


    More compact way to load/store global variables would be to use absolute 32-bit or PC relative:
    LUI + Lx/Sx : Abs32
    AUIPC + Lx/Sx : PC-Rel32

    MEM Rd,[IP,,DISP32/64] // IP-rel

    -----

    Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
    (there does seem to be some interest for ELF FDPIC but limited to 32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
    from PBO (namely, using GP for a global section and then chaining the sections for each binary).

    How are you going to do dense PIC switch() {...} in RISC-V ??

    Main difference being that FDPIC uses fat
    function pointers and does the GP reload on the caller, vs PBO where I
    use narrow function pointers and do the reload on the callee (with
    load-time fixups for the PBO Offset).


    The result of all this is a whole lot of unnecessary Shifts and ADDs.

    Seemingly, even more for BGBCC than for GCC, which already had a lot of shifts and adds.

    BGBCC basically entirely dethrones the Load and Store ops ...


    Possibly more so than GCC, which tended to turn most constant loads into memory loads. It would load a table of constants into a register and
    then pull constants from the table, rather than compose them inline.

    Say, something like:
    AUIPC X18, DispHi
    ADDI X18, X18, DispLo
    (X18 now holds a table of constants, pointing into .rodata)

    And, when it needs a constant:
    LW Xn, X18, Disp //offset of the constant it wants.
    Or:
    LD Xn, X18, Disp //64-bit constant


    Currently, BGBCC does not use this strategy.
    Though, for 64-bit constants it could be more compact and faster.

    But, better still would be having Jumbo prefixes or similar, or even a
    SHORI instruction.

    Better Still Still is having 32-bit and 64-bit constants available
    from the instruction stream and positioned in either operand position.

    Say, 64-bit constant-load in SH-5 or similar:
    xxxxyyyyzzzzwwww
    MOV ImmX, Rn
    SHORI ImmY, Rn
    SHORI ImmZ, Rn
    SHORI ImmW, Rn
    Where, one loads the constant in 16-bit chunks.

    Yech
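A small model of the four-step constant load quoted above, assuming SHORI shifts the register left by 16 and ORs in a 16-bit immediate. The helper names are invented for illustration, and the initial MOV is modelled without sign extension because the three later shifts push those upper bits out anyway.

```python
MASK64 = (1 << 64) - 1

def shori(rn, imm16):
    """Shift Rn left by 16 and OR in a 16-bit immediate."""
    return ((rn << 16) | (imm16 & 0xFFFF)) & MASK64

def load_const64(value):
    """Build a 64-bit constant from four 16-bit chunks, high chunk first."""
    chunks = [(value >> s) & 0xFFFF for s in (48, 32, 16, 0)]
    rn = chunks[0]              # MOV   ImmX, Rn
    for imm in chunks[1:]:      # SHORI ImmY / ImmZ / ImmW, Rn
        rn = shori(rn, imm)
    return rn
```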



    Don't you ever snip anything ??

  • From MitchAlsup1@21:1/5 to BGB on Fri Sep 27 23:21:01 2024
    On Fri, 27 Sep 2024 21:28:41 +0000, BGB wrote:

    On 9/27/2024 10:52 AM, MitchAlsup1 wrote:

    My 66000 can do::  1 < i && i <= MAX in 1 instruction


    BJX2:
    CMPQGT R4, 1, R16
    CMPQLT R4, (MAX+1), R17 //*1
    AND R16, R17, R5

    So, more than 1 instruction, but less than faking it with SLT / SLTI ...

    CMP Rt,Ri,MAX
    BFIN Rt,label // fin = Fortran IN
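To make the comparison concrete, here is a sketch that models the three-op BJX2 sequence and checks it against the common compiler trick of folding a two-sided range test into one subtract plus one unsigned compare. The folding is a generic technique, not something either poster describes, and it says nothing about how CMP/BFIN works internally. Function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def in_range_three_ops(i, max_val):
    """Model of CMPQGT / CMPQLT / AND: (i > 1) & (i < MAX+1)."""
    r16 = 1 if i > 1 else 0
    r17 = 1 if i < max_val + 1 else 0
    return r16 & r17

def in_range_unsigned(i, max_val):
    """Same test with one subtract and one unsigned compare.

    Valid for max_val >= 2: 2 <= i <= MAX  <=>  (i - 2) mod 2^64 <= MAX - 2.
    """
    return 1 if ((i - 2) & MASK64) <= (max_val - 2) else 0
```

Negative values of `i` wrap to very large unsigned numbers after the subtract, so they fail the single compare just as they fail the `i > 1` test.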
    -----

    It is better for performance though to be able to flip the output bit in
    the pipeline than to need to use an XOR instruction or similar.

    I do integer negation by flipping all the bits and running a carry
    into the subsequent ALU.

    Thus, if the target calculation unit is logical, one gets inversion;
    integer units get negation; and FPUs only invert the sign bit.
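The identity behind this is the usual two's-complement one: -x = ~x + 1, with the +1 supplied as the adder's carry-in. A minimal sketch, assuming 64-bit registers; it illustrates the arithmetic only and is not a description of the actual My 66000 datapath.

```python
MASK64 = (1 << 64) - 1

def alu_add(a, b, carry_in=0):
    """64-bit adder with carry-in, result wrapped to 64 bits."""
    return (a + b + carry_in) & MASK64

def negate(x):
    """Two's-complement negate: invert every bit, then add 1 via carry-in."""
    return alu_add(~x & MASK64, 0, carry_in=1)

def sub(a, b):
    """a - b computed as a + ~b + 1, the same inversion feeding the adder."""
    return alu_add(a, ~b & MASK64, carry_in=1)
```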

  • From MitchAlsup1@21:1/5 to BGB on Sat Sep 28 00:43:49 2024
    On Fri, 27 Sep 2024 23:53:22 +0000, BGB wrote:

    On 9/27/2024 2:40 PM, MitchAlsup1 wrote:
    On Fri, 27 Sep 2024 18:26:28 +0000, BGB wrote:

    But, generally this does still impose limits:
    Can't reorder instructions across a label;
    Can't move instructions with an associated reloc;

    I always did code motion prior to assembler. Code motion only has to
    consider:: 1-operand, 2-operand, 3-operand, branch, label, LD, ST.

    Can't reorder memory instructions unless they can be proven to not alias (loads may be freely reordered, but the relative order of loads and
    stores may not unless provably non-aliasing);

    Same base register different displacement.
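The "same base register, different displacement" rule can be sketched as a conservative disjointness test. This is an illustration under stated assumptions, not either poster's scheduler: the `MemOp` record and function names are invented, and the test only holds if the base register is not redefined between the two operations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemOp:
    base: str       # base register name
    disp: int       # byte displacement
    size: int       # access size in bytes
    is_store: bool

def provably_disjoint(a, b):
    """True only when both ops share a base and their byte ranges miss.

    Assumes the base register is not redefined between the two ops.
    """
    if a.base != b.base:
        return False                      # unknown relationship: be conservative
    return a.disp + a.size <= b.disp or b.disp + b.size <= a.disp

def may_reorder(a, b):
    """Loads reorder freely; anything involving a store needs disjointness."""
    if not a.is_store and not b.is_store:
        return True
    return provably_disjoint(a, b)
```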

    The effectiveness of this does depend on how the C code is written
    though (works favorably with larger blocks of mostly-independent expressions).

    One of the reasons reservation stations became in vogue.

    Most agree it is closer to 30% than 25% {{Unless you clutter up the ISA
    such that your typical memref needs a support instruction.}}


    Cough, RV64...
    -----
    Which makes that 16% (above) into 48% and renormalizing to::
          ~ 63% fixed-displacement;
          ~ 36% register-indexed and support instructions.

    Yeah.

    I think there are reasons here why I am generally getting lackluster performance out of RV64...
    -----
    Comparably, XG2 has a 16K or 32K reach here (depending on immediate
    size), which hits most of the global variables. The fallback Jumbo
    encoding hits the rest.

    I get ±32K with 16-bit displacements


    Baseline has special case 32-bit ops:
    MOV.L (GBR, Disp10u), Rn //4K
    MOV.Q (GBR, Disp10u), Rn //8K

    But, in XG2, it gains 2 bits:
    MOV.L (GBR, Disp12u), Rn //16K
    MOV.Q (GBR, Disp12u), Rn //32K

    Jumbo can encode +/- 4GB here (64-bit encoding).
    MOV.L (GBR, Disp33s), Rn //+/- 4GB
    MOV.Q (GBR, Disp33s), Rn //+/- 4GB

    Mostly because GBR displacements are unscaled.
    Plan for XG3 is that all Disp33s encodings would be unscaled.
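The reach figures in the comments follow from simple arithmetic, assuming the short Disp10u/Disp12u forms scale the displacement by the access size (which is what the 4K/8K and 16K/32K pairs suggest) and the Disp33s form is a plain signed byte offset. The helper names are made up for this sketch.

```python
KIB = 1024
GIB = 1024 ** 3

def reach_scaled(disp_bits, access_bytes):
    """Bytes reachable by an unsigned displacement scaled by access size."""
    return (1 << disp_bits) * access_bytes

def reach_signed_unscaled(disp_bits):
    """Magnitude reachable in each direction by a signed byte displacement."""
    return 1 << (disp_bits - 1)

print(reach_scaled(10, 4) // KIB, "K  MOV.L Disp10u")
print(reach_scaled(10, 8) // KIB, "K  MOV.Q Disp10u")
print(reach_scaled(12, 4) // KIB, "K  MOV.L Disp12u")
print(reach_scaled(12, 8) // KIB, "K  MOV.Q Disp12u")
print(reach_signed_unscaled(33) // GIB, "GB each way, Disp33s")
```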

    The assembler gets to choose based on the memory model::

    MEM Rd,[Rb,Ri<<s,DISP]

    Assembler (or even linker) can choose 32-bit or 64-bit based on a
    variety of things {flags, memory model, size of linked module,...}

    BJX2 can also do (PC, Disp33s) in a single logical instruction...

    But, RISC-V can't...

    What is your definition of "single logical instruction". In my parlance,
    a single logical instruction can be::

    ST #64-bit-const,[Rb,Ri<<s,DISP64]

    is 1 instruction occupying 5 words.
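A rough sketch of the two things being claimed here: the effective address formed by [Rb,Ri<<s,DISP], and the word count of the store example. It assumes 32-bit words and a single word for the instruction itself, which is what makes the total come to 5; both helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def effective_address(rb, ri, s, disp):
    """EA for [Rb, Ri<<s, DISP]: base + scaled index + displacement, mod 2^64."""
    return (rb + (ri << s) + disp) & MASK64

def words_occupied(const_bits, disp_bits, word_bits=32):
    """One instruction word plus whole words for the constant and displacement."""
    return 1 + const_bits // word_bits + disp_bits // word_bits
```

With a 64-bit constant and a 64-bit displacement, `words_occupied(64, 64)` is 1 + 2 + 2.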



    Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
    (there does seem to be some interest for ELF FDPIC but limited to 32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
    from PBO (namely, using GP for a global section and then chaining the
    sections for each binary).

    How are you going to do dense PIC switch() {...} in RISC-V ??

    Already implemented...

    With pseudo-instructions:
    SUB Rs, $(MIN), R10
    MOV $(MAX-MIN), R11
    BGTU R11, R10, Lbl_Dfl

    MOV .L0, R6 //AUIPC+ADD
    SHAD R10, 2, R10 //SLLI
    ADD R6, R10, R6
    JMP R6 //JALR X0, X6, 0

    .L0:
    BRA Lbl_Case0 //JAL X0, Lbl_Case0
    BRA Lbl_Case1
    ...

    Compared to::
    // ADD Rt,Rswitch,#-min
    JTT Rt,#max
    .jttable min, ... , max, default
    adder:

    The ADD is not necessary if min == 0

    The JTT instruction compares Rt with 0 on the low side and max
    on the high side. If Rt is out of bounds, default is selected.

    The table displacements come in {B,H,W,D} selected in the JTT
    (jump through table) instruction. Rt indexes the table, its
    signed value is <<2 and added to address which happens to be
    address of JTT instruction + #(max+1)<<entry. {{The table is
    fetched through the ICache with execute permission}}

    Thus, the table is PIC; and generally 1/4 the size of typical
    switch tables.
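Both approaches reduce to the same dispatch logic: bias the switch value, bounds-check it with one unsigned compare, then index a table of offsets. The following is a generic model of that logic, not the actual JTT encoding or the BGBCC output; `base`, `default`, and the word-offset table layout are assumptions chosen for illustration.

```python
MASK64 = (1 << 64) - 1

def dispatch(x, min_case, max_case, table, default, base=0):
    """Pick a branch target for a dense switch.

    table holds word offsets relative to base, so no absolute
    addresses are stored and the table stays position-independent.
    """
    idx = (x - min_case) & MASK64          # SUB; wraps like an unsigned register
    if idx > max_case - min_case:          # one unsigned compare covers both ends
        return default
    return base + (table[idx] << 2)        # scale word offset to bytes
```

Values below `min_case` wrap to large unsigned indices, so the single compare rejects them along with values above `max_case`.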
    -----
    Currently, BGBCC does not use this strategy.
    Though, for 64-bit constants it could be more compact and faster.

    But, better still would be having Jumbo prefixes or similar, or even a
    SHORI instruction.

    Better Still Still is having 32-bit and 64-bit constants available
    from the instruction stream and positioned in either operand position.


    Granted...


    Say, 64-bit constant-load in SH-5 or similar:
       xxxxyyyyzzzzwwww
       MOV   ImmX, Rn
       SHORI ImmY, Rn
       SHORI ImmZ, Rn
       SHORI ImmW, Rn
    Where, one loads the constant in 16-bit chunks.

    Yech


    But, 4 is still less than 6.

    1 is less than 4, too.
