• Re: Misc: Ongoing status...

    From MitchAlsup1@21:1/5 to BGB on Thu Jan 30 23:48:31 2025
    On Thu, 30 Jan 2025 20:00:22 +0000, BGB wrote:

    So, recent features added to my core ISA: None.
    Reason: Not a whole lot that brings much benefit.


    Have ended up recently more working on the RISC-V side of things,
    because there are still gains to be made there (stuff is still more
    buggy, less complete, and slower than XG2).


    On the RISC-V side, did experiment with Branch-compare-Immediate instructions, but unclear if I will carry them over:
    Adds a non-zero cost to the decoder;
    Cost primarily associated with dealing with a second immed.
    Effect on performance is very small (< 1%).

    I find this a little odd--My 66000 has a lot of CMP #immed-BC
    a) so I am sensitive, as this is break-even wrt RISC-V
    b) But perhaps the small gain is due to something about
    .. how the pair runs down the pipe as opposed to how the
    .. single runs down the pipe.

    In my case, I added them as jumbo-prefixed forms, so:
    BEQI Imm17s, Rs, Disp12s

    IMM17 should be big enough.
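    A minimal sketch of the BEQI semantics as described above (the 17-bit
    immediate and 12-bit displacement widths come from the post; the
    function names, the 8-byte fall-through step, and everything else are
    assumptions for illustration only):

```c
#include <stdint.h>

/* Sign-extend the low 'bits' bits of v. */
static int64_t sext(uint64_t v, int bits) {
    uint64_t m = 1ull << (bits - 1);
    return (int64_t)((v ^ m) - m);
}

/* Hypothetical BEQI: compare a register against a sign-extended
 * 17-bit immediate and, on equality, branch by a sign-extended
 * 12-bit displacement. Returns the next PC. */
static uint64_t beqi_next_pc(uint64_t pc, int64_t rs,
                             uint64_t imm17, uint64_t disp12) {
    if (rs == sext(imm17, 17))
        return pc + (uint64_t)sext(disp12, 12);
    return pc + 8;  /* assumed: jumbo prefix + suffix op = 64 bits */
}
```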

    Also added Store-with-Immediate, with a similar mechanism:
    MOV.L Imm17s, (Rm, Disp12s*1)
    As, it basically dropped out for free.

    Also unclear if it will be carried over. Also gains little, as in most
    of the store-with-immediate scenarios, the immediate is 0.

    Most of my ST w/immediate is floating-point data--Imm17 is not
    going to cut it there.

    Instructions with less than a 1% gain and no compelling edge case are essentially clutter.

    I can note that some of the niche ops I did add, like special-case
    RGB555 to Index8 or RGBI, were because at least they had a significant
    effect in one use case (such as speeding up GUI redraw operations).

    My usual preference in these cases is to assign 64-bit encodings, as the instructions might only be used in a few edge cases, so it becomes a
    waste to assign them spots in the more valuable 32-bit encoding space.


    The more popular option was seemingly another person's: to define
    them as 32-bit encodings.
    Their proposal was effectively:
    Bcc Imm5, Rs1', Disp12
    (IOW: a 3-bit register field, in a 32-bit instruction)
    I don't like this; it is very off-balance.
    Better IMO: Bcc Imm6s, Rs1, Disp9s (+/- 512B)

    This is the case where fusing of CMP #imm16-BC into one op is
    better, unless you can use a 64-bit encoding to directly
    encode that.

    The 3-bit register field also makes it nearly useless with my compiler,
    as my compiler (in its RV mode) primarily uses X18..X27 for variables
    (IOW: the callee save registers). But, maybe moot, as either way it
    would still save less than 1%.

    Also, as for any ops with 3-bit registers:
    Would make superscalar harder and more expensive;
    Would add ugly edge cases and cost to the instruction decoder;
    ...

    3-bit register specifier is not much better than dedicated registers
    {like x86 DIV}.

    I would prefer it if people did not go that route (and tried to keep
    things at least mostly consistent, avoiding making a dog-chewed mess
    of the already dog-chewed 32-bit ISA).

    If you really feel the need for 3-bit register fields... Maybe, go to a larger encoding?...

    I suggest a psychiatrist.

    When I defined my own version of BccI (with a 64-bit encoding), how many
    new instructions did I need to define in the 32-bit base ISA: Zero.

    How many 64-bit encodings did My 66000 need:: zero.
    {Hint: the words following the instruction specifier have no internal
    format}

    <snip>

    But, my overall goal still being:
    Try to make it not suck.
    But, it still kinda sucks.
    And, people don't want to admit that it kinda sucks;
    Or, that going some directions will make things worse.

    On the other hand, I remain upbeat on the ISA I have created.

    Seems like a mostly pointless uphill battle trying to convince anyone of things that (at least to me) seem kinda obvious.

    Do not waste your time teaching pigs to put on lipstick. ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Jan 31 19:30:44 2025
    On Fri, 31 Jan 2025 6:50:24 +0000, BGB wrote:

    On 1/30/2025 5:48 PM, MitchAlsup1 wrote:
    On Thu, 30 Jan 2025 20:00:22 +0000, BGB wrote:

    So, recent features added to my core ISA: None.
    Reason: Not a whole lot that brings much benefit.


    Have ended up recently more working on the RISC-V side of things,
    because there are still gains to be made there (stuff is still more
    buggy, less complete, and slower than XG2).


    On the RISC-V side, did experiment with Branch-compare-Immediate
    instructions, but unclear if I will carry them over:
       Adds a non-zero cost to the decoder;
         Cost primarily associated with dealing with a second immed.
       Effect on performance is very small (< 1%).

    I find this a little odd--My 66000 has a lot of CMP #immed-BC
    a) so I am sensitive, as this is break-even wrt RISC-V
    b) But perhaps the small gain is due to something about
    .. how the pair runs down the pipe as opposed to how the
    .. single runs down the pipe.


    Issue I had seen is mostly, "How often does it come up?":
    Seemingly, around 100-150 or so instructions between each occurrence on average (excluding cases where the constant is zero; comparing with zero being more common).

    What does it save:
    Typically 1 cycle that might otherwise be spent loading the value into a register (if this instruction doesn't end up getting run in parallel
    with another prior instruction).


    In the BGBCC output, the main case it comes up is primarily in "for()"
    loops (followed by the occasional if-statement), so one might expect
    this would increase its probability of having more of an effect.

    But, seemingly, not enough tight "for()" loops and similar in use for it
    to have a more significant effect.

    So, in the great "if()" ranking:
    if(x COND 0) ... //first place
    if(x COND y) ... //second place
    if(x COND imm) ... //third place

    However, a construct like:
    for(i=0; i<10; i++)
    { ... }
    Will emit two of them, so they are not *that* rare either.

    Since the compiler can see that the loop is always executed, the
    first/top checking CMP-BC should not be emitted, leaving only 1.
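    That elided-top-test lowering is the standard loop-rotation trick; a
    minimal C sketch of the shape (generic compiler behavior, not any
    specific compiler's output):

```c
/* A constant-bound loop like: for (i = 0; i < 10; i++) body;
 * naively lowers to two compare-with-immediate branches: a top
 * test and a bottom test. When the compiler can prove the body
 * runs at least once (here, 10 > 0 at compile time), the top
 * test is elided, leaving one CMP-BC per iteration. */
static int run_rotated_loop(void) {
    int i = 0, count = 0;
    do {                /* no top test needed */
        count++;        /* loop body */
        i++;
    } while (i < 10);   /* single compare-and-branch */
    return count;
}
```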

    Still, a lot rarer in use than:
    val=ptr[idx];
    Though...

    Have noted, though, that simple constant for-loops are a minority; far
    more often they are something like:
    for(i=0; i<n; i++)
    { ... }
    Which doesn't use any.

    Or:
    while(i--)
    { ... }
    Which uses a compare with zero (in RV, can be encoded with the zero
    register; in BJX2 it has its own dedicated instruction due to the lack
    of a zero register; some of these were formally dropped in XG3, which
    does have access to a zero register, and encoding an op using ZR
    instead is considered preferable).

    I should note:: I have a whole class of conditional branches that
    include comparison to 0, {(signed), (unsigned), (float), (double)}
    and all 6 arithmetic comparands and auxiliary comparisons for NaNs
    and Infinities.

    I choose not to waste a register to hold zero. Once you have universal constants it is unnecessary.
    ------------------
    Huawei had a "less bad" encoding, but they burnt basically the entire
    User-1 block on it, so that isn't going to fly.

    Generally, around 95% of the function-local branches can hit in a Disp9,
    vs 98% for Disp12. So, better to drop to Disp9.

    DISP16 reaches farther...

    ------------------


    I suggest a psychiatrist.


    People are pointing to charts gathered by mining binaries and being
    like: "X10 and X11 are the two most commonly used registers".

    But, this is like pointing at x86 and being like:
    "EAX and ECX are the top two registers, who needs such obscure registers
    as ESI and EDI"?...

    Quit listening to them, use your own judgement.


    When I defined my own version of BccI (with a 64-bit encoding), how many
    new instructions did I need to define in the 32-bit base ISA: Zero.

    How many 64-bit encodings did My 66000 need:: zero.
    {Hint the words following the instruction specifier have no internal
    format}


    I consider the combination of Jumbo-Prefix and Suffix instruction to be
    a 64-bit instruction.

    I consider a multi-word instruction to have an instruction-specifier
    as the first 32-bits, and everything that follows is an attached
    constant.

    The only "prefixes" I have are CARRY and PREDication.

    -----------------------

    However, have noted that XG3 does appear to be faster than the original Baseline/XG1 ISA.


    Where, to recap:
    XG1 (Baseline):
      16/32/64/96 bit encodings;
        16-bit ops can access R0..R15 with 4b registers;
        Only 2R or 2RI forms for 16-bit ops;
        16-bit ISA still fairly similar to SuperH.
      5-bit register fields by default;
        6-bit available for an ISA subset.
      Disp9u and Imm9u/n for most immediate-form instructions;
      32 or 64 GPRs, Default 32.
        8 argument registers.
    XG2:
      32/64/96 bit encodings;
        All 16-bit encodings dropped.
      6-bit register fields (via a wonky encoding);
        Same basic instruction format as XG1,
        but 3 new bits stored inverted in the HOB of instr words;
      Mostly Disp10s and Imm10u/n;
      64 GPRs native;
        16 argument registers.
    XG3:
      Basically repacked XG2;
        Can exist in same encoding space as RISC-V ops;
        Aims for ease of compatibility with RV64G.
      Encoding was made "aesthetically nicer":
        All the register bits are contiguous and non-inverted;
        Most immediate fields are also once again contiguous;
        ...
      Partly reworks branch instructions;
        Scale=4, usually relative to BasePC (like RV);
      Uses RV's register numbering space (and ABI);
        E.g.: SP at R2 vs R15, ...
        (Partly carried over from XG2RV, which is now defunct.)
      64 GPRs, but fudged into RV ABI rules;
        Can't rebalance ABI without breaking RV compatibility;
        Breaking RV compatibility would defeat its point for existing.
      8 argument registers (because of the RV ABI).
        Could in theory expand to 16, but that would cause issues.
      Despite being based on XG2,
        BGBCC treats XG3 as an extension to RISC-V.


    Then, RV:
      16/32; 48/64/96 (Ext)
      Has 16-bit ops:
        Which are horribly dog-chewed,
        and only manage a handful of instructions.
        Many of the ops can only access X8..X15;
        With GCC, enabling RVC saves around 20% off the ".text" size.
      Imm12s and Disp12s for most ops;
      Lots of dog-chew in the encodings (particularly the Disp fields);
      JAL is basically confetti.
      ...

    My 66000
    32-bit instruction specifier
      if( inst[31..29] == 3'b001 )
        switch( inst[28..26] )
        { // groups with large constants
        case 3'b001: [Rbase+Rindex] memory reference instructions
        case 3'b010: 2-operand calculation instructions
        case 3'b100: 3-operand calculation instructions
        case 3'b101: 1-operand instructions
        }
      else
        switch( inst[31..29] )
        { // 1-word instructions
        case 3'b010: LOOP instruction
        case 3'b011: Branch instruction
        case 3'b100: LD disp16
        case 3'b101: ST disp16
        case 3'b110: integer imm16
        case 3'b111: logical imm16
        }

    Other than minor updates to the constant decoding patterns, this has
    been stable since 2012.
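    The major-opcode split sketched above can be rendered as a small C
    classifier (only the bit positions -- inst[31..29] and inst[28..26] --
    come from the post; the enum names are invented for illustration):

```c
#include <stdint.h>

/* Rough rendition of the My 66000 major-opcode decode described
 * above. Class names are hypothetical labels, not real mnemonics. */
enum inst_class {
    CLS_UNKNOWN,
    CLS_MEM_RR,   /* [Rbase+Rindex] memory ref, large constants */
    CLS_CALC2,    /* 2-operand calculation, large constants */
    CLS_CALC3,    /* 3-operand calculation, large constants */
    CLS_CALC1,    /* 1-operand, large constants */
    CLS_LOOP, CLS_BRANCH,
    CLS_LD16, CLS_ST16, CLS_INT16, CLS_LOG16
};

static enum inst_class classify(uint32_t inst) {
    uint32_t major = (inst >> 29) & 7;
    if (major == 1) {               /* 3'b001: large-constant groups */
        switch ((inst >> 26) & 7) {
        case 1: return CLS_MEM_RR;
        case 2: return CLS_CALC2;
        case 4: return CLS_CALC3;
        case 5: return CLS_CALC1;
        default: return CLS_UNKNOWN;
        }
    }
    switch (major) {                /* 1-word instructions */
    case 2: return CLS_LOOP;
    case 3: return CLS_BRANCH;
    case 4: return CLS_LD16;
    case 5: return CLS_ST16;
    case 6: return CLS_INT16;
    case 7: return CLS_LOG16;
    default: return CLS_UNKNOWN;
    }
}
```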

    In its basic form, RV is the worst performing option here, but people actually care about RISC-V, so supporting it is value-added.

    Imagine that, an ISA that requires more instructions takes more cycles
    !?!

    --------------
    Seems like a mostly pointless uphill battle trying to convince anyone of
    things that (at least to me) seem kinda obvious.

    Do not waste your time teaching pigs to put on lipstick. ...


    Theoretically, people who are working on trying to improve performance
    should also see obvious things, namely, that the primary issues
    negatively affecting performance are:
    The lack of Register-Indexed Load/Store;
    Cases where immediate and displacement fields are not big enough;
    Lack of Load/Store Pair.

    If you can fix a few 10%+ issues, this will save a whole lot more than focusing on 1% issues.

    Better to go to the 1% issues *after* addressing the 10% issues.


    If 20-30% of the active memory accesses are for arrays, and one needs
    to do SLLI+ADD+Ld/St, this sucks.

    If your Imm12 fails, and you need to do:
    LUI+ADDI+Op
    This also sucks.

    If your Disp12 fails, and you do LUI+ADD+Ld/St, likewise.

    They can argue, but with Zba, we can do:
    SHnADD+Ld/St
    But, this is still worse than a single Ld/St.
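    Zba's SH3ADD computes rd = (rs1 << 3) + rs2, folding the SLLI+ADD pair
    into one op; a register-indexed load would fold the final add into the
    load itself. A minimal sketch of the address math (instruction counts
    in the comments reflect the sequences discussed above):

```c
#include <stdint.h>

/* Zba SH3ADD semantics: rd = (rs1 << 3) + rs2. */
static uint64_t sh3add(uint64_t rs1, uint64_t rs2) {
    return (rs1 << 3) + rs2;
}

static const int64_t demo[4] = {10, 20, 30, 40};

/* Base RV64: SLLI t,idx,3; ADD t,base,t; LD rd,0(t)  -> 3 ops.
 * With Zba:  SH3ADD t,idx,base; LD rd,0(t)           -> 2 ops.
 * Reg-indexed load: LD rd,(base,idx*8)               -> 1 op.  */
static int64_t load_elem(const int64_t *base, uint64_t idx) {
    uintptr_t addr = (uintptr_t)sh3add(idx, (uint64_t)(uintptr_t)base);
    return *(const int64_t *)addr;
}
```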

    Imagine accessing an external array with 64-bit virtual address space::
    RISC-V
    AUIPC Rt,hi(GOT[#k])
    LDD Rt,lo(GOT[#k])[Rt]
    SLL Rs,Rindex,#3
    ADD Rt,Rt,Rs
    LDD Rt,0[Rt]
    5 instruction words, 2 data words.

    My 66000
    LDD Rt,[IP,,GOT[#k]]
    LDD Rt,[Rt,Ri<<3]
    3 instruction words, 0 data words.
    ------------------------------

    If these issues are addressed, there is around a 30% speedup, even with
    a worse compiler.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sat Feb 1 04:05:58 2025
    On Sat, 1 Feb 2025 1:56:16 +0000, BGB wrote:

    On 1/31/2025 1:30 PM, MitchAlsup1 wrote:

    Generally, around 95% of the function-local branches can hit in a Disp9,
    vs 98% for Disp12. So, better to drop to Disp9.

    DISP16 reaches farther...


    But...

    Disp16 is not going to fit into such a 32-bit encoding...

    It fit in mine !

    But, say, 16+6+5+3 = 30.
    Would have burned the entire 32-bit encoding space on BccI ...

    Which is why one does not do CMP-BC in one instruction !
    The best you can do only covers 60%-odd of the cases.
    -------------------

    In XG3's encoding scheme, a similar construct would give:
    Bcc Imm17s, Rs, Disp10s
    Or:
    Bcc Rt, Rs, Disp33s
    But, where Bcc can still encode R0..R63.

    It is possible that a 96-bit encoding could be defined:
    Bcc Imm26s, Rs, Disp33 //RV+Jx
    Bcc Imm30s, Rs, Disp33 //XG3

    Having not found a function that takes ¼GB of space, I remain
    comfortable with 28-bit branch displacement range. I also have
    CALL instructions that reach 32-bit or 64-bit VAS.
    ------------------------------
    Granted, I understand a prefix as being fetched and decoded at the same
    time as the instruction it modifies.

    Instruction needs to be plural.

    Some people seem to imagine prefixes as executing independently and then setting up some sort of internal registers which carry state over to the following instruction.

    Instruction needs to be plural.
    --------------
    Ironically though, the GCC and Clang people, and RV people, are
    seemingly also averse to scenarios that involve using implicit runtime
    calls.

    Granted, looking at it, I suspect things like implicit runtime calls
    (or call-threaded code) would be a potential "Achilles' heel" situation
    for GCC performance, as its register allocation strategy seems to
    prefer using scratch registers and then spilling them on function calls
    (rather than callee-save registers, which don't require a spill).

    I know of a senior compiler writer at CRAY who would argue that
    callee-save registers are anathema--and had a litany of reasons
    thereto (now long forgotten by me).

    So, if one emits chunks of code that are basically end-to-end function
    calls, they may perform more poorly than they might have otherwise.

    These lower-level supervisory routines are the ones least capable of
    using callee save registers in a way that saves cycles--often trading
    register MOV instructions for LD instructions setting up arguments
    and putting (ST) results where they can be used later.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sun Feb 2 01:22:47 2025
    On Sat, 1 Feb 2025 22:42:39 +0000, BGB wrote:

    On 1/31/2025 10:05 PM, MitchAlsup1 wrote:
    --------------------------------
    Whereas, if performance is dominated by a piece of code that looks like,
    say:
    v0=dytf_int2fixnum(123);
    v1=dytf_int2fixnum(456);
    v2=dytf_mul(v0, v1);
    v3=dytf_int2fixnum(789);
    v4=dytf_add(v2, v3);
    v5=dytf_wrapsymbol("x");
    dytf_storeindex(obj, v5, v4);
    ...
    With, say, N levels of call-graph in each called function, but with this
    sort of code still managing to dominate the total CPU ("Self%" time).

    This seems to be a situation where callee-save registers are a big win
    for performance IME.

    With callee save registers, the prologue and epilogue of subroutines
    sees all the save/restore memory traffic; sometimes saving a register
    that is not "in use" and restoring it later.

    With caller save registers, the caller saves exactly the registers
    it needs preserved, while the callee saves/restores none. Moreover
    it only saves registers currently "in use" and may defer restoring
    since it does not need that value in that register for a while.

    So, the instruction path length has a better story in caller saves
    than callee saves. Nothing that was "Not live" is ever saved or
    restored.

    The arguments for callee save have to do with I cache footprint.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)