• Interpreters and caller-saved registers

    From anton@anton@mips.complang.tuwien.ac.at to comp.compilers on Fri Oct 13 07:44:46 2023
    From Newsgroup: comp.compilers

    In August we discussed in comp.arch that Intel's APX will increase the
    number of caller-saved registers, but not the number of callee-saved
    registers. And I pointed out that more caller-saved registers are
    hardly useful for interpreters.

    This discussion inspired me to reconsider how we might use
    caller-saved registers for stack caching in Gforth. With stack
    caching as used in Gforth [ertl&gregg05], 0 to n stack items reside
    in registers, resulting in n+1 different stack representations (one
    for each number of stack items kept in registers).

    E.g., on RISC-V the stack representations S0-S3 are:

    stack
    element   S0      S1      S2      S3
    top       8(s8)   s7      s0      s2
    second    16(s8)  8(s8)   s7      s0
    third     24(s8)  16(s8)  8(s8)   s7
    fourth    32(s8)  24(s8)  16(s8)  8(s8)

    With these stack representations, the definition

    : cubed dup dup * * ;

    is compiled into:

    $3F9F301748 dup 1->2
    3F9EFA4ACA: mv s0,s7
    $3F9F301750 dup 2->3
    3F9EFA4ACC: mv s2,s0
    $3F9F301758 * 3->2
    3F9EFA4ACE: mul s0,s0,s2
    $3F9F301760 * 2->1
    3F9EFA4AD2: mul s7,s7,s0
    $3F9F301768 ;s 1->1
    3F9EFA4AD6: ld a6,0(s9)
    3F9EFA4ADA: addi s9,s9,8
    3F9EFA4ADC: mv s10,a6
    3F9EFA4ADE: ld a5,0(s10)
    3F9EFA4AE2: jr a5

    We use n=3 if enough registers are available, but until September 2023
    we used n=1 on AMD64 because of a lack of registers. (n>3 results in
    diminishing returns [ertl&gregg05].)

    So on AMD64 we got:

    $7F4534F89B00 dup 1->1
    0x00007f4534c41030: mov %r8,0x0(%r13)
    0x00007f4534c41034: sub $0x8,%r13
    $7F4534F89B08 dup 1->1
    0x00007f4534c41038: mov %r8,0x0(%r13)
    0x00007f4534c4103c: sub $0x8,%r13
    $7F4534F89B10 * 1->1
    0x00007f4534c41040: imul 0x8(%r13),%r8
    0x00007f4534c41045: add $0x8,%r13
    $7F4534F89B18 * 1->1
    0x00007f4534c41049: imul 0x8(%r13),%r8
    0x00007f4534c4104e: add $0x8,%r13
    $7F4534F89B20 ;s 1->1
    0x00007f4534c41052: mov (%rbx),%r10
    0x00007f4534c41055: add $0x8,%rbx
    0x00007f4534c41059: mov %r10,%r15
    0x00007f4534c4105c: mov (%r15),%rcx
    0x00007f4534c4105f: jmp *%rcx

    I.e., lots of memory accesses and lots of stack pointer updates.

    As you can see in the RISC-V register names, gcc used callee-saved
    registers (with names starting with "s") rather than caller-saved
    registers (names starting with "t") for the stack cache registers.
    That's because there are >100 VM instructions in Gforth that call
    functions with code like:

    /* start in S1 */
    char *a_addr = (char *)s7;
    free_l(a_addr);
    long wior = 0;
    s7 = wior;
    /* end in S1 */
    ... invoke next VM instruction ...

    (I used s7 here instead of the name of the C variable to make the
    connection with the stuff above clearer).

    Now the idea for making use of caller-saved registers is that, for a
    VM instruction that ends in representation S1, (the variables in)
    registers s0 and s2 are dead, because any transition to S2 or S3 will
    write these registers. Unfortunately, gcc does not know that, and
    assumes that the invocation of the next VM instruction could also
    invoke something that expects the stack to be in S3, where s0 and s2
    are alive. Therefore it thinks that s0 and s2 are alive at the call
    to free() and allocates them in callee-saved registers. In Gforth,
    the VM instructions containing calls all end in S1, so s0 and s2 are
    (in reality, but unknown to gcc) dead after each call, and could be
    allocated to caller-saved registers.

    So the idea is to kill these two registers after the call to free_l()
    (and after the other calls) in the eyes of gcc. More generally, my
    idea was to kill s2 at the end of every VM instruction that ends in
    S2, kill s2 and s0 at the end of every VM instruction that ends in S1,
    and kill s2, s0, and s7 at the end of every VM instruction that ends
    in S0. That should provide the maximum freedom to gcc, which is
    generally believed to help the compiler produce good code.

    How do we kill a variable/register? By writing to it, or by making
    the compiler believe that we write to it. One way would be:

    s2 = 0;

    However, this costs an instruction; a cheap one on sophisticated
    microarchitectures, but still. So what we actually do is:

    asm("":"=X"(s2))

    This tells gcc that the asm statement writes to s2, and thus kills it,
    but it actually does not generate any assembly language.

    I implemented this and then added S2 and S3 to AMD64, and gcc now
    indeed managed to keep s2 and s0 in caller-saved registers.

    Unfortunately, gcc-11.4 also introduced two additional redundant move
    instructions in every VM instruction, and Bernd Paysan reported that
    gcc-12 and gcc-13 introduced even more superfluous code in every VM
    instruction. This is similar to what we saw from gcc-3.0 for Gforth
    back then, and from clang the last time we tried it.

    I tried to work around this issue by having the kills only at the end
    of VM instructions that perform a call, and indeed, that worked for
    gcc-11.4. However, gcc-12 and gcc-13 still produced bad code.
    Finally Bernd Paysan had the right idea and added -fno-tree-vectorize
    to the list of options that we use to avoid gcc shenanigans, and now
    we can also use this idea with gcc-12 and gcc-13.

    The result is that the code for CUBED now looks as follows on AMD64:

    $7F9D0D2531F0 dup 1->2
    0x00007f9d0cefa8b2: mov %r8,%r15
    $7F9D0D2531F8 dup 2->3
    0x00007f9d0cefa8b5: mov %r15,%r9
    $7F9D0D253200 * 3->2
    0x00007f9d0cefa8b8: imul %r9,%r15
    $7F9D0D253208 * 2->1
    0x00007f9d0cefa8bc: imul %r15,%r8
    $7F9D0D253210 ;s 1->1
    0x00007f9d0cefa8c0: mov (%r14),%rbx
    0x00007f9d0cefa8c3: add $0x8,%r14
    0x00007f9d0cefa8c7: mov (%rbx),%rax
    0x00007f9d0cefa8ca: jmp *%rax

    In terms of performance, the difference is (on a 4GHz Core i5-6600K,
    numbers in seconds):

    sieve  bubble matrix fib    fft
    0.070  0.096  0.029  0.058  0.022  with S0-S1
    0.060  0.091  0.019  0.051  0.021  with S0-S3

    Concerning the discussion of caller-saved vs. callee-saved registers:
    if AMD64 had more callee-saved registers, we would have seen these
    performance benefits many years ago, and we would not have had to
    spend the work to introduce the kills, find the problems, and work
    around them.

    This optimization is quite specific to the stack caching optimization.
    A more general approach for VM interpreters is to surround every call
    with code that saves the VM registers before it and restores them
    afterwards. This allows the C compiler to put the VM registers into
    caller-saved registers. E.g.:

    vmstate->ip = ip;
    vmstate->sp = sp;
    ...
    free_l(a_addr);
    ip = vmstate->ip;
    sp = vmstate->sp;
    ...

    Of course, these saves and restores also cost, but at least they make
    it possible to use caller-saved registers for the VM instructions
    that do not perform calls. To reduce the overhead of these saves and
    restores, one can keep m VM registers in C variables across calls
    instead of saving them, where m is the architecture-specific number
    of callee-saved registers. So architectures with more callee-saved
    registers have less of this saving and restoring overhead.

    Followups set to comp.arch; change this to comp.compilers (moderated)
    if your followup is more appropriate for that group.

    @InProceedings{ertl&gregg05,
    author = {M. Anton Ertl and David Gregg},
    title = {Stack Caching in {Forth}},
    crossref = {euroforth05},
    pages = {6--15},
    url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg05.ps.gz},
    pdfurl = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/ertl%26gregg05.pdf},
    OPTnote = {not refereed},
    abstract = {Stack caching speeds Forth up by keeping stack items
    in registers, reducing the number of memory accesses
    for stack items. This paper describes our work on
    extending Gforth's stack caching implementation to
    support more than one register in the canonical
    state, and presents timing results for the resulting
    Forth system. For single-representation stack
    caches, keeping just one stack item in registers is
    usually best, and provides speedups up to a factor
    of 2.84 over the straight-forward stack
    representation. For stack caches with multiple stack
    representations, using the one-register
    representation as canonical representation is
    usually optimal, resulting in an overall speedup of
    up to a factor of 3.80 (and up to a factor of 1.53
    over single-representation stack caching).}
    }
    @Proceedings{euroforth05,
    title = {21st EuroForth Conference},
    booktitle = {21st EuroForth Conference},
    year = {2005},
    key = {EuroForth'05},
    editor = {M. Anton Ertl},
    url = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/proceedings.pdf}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.compilers on Sun Oct 15 19:52:45 2023

    [Replying to comp.compilers as this is more pertinent there]

    anton@mips.complang.tuwien.ac.at <anton@mips.complang.tuwien.ac.at> schrieb:

    > asm("":"=X"(s2))
    >
    > This tells gcc that the asm statement writes to s2, and thus kills it,
    > but it actually does not generate any assembly language.

    [...]

    > Unfortunately, gcc-11.4 also introduced two additional redundant move
    > instructions in every VM instruction, and Bernd Paysan reported that
    > gcc-12 and gcc-13 introduced even more superfluous code in every VM
    > instruction.

    It is well known that compilers in general and gcc specifically often
    generate superfluous register moves; there are quite a few PRs in
    gcc's bug database on this. I have submitted a few of them myself,
    such as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111373 which
    includes compiler-generated code like

    movq %rdx, %rsi
    movq %rax, %rdx
    movq %rcx, 8(%rdi)
    movq %rsi, %rax
    movq %rdx, 16(%rdi)
    movq %rax, (%rdi)
    ret

    where it is obvious to anybody who can read assembly that the
    register moves are unneeded (although they are likely to
    be zero-cycle operations because of register renaming).

    However, if this got worse between releases, it is a regression, and
    regressions get higher priority for fixing. So, if it is feasible
    to generate a reduced test case (for which cvise, for example, is an
    excellent tool), filing a bug report would be a good thing.

    > This is similar to what we have seen from gcc-3.0 for
    > Gforth at that time, and what we have seen from clang last we tried
    > it.
    >
    > I tried to work around this issue by having the kills only at the end
    > of VM instructions that perform a call, and indeed, that worked for
    > gcc-11.4. However, gcc-12 and gcc-13 still produced bad code.
    > Finally Bernd Paysan had the right idea and added -fno-tree-vectorize
    > to the list of options that we use to avoid gcc shenanigans, and now
    > we can also use this idea with gcc-12 and gcc-13.

    That is strange, and would give valuable hints for investigating
    this regression.

    This sort of code is an example of the contradictions in today's
    compiler technology. On the one hand, they do amazing optimizations
    on large amounts of code which no programmer could hope to reach
    while staying productive. On the other hand, it is very common
    to see glaring inefficiencies when one looks at even small chunks
    of code.

    (A good assembler programmer can often beat compiler-generated
    code by a factor of two or more, especially if SIMD is involved,
    but SIMD is really hard to generate code for).

    So far, nobody has found an algorithm to "just remove the
    silliness" from compiled programs. Maybe it would be feasible to
    run some peephole optimization as a last pass, which could improve
    code like the one above, but that might also be difficult in the
    more general case where registers are reused in other basic blocks
    (which would essentially mean redoing the register allocation).

    So, still work to do...
  • From anton@anton@mips.complang.tuwien.ac.at to comp.compilers on Thu Oct 19 17:14:28 2023

    Thomas Koenig <tkoenig@netcologne.de> writes:
    > It is well known that compilers in general and gcc specfically often
    > generate superflous register moves
    ...
    > However, if this got worse between releases, this is a regression.
    > Those get higher priority for fixing. So, if it is reasonable
    > to generate a reduced test case (for which cvise, for example,
    > is an excellent tool) so filing a bug report would be a good thing.

    If you want to file such a bug report, I can give you the commit of
    Gforth before we added all the workarounds, where all the problems
    are visible without ado.

    > This sort of code is an example of the contradictions in today's
    > compiler technology. On the one hand, they do amazing optimizations
    > on large amounts of code which no programmer could hope to reach
    > while staying productive.

    There are certainly cases where compilers unroll loops (plus
    ramp-up/ramp-down and compensation code) into huge swaths of code
    that make it hard to see where the action is, but amazing?

    > So far, nobody has found an algorithm for "just remove the
    > silliness" from compiled programs. Maybe it would be feasible to
    > run some peephole optimization as last passes which could improve
    > code like the one above, but that might also be difficult in the
    > more general case where registers are reused in other basic blocks
    > (which would mean just to redo the register allocation).

    Yes, I doubt that it can be solved with peephole optimization, or it
    would have been done already.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/
  • From anton@anton@mips.complang.tuwien.ac.at to comp.compilers on Tue Oct 24 17:15:04 2023

    Thomas Koenig <tkoenig@netcologne.de> writes:
    > anton@mips.complang.tuwien.ac.at <anton@mips.complang.tuwien.ac.at> schrieb:
    >> If you want to file such a bug report, I can give you the commit of
    >> Gforth before we added all the workarounds, where all the problems
    >> are visible without ado.

    This reply shows an interesting aspect of compiler development that is
    often overlooked: The social aspect.

    Compiler writers generally want to improve their product, but they
    also generally feel that bug submitters (at least those who don't
    have a support contract) should invest a minimum of work if they want
    something fixed, and a general "look at large package xyz, it'll be
    obvious" is below that threshold. (This is the reason why gcc, for
    example, asks for a complete and self-contained test case in its bug
    reports.)

    People who complain about bugs, but are not willing to put in that
    minimum amount of work, are often ignored.

    In my experience with gcc maintainers, when I put in that effort, it
    is wasted, because

    1) the resulting bug report is resolved as invalid (e.g., PR25285) in
    less time than it took me to create it. Moreover, gcc maintainers
    who knew much less about the performance implications of the bug
    than I did (I had researched the topic for several years at that
    point) wrote patronizingly down to me; but that would have been
    forgiven if they had kept their part of the social contract and
    fixed the bug.

    2) they confirm the bug and do nothing about it (PR93811).

    3) they just do nothing about it (PR93765).

    In the present case, declaring the bug to be INVALID would be easy
    given that the program uses features outside standard C, and I expect
    that if you make such a bug report, it will be resolved as INVALID.
    So why should anyone waste their time on it? If you think they are
    going to fix it, why don't you invest your time in it?

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/