• Stack caching (: Stack vs stackless operation)

    From Anton Ertl@21:1/5 to Anton Ertl on Sat Mar 1 08:18:06 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    see-code exchange see-code exchange4 see-code exchange2
    over 1->2 dup 1->2 dup >r 1->1
    mov r15,$08[r12] mov r15,r8 >r 1->1
    @ 2->2 @ 2->2 mov -$08[r13],r8
    mov r15,[r15] mov r15,[r15] sub r13,$08
    swap 2->3 rot 2->3 @ 1->1
    add r12,$08 mov r9,$08[r12] mov r8,[r8]
    mov r9,r8 add r12,$08 over 1->2
    mov r8,[r12] !@ 3->2 mov r15,$08[r12]
    !@ 3->2 mov rax,r15 @ 2->2
    mov rax,r15 mov r15,[r9] mov r15,[r15]
    mov r15,[r9] mov [r9],rax r> 2->3
    mov [r9],rax swap 2->3 mov r9,$00[r13]
    swap 2->3 add r12,$08 add r13,$08
    add r12,$08 mov r9,r8 ! 3->1
    mov r9,r8 mov r8,[r12] mov [r9],r15
    mov r8,[r12] ! 3->1 swap 1->2
    ! 3->1 mov [r9],r15 mov r15,$08[r12]
    mov [r9],r15 ;s 1->1 add r12,$08
    ;s 1->1 mov rbx,$00[r13] ! 2->0
    mov rbx,$00[r13] add r13,$08 mov [r15],r8
    add r13,$08 mov rax,[rbx] ;s 0->1
    mov rax,[rbx] jmp eax mov r8,$08[r12]
    jmp eax add r12,$08
    mov rbx,$00[r13]
    add r13,$08
    mov rax,[rbx]
    jmp eax

    The difference between exchange and exchange4 shows how stack caching
    can have a hard-to-predict effect. Gforth searches for the shortest
    path through the available stack-cache states, where shortness is
    defined by the native-code length. E.g., it starts with state 1, and
    from there it can use any of the dup variants starting in state 1, or
    first transition to another state and use a dup variant starting from
    there.

    For SWAP and ROT gforth-fast has the following variants:

    primitive in-out # code bytes
    swap 1-1 132 len= 4+ 13+ 3
    swap 2-2 37 len= 4+ 9+ 3
    swap 3-3 4 len= 4+ 9+ 3
    swap 0-2 8 len= 4+ 14+ 3
    swap 1-2 82 len= 4+ 9+ 3
    swap 2-1 74 len= 4+ 8+ 3
    swap 2-3 30 len= 4+ 11+ 3
    swap 3-2 3 len= 4+ 11+ 3
    swap 2-0 20 len= 4+ 13+ 3
    rot 1-1 46 len= 4+ 23+ 3
    rot 3-3 6 len= 4+ 12+ 3
    rot 3-1 24 len= 4+ 13+ 3
    rot 2-3 15 len= 4+ 9+ 3
    rot 1-3 17 len= 4+ 17+ 3
    rot 0-3 1 len= 4+ 19+ 3

    You get these data (in a rawer form) with

    gforth-fast --print-prim -e bye |& grep ^swap
    gforth-fast --print-prim -e bye |& grep ^rot

    The in column is the stack-cache state on entering the word, the out
    column is the stack-cache state on leaving the word. The # column
    shows how many times this variant of the primitive is used (static
    counts). The code length colum shows three parts, the middle of which
    is the part that's copied to dynamic superinstructions like the ones
    shown above, and this length is what is used in the search for the
    shortest path.

    In EXCHANGE4, the shortest variant of ROT is used: ROT 2->3; and the
    primitives of the other variants are selected to also result in the
    shortest overall code.

    In EXCHANGE and EXCHANGE4, SWAP 2->3 is not the shortest variant of
    SWAP, not even the shortest variant starting from state 2, but ending
    in state 3 allows to use cheap variants of further primitives such as
    !@ and !, resulting in the overall shortest code for this sequence.
    In EXCHANGE2, we see the selection of a shorter version of SWAP, but
    one of the costs is that ;s becomes longer (but in this case the
    overall savings from using a shorter version of SWAP and shorter
    versions of earlier instructions make up for that).

    Why am I looking at this? For stack-shuffling primitives like SWAP
    and ROT, it's not obvious which variant is how long and which variant
    should be selected.

    These stack-shiffling words therefore are also good candidates for
    performing stack-cache state transitions that would otherwise require
    inserting extra transition code:

    E.g., EXCHANGE and its variants consume two stack items, but need to
    start in stack-cache state 1 and end in stack-cache state 1 (gforth is currently not smart enough to deal with other states at basic-block boundaries), so not everything can be done in the stack cache; the
    stack pointer needs to be increased by two cells, and there need to be
    accesses to the memory part of the stack for two stack items.

    In EXCHANGE, the adjustment by one cell and memory access for one
    stack item is done in the first SWAP 2->3, and another one in the
    second one. In EXCHANGE4, ROT 2->3 and SWAP 2->3 perform these tasks.
    In EXCHANGE2, the SWAP 1->3 does it for one cell, and the stack-cache
    state transition 0->1 in the first two instructions of ;s do it for
    the other cell (gforth-fast actually does not have a built-in variant
    ;S 0->1 and the code shown as ;S 0->1 by SEE-CODE is actually composed
    of a transition 0->1 and the ;S 1->1 variant).

    I wanted to know how often which variant of these stack-shuffling
    primitives is used, and how this relates to their length. One
    interesting result is that ROT 1->3 is used relatively frequently
    despite having relatively long code. Apparently the code that comes
    before these 17 instances of ROT benefits a lot from being in
    stack-cache state 1 and this amortizes the longer code for the ROT
    3 compared to ROT 2->3.

    Another interesting result is the low usage of SWAP 3->2 compared to
    SWAP 2->3. This may say something about how SWAP is used in Forth
    programs. Or it may be an artifact of tie-breaking: If two paths have
    the same length, one is chosen rather arbitrarily, but consistently,
    and this may make one variant appear more useful than merited by the
    benefit that the existence of the variant has on code length.

    For those interested, here's the code for the various variants shown
    above:

    r12: stack pointer
    r8: stack cache register a (tos in state 1, 2nd in state 2, 3rd in state 3) r15: stack-cache register b (tos in state 2, 2nd in state 3)
    r9: stack-cache register c (tos in state 3)

    swap 1-1
    559E7F769425: mov rax,$08[r12]
    559E7F76942A: mov $08[r12],r8
    559E7F76942F: mov r8,rax

    swap 2-2
    559E7F76E5B1: mov rax,r8
    559E7F76E5B4: mov r8,r15
    559E7F76E5B7: mov r15,rax

    swap 3-3
    559E7F76E5C3: mov rax,r15
    559E7F76E5C6: mov r15,r9
    559E7F76E5C9: mov r9,rax

    swap 0-2
    559E7F76E5D5: mov r15,$10[r12]
    559E7F76E5DA: mov r8,$08[r12]
    559E7F76E5DF: add r12,$10

    swap 1-2
    559E7F76E5EC: mov r15,$08[r12]
    559E7F76E5F1: add r12,$08

    swap 2-1
    559E7F76E5FE: mov [r12],r15
    559E7F76E602: sub r12,$08

    swap 2-3
    559E7F76E60F: add r12,$08
    559E7F76E613: mov r9,r8
    559E7F76E616: mov r8,[r12]

    swap 3-2
    559E7F76E623: mov [r12],r8
    559E7F76E627: mov r8,r9
    559E7F76E62A: sub r12,$08

    swap 2-0
    559E7F76E637: mov [r12],r15
    559E7F76E63B: sub r12,$10
    559E7F76E63F: mov $08[r12],r8

    rot 1-1
    559E7F76944C: mov rdx,$08[r12]
    559E7F769451: mov rax,$10[r12]
    559E7F769456: mov $08[r12],r8
    559E7F76945B: mov $10[r12],rdx
    559E7F769460: mov r8,rax

    rot 3-3
    559E7F76EDC0: mov rax,r8
    559E7F76EDC3: mov r8,r15
    559E7F76EDC6: mov r15,r9
    559E7F76EDC9: mov r9,rax

    rot 3-1
    559E7F76EDD5: mov [r12],r15
    559E7F76EDD9: sub r12,$10
    559E7F76EDDD: mov $08[r12],r9

    rot 2-3
    559E7F76EDEB: mov r9,$08[r12]
    559E7F76EDF0: add r12,$08

    rot 1-3
    559E7F76EDFD: mov r9,$10[r12]
    559E7F76EE02: mov r15,r8
    559E7F76EE05: add r12,$10
    559E7F76EE09: mov r8,-$08[r12]

    rot 0-3
    559E7F76EE17: mov r9,$18[r12]
    559E7F76EE1C: mov r8,$10[r12]
    559E7F76EE21: add r12,$18
    559E7F76EE25: mov r15,-$10[r12]

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)