anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
see-code exchange see-code exchange4 see-code exchange2
over 1->2 dup 1->2 dup >r 1->1
mov r15,$08[r12] mov r15,r8 >r 1->1
@ 2->2 @ 2->2 mov -$08[r13],r8
mov r15,[r15] mov r15,[r15] sub r13,$08
swap 2->3 rot 2->3 @ 1->1
add r12,$08 mov r9,$08[r12] mov r8,[r8]
mov r9,r8 add r12,$08 over 1->2
mov r8,[r12] !@ 3->2 mov r15,$08[r12]
!@ 3->2 mov rax,r15 @ 2->2
mov rax,r15 mov r15,[r9] mov r15,[r15]
mov r15,[r9] mov [r9],rax r> 2->3
mov [r9],rax swap 2->3 mov r9,$00[r13]
swap 2->3 add r12,$08 add r13,$08
add r12,$08 mov r9,r8 ! 3->1
mov r9,r8 mov r8,[r12] mov [r9],r15
mov r8,[r12] ! 3->1 swap 1->2
! 3->1 mov [r9],r15 mov r15,$08[r12]
mov [r9],r15 ;s 1->1 add r12,$08
;s 1->1 mov rbx,$00[r13] ! 2->0
mov rbx,$00[r13] add r13,$08 mov [r15],r8
add r13,$08 mov rax,[rbx] ;s 0->1
mov rax,[rbx] jmp eax mov r8,$08[r12]
jmp eax add r12,$08
mov rbx,$00[r13]
add r13,$08
mov rax,[rbx]
jmp eax
The difference between exchange and exchange4 shows how stack caching
can have a hard-to-predict effect. Gforth searches for the shortest
path through the available stack-cache states, where shortness is
defined by the native-code length. E.g., it starts with state 1, and
from there it can use any of the dup variants starting in state 1, or
first transition to another state and use a dup variant starting from
there.
For SWAP and ROT gforth-fast has the following variants:
primitive in-out # code bytes
swap 1-1 132 len= 4+ 13+ 3
swap 2-2 37 len= 4+ 9+ 3
swap 3-3 4 len= 4+ 9+ 3
swap 0-2 8 len= 4+ 14+ 3
swap 1-2 82 len= 4+ 9+ 3
swap 2-1 74 len= 4+ 8+ 3
swap 2-3 30 len= 4+ 11+ 3
swap 3-2 3 len= 4+ 11+ 3
swap 2-0 20 len= 4+ 13+ 3
rot 1-1 46 len= 4+ 23+ 3
rot 3-3 6 len= 4+ 12+ 3
rot 3-1 24 len= 4+ 13+ 3
rot 2-3 15 len= 4+ 9+ 3
rot 1-3 17 len= 4+ 17+ 3
rot 0-3 1 len= 4+ 19+ 3
You get these data (in a rawer form) with
gforth-fast --print-prim -e bye |& grep ^swap
gforth-fast --print-prim -e bye |& grep ^rot
The in column is the stack-cache state on entering the word, the out
column is the stack-cache state on leaving the word. The # column
shows how many times this variant of the primitive is used (static
counts). The code length colum shows three parts, the middle of which
is the part that's copied to dynamic superinstructions like the ones
shown above, and this length is what is used in the search for the
shortest path.
In EXCHANGE4, the shortest variant of ROT is used: ROT 2->3; and the
primitives of the other variants are selected to also result in the
shortest overall code.
In EXCHANGE and EXCHANGE4, SWAP 2->3 is not the shortest variant of
SWAP, not even the shortest variant starting from state 2, but ending
in state 3 allows to use cheap variants of further primitives such as
!@ and !, resulting in the overall shortest code for this sequence.
In EXCHANGE2, we see the selection of a shorter version of SWAP, but
one of the costs is that ;s becomes longer (but in this case the
overall savings from using a shorter version of SWAP and shorter
versions of earlier instructions make up for that).
Why am I looking at this? For stack-shuffling primitives like SWAP
and ROT, it's not obvious which variant is how long and which variant
should be selected.
These stack-shiffling words therefore are also good candidates for
performing stack-cache state transitions that would otherwise require
inserting extra transition code:
E.g., EXCHANGE and its variants consume two stack items, but need to
start in stack-cache state 1 and end in stack-cache state 1 (gforth is currently not smart enough to deal with other states at basic-block boundaries), so not everything can be done in the stack cache; the
stack pointer needs to be increased by two cells, and there need to be
accesses to the memory part of the stack for two stack items.
In EXCHANGE, the adjustment by one cell and memory access for one
stack item is done in the first SWAP 2->3, and another one in the
second one. In EXCHANGE4, ROT 2->3 and SWAP 2->3 perform these tasks.
In EXCHANGE2, the SWAP 1->3 does it for one cell, and the stack-cache
state transition 0->1 in the first two instructions of ;s do it for
the other cell (gforth-fast actually does not have a built-in variant
;S 0->1 and the code shown as ;S 0->1 by SEE-CODE is actually composed
of a transition 0->1 and the ;S 1->1 variant).
I wanted to know how often which variant of these stack-shuffling
primitives is used, and how this relates to their length. One
interesting result is that ROT 1->3 is used relatively frequently
despite having relatively long code. Apparently the code that comes
before these 17 instances of ROT benefits a lot from being in
stack-cache state 1 and this amortizes the longer code for the ROT
3 compared to ROT 2->3.
Another interesting result is the low usage of SWAP 3->2 compared to
SWAP 2->3. This may say something about how SWAP is used in Forth
programs. Or it may be an artifact of tie-breaking: If two paths have
the same length, one is chosen rather arbitrarily, but consistently,
and this may make one variant appear more useful than merited by the
benefit that the existence of the variant has on code length.
For those interested, here's the code for the various variants shown
above:
r12: stack pointer
r8: stack cache register a (tos in state 1, 2nd in state 2, 3rd in state 3) r15: stack-cache register b (tos in state 2, 2nd in state 3)
r9: stack-cache register c (tos in state 3)
swap 1-1
559E7F769425: mov rax,$08[r12]
559E7F76942A: mov $08[r12],r8
559E7F76942F: mov r8,rax
swap 2-2
559E7F76E5B1: mov rax,r8
559E7F76E5B4: mov r8,r15
559E7F76E5B7: mov r15,rax
swap 3-3
559E7F76E5C3: mov rax,r15
559E7F76E5C6: mov r15,r9
559E7F76E5C9: mov r9,rax
swap 0-2
559E7F76E5D5: mov r15,$10[r12]
559E7F76E5DA: mov r8,$08[r12]
559E7F76E5DF: add r12,$10
swap 1-2
559E7F76E5EC: mov r15,$08[r12]
559E7F76E5F1: add r12,$08
swap 2-1
559E7F76E5FE: mov [r12],r15
559E7F76E602: sub r12,$08
swap 2-3
559E7F76E60F: add r12,$08
559E7F76E613: mov r9,r8
559E7F76E616: mov r8,[r12]
swap 3-2
559E7F76E623: mov [r12],r8
559E7F76E627: mov r8,r9
559E7F76E62A: sub r12,$08
swap 2-0
559E7F76E637: mov [r12],r15
559E7F76E63B: sub r12,$10
559E7F76E63F: mov $08[r12],r8
rot 1-1
559E7F76944C: mov rdx,$08[r12]
559E7F769451: mov rax,$10[r12]
559E7F769456: mov $08[r12],r8
559E7F76945B: mov $10[r12],rdx
559E7F769460: mov r8,rax
rot 3-3
559E7F76EDC0: mov rax,r8
559E7F76EDC3: mov r8,r15
559E7F76EDC6: mov r15,r9
559E7F76EDC9: mov r9,rax
rot 3-1
559E7F76EDD5: mov [r12],r15
559E7F76EDD9: sub r12,$10
559E7F76EDDD: mov $08[r12],r9
rot 2-3
559E7F76EDEB: mov r9,$08[r12]
559E7F76EDF0: add r12,$08
rot 1-3
559E7F76EDFD: mov r9,$10[r12]
559E7F76EE02: mov r15,r8
559E7F76EE05: add r12,$10
559E7F76EE09: mov r8,-$08[r12]
rot 0-3
559E7F76EE17: mov r9,$18[r12]
559E7F76EE1C: mov r8,$10[r12]
559E7F76EE21: add r12,$18
559E7F76EE25: mov r15,-$10[r12]
- anton
--
M. Anton Ertl
http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs:
http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard:
https://forth-standard.org/
EuroForth 2023 proceedings:
http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings:
http://www.euroforth.org/ef24/papers/
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)