• EXECUTE implementation in native-code systems (was: nest-sys revisited)

    From Anton Ertl@21:1/5 to dxf on Mon Mar 17 06:12:38 2025
    dxf <dxforth@gmail.com> writes:
    >Would you agree 'nest-sys' are peculiar to colon definitions. That
    >EXECUTE is a different class of function. It's not doing a 'call'
    >as such and not leaving anything on the 'return stack'?

    That's certainly the case for threaded-code implementations.

    For native-code implementations the implementation of EXECUTE is
    usually an indirect call; sometimes an indirect tail-call, i.e. a
    jump.
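
    As a rough model of the indirect-call case, the following C sketch
    treats an xt as the address of a C function and lets EXECUTE call
    through it. The names (push, pop, plus, foo, demo) and the 16-cell
    stack are assumptions made for illustration; no real Forth system is
    implemented this way, and the cast between integers and function
    pointers relies on common-platform behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: an xt is simply the address of a C function. */
typedef void (*xt_t)(void);

static intptr_t stack[16];   /* data stack */
static int sp = 0;           /* number of cells in use */

static void push(intptr_t x) { assert(sp < 16); stack[sp++] = x; }
static intptr_t pop(void)    { assert(sp > 0);  return stack[--sp]; }

/* + ( n1 n2 -- n3 ) */
static void plus(void) { intptr_t b = pop(); intptr_t a = pop(); push(a + b); }

/* EXECUTE ( i*x xt -- j*x ): pop the xt, then call through it.
   The C call pushes a return address, much like an indirect CALL. */
static void execute(void) {
    xt_t xt = (xt_t)pop();
    xt();
}

/* : foo execute ; */
static void foo(void) { execute(); }

/* Models  1 2 ' + foo  and returns the resulting top of stack. */
intptr_t demo(void) {
    sp = 0;
    push(1); push(2); push((intptr_t)plus);
    foo();
    return pop();
}
```

    Calling demo() runs the equivalent of 1 2 ' + foo. A compiler that
    turns the call in foo into a jump would correspond to the tail-call
    variant.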

    In VFX64 5.43:

    : foo execute ; ok
    see foo
    FOO
    ( 0050A250 488BD3 ) MOV RDX, RBX
    ( 0050A253 488B5D00 ) MOV RBX, [RBP]
    ( 0050A257 488D6D08 ) LEA RBP, [RBP+08]
    ( 0050A25B 48FFD2 ) CALL RDX
    ( 0050A25E C3 ) RET/NEXT

    However:

    see execute
    EXECUTE
    ( 004211B0 53 ) PUSH RBX
    ( 004211B1 488B5D00 ) MOV RBX, [RBP]
    ( 004211B5 488D6D08 ) LEA RBP, [RBP+08]
    ( 004211B9 C3 ) RET/NEXT
    ( 10 bytes, 4 instructions )

    The push-ret combination is an extremely slow form of an indirect
    jump; so where is the return address (nest-sys) here? It's the return
    address of the surrounding call. E.g., if you do

    ' + ' execute foo

    it's the call in FOO.


    SwiftForth 4.0.0-RC89:

    see foo
    4519B7 4028CB ( EXECUTE ) JMP E90F0FFBFF ok

    That's a tail-call to EXECUTE. When EXECUTE is not tail-called, the
    code of EXECUTE is invoked with call:

    : bar execute . ; ok
    see bar
    4519D3 4028CB ( EXECUTE ) CALL E8F30EFBFF
    4519D8 40B043 ( . ) JMP E96696FBFF ok

    see execute
    4028CB RBX RCX MOV 488BCB
    4028CE 0 [RBP] RBX MOV 488B5D00
    4028D2 8 [RBP] RBP LEA 488D6D08
    4028D6 4028DD JRCXZ E305
    4028D8 RDI RCX ADD 4801F9
    4028DB RCX JMP FFE1
    4028DD RET C3 ok

    This special-cases the 0 EXECUTE case as NOOP, and also adds an offset
    (the image start?) to the xt before performing the indirect jump, but
    if you ignore those parts, this EXECUTE does the same things as VFX's,
    except that it uses the much faster indirect jmp rather than push-ret.
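
    The two extras in this EXECUTE (the zero check and the relocation)
    can be modelled in C as below. This is only a sketch: the table
    `image`, the word `bump` and the function names are hypothetical, and
    the table lookup stands in for the base+offset addition, since C
    cannot do arithmetic on code addresses. Whether RDI really holds the
    image start is a guess, as noted above.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*code_t)(void);

static int counter = 0;
static void bump(void) { counter += 1; }

/* Hypothetical "image": xts are offsets into this table; 0 is reserved. */
static code_t image[] = { NULL, bump };

/* EXECUTE for an image-relative xt: 0 EXECUTE is a no-op (the JRCXZ
   path); any other xt is relocated by the image base and invoked. */
void execute_rel(size_t xt) {
    if (xt == 0) return;                       /* JRCXZ ... RET */
    assert(xt < sizeof image / sizeof image[0]);
    code_t *entry = image + xt;                /* models RDI RCX ADD */
    (*entry)();                                /* models RCX JMP */
}

/* Runs  0 EXECUTE  once and  ' bump EXECUTE  twice; returns the count. */
int demo_rel(void) {
    counter = 0;
    execute_rel(0);   /* does nothing */
    execute_rel(1);   /* runs bump */
    execute_rel(1);
    return counter;
}
```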


    lxf 1.7-172-983:
    see foo
    8692BC4 8050E6E 11 88C8000 5 normal FOO

    8050E6E 8BC3 mov eax , ebx
    8050E70 8B5D00 mov ebx , [ebp]
    8050E73 8D6D04 lea ebp , [ebp+4h]
    8050E76 FFD0 call eax
    8050E78 C3 ret near

    Here the EXECUTE is compiled inline and essentially implemented as
    an indirect call. lxf does not perform tail-call optimization.

    see execute
    868E2FC 88D6B47 11 88D475B 92 prim EXECUTE

    88D6B47 8BC3 mov eax , ebx
    88D6B49 8B5D00 mov ebx , [ebp]
    88D6B4C 8D6D04 lea ebp , [ebp+4h]
    88D6B4F FFD0 call eax
    88D6B51 C3 ret near

    The same code as FOO; after all, both words do the same thing.


    iForth 5.1-mini (I think):

    FORTH> ' foo idis
    $10226000 : foo 488BC04883ED088F4500 H.@H.m..E.
    $1022600A pop rbx 5B [
    $1022600B or rbx, rbx 4809DB H.[
    $1022600E je $10226016 offset NEAR 0F8402000000 ......
    $10226014 call rbx FFD3 .S
    $10226016 ; 488B45004883C508FFE0 H.E.H.E..` ok

    The use of call here is interesting, because iForth uses RSP as the
    data-stack pointer (e.g., the "pop rbx" moves the xt into rbx) and
    RBP as the return-stack pointer. Note the 10 bytes at the start of
    foo that are not shown. If I disassemble that code (into AT&T
    syntax), it looks as follows:

    0x10226000: mov %rax,%rax
    0x10226003: sub $0x8,%rbp
    0x10226007: pop 0x0(%rbp)
    0x1022600a: pop %rbx
    0x1022600b: or %rbx,%rbx
    0x1022600e: je 0x10226016
    0x10226014: call *%rbx
    0x10226016: mov 0x0(%rbp),%rax
    0x1022601a: add $0x8,%rbp
    0x1022601e: jmp *%rax

    So here we see the first and last three instructions disassembled
    (which "idis" does not do). The third instruction moves the return
    address from the RSP stack to the RBP stack, and the second
    instruction adjusts RBP for that. Note that this invocation via call
    is not the usual way to invoke a colon definition from compiled code
    in iForth. E.g.:

    FORTH> : x . . ;
    FORTH> ' x idis
    $10226940 : x 488BC04883ED088F4500 H.@H.m..E.
    $1022694A lea rbp, [rbp -8 +] qword 488D6DF8 H.mx
    $1022694E mov [rbp 0 +] qword, $1022695B d# 48C745005B692210 HGE.[i".
    $10226956 jmp .+A ( $1013888A ) offset NEAR E92F1FF1FF i/.q.
    $1022695B jmp .+A ( $1013888A ) offset NEAR E92A1FF1FF i*.q.
    $10226960 ; 488B45004883C508FFE0 H.E.H.E..`

    Note that both calls to "." jump to ".+A", i.e., they skip the first
    three instructions. The first invocation of "." pushes the return
    address explicitly in the instructions at $1022694A and $1022694E, the
    second invocation is a tail-call.

    Back to EXECUTE: This means that iForth implements EXECUTE as pushing
    the return address (in a convoluted way).


    In the general case (non-tail EXECUTE) in all these native-code
    systems a compiled EXECUTE pushes the return address.

    This is not a problem for standard code because colon definitions and
    does>-following code are not allowed to inspect stuff on the return
    stack that they did not push there, and because other words either
    don't access the return stack, or ticking them is non-standard (e.g.,
    ' R@ is non-standard).

    Could it be done without call? How would the return to the code after
    the EXECUTE happen? One way to do it would be as follows:

    The code for general (non-tail) EXECUTE:

    ... stack adjustments
    mov rax, ra
    jmp rdx # execute the xt
    ra:

    and for a constant the xt code would be:

    ... stack adjustment
    mov rbx, const
    jmp rax

    while for a colon definition the xt code would be:

    push rax
    entry: #entry point for compiled code
    ... code of the colon definition
    ret

    The disadvantage of the scheme is that it does not pair the ret with a
    call, but with a push, which leads to slow branch mispredictions. It
    seems to me that if you want to use ret for EXIT and call for compiled
    colon definitions, having a call for a non-tail EXECUTE is the most
    efficient way to go.
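
    The control flow of this call-free scheme can be traced with a small
    simulated machine in C, where labels stand for code addresses and the
    variables rax and rdx stand for the registers. Everything here (the
    label names, the constant 42, a colon definition that doubles the top
    of stack) is a hypothetical example; the model shows only where the
    return address travels, not the branch-prediction cost.

```c
#include <assert.h>

/* Labels standing in for code addresses (all hypothetical). */
enum { L_START, L_RA1, L_RA2, L_CONST42, L_COLON_XT, L_COLON_ENTRY, L_HALT };

/* Models  ' c42 EXECUTE  ' dbl EXECUTE  with the return address
   handed over in a register instead of being pushed by a call. */
int run_link_register_demo(void) {
    int data[8], dsp = 0;     /* data stack */
    int rstack[8], rsp = 0;   /* return stack */
    int rax = L_HALT;         /* register carrying the return address */
    int rdx = L_HALT;         /* register carrying the xt */
    int pc = L_START;

    for (;;) {
        switch (pc) {
        case L_START:              /* non-tail EXECUTE of the constant */
            rdx = L_CONST42;
            rax = L_RA1;           /* mov rax, ra */
            pc = rdx;              /* jmp rdx */
            break;
        case L_RA1:                /* non-tail EXECUTE of the colon def */
            rdx = L_COLON_XT;
            rax = L_RA2;           /* mov rax, ra */
            pc = rdx;              /* jmp rdx */
            break;
        case L_RA2:
            pc = L_HALT;
            break;
        case L_CONST42:            /* xt code of a constant */
            assert(dsp < 8);
            data[dsp++] = 42;      /* mov rbx, const */
            pc = rax;              /* jmp rax */
            break;
        case L_COLON_XT:           /* xt code of a colon definition */
            assert(rsp < 8);
            rstack[rsp++] = rax;   /* push rax */
            pc = L_COLON_ENTRY;    /* continue at the compiled-code entry */
            break;
        case L_COLON_ENTRY:        /* body of  : dbl dup + ;  */
            assert(dsp > 0 && rsp > 0);
            data[dsp - 1] += data[dsp - 1];
            pc = rstack[--rsp];    /* ret */
            break;
        case L_HALT:
            assert(dsp > 0);
            return data[dsp - 1];
        }
    }
}
```

    In this trace the constant never touches the return stack, while the
    colon definition's ret consumes an address that was stored by a push
    rather than by a call, which is the pairing problem described above.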

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to Anton Ertl on Mon Mar 17 12:47:48 2025
    In article <2025Mar17.071238@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    [...]

    To top it off, here is EXECUTE in indirect threaded code, with RSP as
    the data-stack pointer:

    CODE EXECUTE
    POP WOR ; working register contains pointer to header
    JMP [WOR + CODE_OFFSET] ; Assuming the code field is at CODE_OFFSET.

    In ciforth I made this offset 0 for a slight gain in efficiency:
    1224 0df8 58 POP %RAX # GET HEADER
    1225 0df9 FF20 JMP QWORD PTR[%RAX] #(IP) <- (CFA)

    If the code field contains DOCOL, DOCOL does the nesting; but the xt
    may just as well belong to a code word like DROP, in which case
    nothing is nested.
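
    A minimal C model of this indirect-threaded EXECUTE, assuming a
    header whose code field sits at offset 0. The struct layout, the
    NULL-terminated body and all names are hypothetical; the C call
    stack stands in for the Forth return stack and for NEXT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct header;
typedef void (*code_t)(struct header *self);

/* Hypothetical header: code field first, i.e. at offset 0. */
struct header {
    code_t code;             /* code field */
    struct header **body;    /* colon definitions: NULL-terminated xts */
};

static intptr_t stack[16];   /* data stack */
static int sp = 0;
static void push(intptr_t x) { assert(sp < 16); stack[sp++] = x; }
static intptr_t pop(void)    { assert(sp > 0);  return stack[--sp]; }

/* EXECUTE: POP WOR, then jump through the code field at [WOR]. */
static void execute(void) {
    struct header *wor = (struct header *)pop();
    wor->code(wor);
}

/* Code words: nothing is nested. */
static void do_drop(struct header *self) { (void)self; (void)pop(); }
static void do_dup(struct header *self) {
    (void)self;
    intptr_t x = pop();
    push(x); push(x);
}

/* DOCOL: the nesting happens here, not in EXECUTE. */
static void docol(struct header *self) {
    for (struct header **ip = self->body; *ip != NULL; ip++)
        (*ip)->code(*ip);
}

static struct header drop_h = { do_drop, NULL };
static struct header dup_h  = { do_dup,  NULL };
static struct header *dupdup_body[] = { &dup_h, &dup_h, NULL };
static struct header dupdup_h = { docol, dupdup_body };  /* : dupdup dup dup ; */

/* Models  7 9 ' drop EXECUTE ' dupdup EXECUTE  and returns the depth. */
int demo_itc(void) {
    sp = 0;
    push(7); push(9);
    push((intptr_t)&drop_h);   execute();
    push((intptr_t)&dupdup_h); execute();
    return sp;
}
```

    EXECUTE itself is identical for both kinds of word; only the routine
    found in the code field decides whether anything is nested.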



    Groetjes Albert
    --
    Temu exploits Christians: (Disclaimer, only 10 apostles)
    Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
    Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
    And Gifts For Friends Family And Colleagues.
