• Re: Stack vs stackless operation

    From albert@spenarnc.xs4all.nl@21:1/5 to LIT on Wed Feb 26 11:48:04 2025
    In article <a81ac9ee2ed92686e940a55bed9d4dfb@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
In the case of slower ITC non-optimizing Forths - like
fig-Forth, as the most obvious example - the "boost"
may be noticeable.
    I'll check that.

    code +> ( x y z -- )
    dx pop cx pop bx pop 0 [bx] ax mov cx bx xchg
    0 [bx] ax add dx bx xchg ax 0 [bx] mov next
    end-code
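
(For readers without the DX-Forth assembler, a portable high-level
equivalent is easy to write; a sketch, assuming +> takes three
variable addresses as in the timing below:)

: +> ( addr1 addr2 addr3 -- )  \ *addr3 = *addr1 + *addr2
  >r swap @ swap @ + r> ! ;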

    Timing (adjusted for loop time):

var1 @ var2 @ + var3 !   8019 ms
var1 var2 var3 +>   5657 ms

So even in the case of a fast DTC Forth, like DX Forth,
it's already something worth closer attention,
I believe.
I expect an even bigger gain with the older fig-Forth
model.

    ciforth is actually fig-Forth 5.5.3 with some ansification
    and abandonment of seventies-style tricks.

    I don't expect a gain in ciforth from this.

    --

    Groetjes Albert

  • From Anton Ertl@21:1/5 to Waldek Hebisch on Wed Feb 26 08:30:28 2025
    antispam@fricas.org (Waldek Hebisch) writes:
A potential alternative is
a pair of operations, say PUSH and POP, and a Forth compiler
that replaces a pair like V1 @ by PUSH(V1). Note that here
the address of V1 is intended to be part of PUSH (so it will
take as much space as separate V1 and @, but is only a
single primitive).

In Gforth variables are compiled as "lit <addr>", and Gforth has a
primitive LIT@, and a generalized constant-folding optimization that
replaces "lit <addr> @" with "LIT@ <addr>". In the Gforth image there
are 490 uses of lit@ (out of 33611 uses of primitives) and 76
occurrences of "lit @" (in parts that are compiled before the
generalized constant-folding is active). There are also 293
occurrences of "lit !".
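
(An editorial illustration of what this folding does to the threaded
code; the cell layout is sketched informally, not Gforth's exact
representation:)

\ before folding:  lit | <addr> | @    three cells, two primitives
\ after folding:   lit@ | <addr>       two cells, one primitive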

    However, given the minimal difference between the code produced for
    "LIT@" and "LIT @", LIT@ is no longer beneficial. E.g.:

    variable v ok
    : foo1 v @ + ; ok
    : foo2 v [ basic-block-end ] @ + ; ok
    see-code foo1
    $7F27184A08B0 lit@ 1->2
    $7F27184A08B8 v
    7F271806B580: mov rax,$08[rbx]
    7F271806B584: mov r15,[rax]
    $7F27184A08C0 + 2->1
    7F271806B587: add r13,r15
    $7F27184A08C8 ;s 1->1
    ...
    see-code foo2
    $7F27184A08F8 lit 1->2
    $7F27184A0900 v
    7F271806B596: mov r15,$08[rbx]
    $7F27184A0908 @ 2->2
    7F271806B59A: mov r15,[r15]
    $7F27184A0910 + 2->1
    7F271806B59D: add r13,r15
    $7F27184A0918 ;s 1->1
    ...

More generally, a simple "optimizer" that replaces short
sequences of Forth primitives by a different, shorter sequence
of primitives is likely to give a similar gain. However, the
chance of a match decreases with the length of the sequence.
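
(A minimal sketch of such a peephole pass in standard Forth, fusing
the pair DUP @ into a combined word. The names and the single
hard-wired pattern are illustrative only; a real system would hook
this into its compiler and use a table of patterns:)

: dup@ ( addr -- addr x ) dup @ ;  \ the combined word

variable pending  \ xt of the last word seen, not yet compiled (0 = none)

: flush-opt ( -- )  \ compile any withheld word
  pending @ ?dup if compile, 0 pending ! then ;

: opt-compile, ( xt -- )  \ compile xt, fusing DUP @ into DUP@
  dup ['] @ = pending @ ['] dup = and if
    drop 0 pending !  ['] dup@ compile,
  else
    flush-opt  pending !
  then ;

(Because each word is withheld one step, FLUSH-OPT has to run before
anything that ends or interrupts the definition.)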

    Gforth has that as static superinstructions. You can see the
    sequences in
    <http://git.savannah.gnu.org/cgit/gforth.git/tree/peeprules.vmg>, in
the lines before there is any occurrence of prim-states or something
    similar. As you can see, many of the formerly-used sequences are now
    commented out, because static superinstructions do not play well with
    a) static stack caching (currently static superinstructions only work
    for the default stack cache state) and b) IP-update optimization (if
    one of the primitives in the sequence has an immediate argument (e.g.,
    LIT), you would need additional variants for various IP offsets, or
    update the IP before the sequence).

    The remaining static superinstructions

    * have to do with stacks where we do not have stack caching (FP stack,
    locals stack, return stack),

    * are combinations of comparison primitives and ?BRANCH (this avoids
    the need to reify the result of the comparison in a general-purpose
    register), or

    * are sequences of typical memory-access words (not because they occur
    so often, but because it's better to have a small number of words
    that can be combined, and a number of combinations in the optimizer
    than to have a combinatorial explosion of words).
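
(An editorial illustration of the comparison/?BRANCH combination from
the second bullet; the spelling of the fused name is made up here:)

\ separate:  <         materializes a flag (-1/0) in a register
\            ?branch   tests that flag and branches
\ fused:     "< ?branch" compares the two operands and branches
\            directly, so the flag never needs to be reified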

Above you bet on relatively long sequences (and on the programmer
writing an alternative sequence). Shorter sequences have more
chance of matching, so you need a smaller number of them
for a similar gain.

    That's certainly our experience. Long sequences with high dynamic
    counts often come out of the inner loop of a single benchmark, and do
    not help other programs at all. We later preferred to go with static
    usage counts (i.e., the sequence occurs several times in the code),
    and this naturally leads to short sequences.

One can
do better than using the machine stack, namely keeping things in
registers, but that means generating machine code and doing
optimization.

    Gforth does stack caching at the level of primitives, by having
    several variants of the primitives for different start and end states
    of the primitives, and using a shortest-path search for finding out
    which combination of these variants to use. However, for multiple
    stacks this leads to a large number of states, and the shortest-path
    algorithm becomes too expensive. For now we only stack-cache the data
    stack.
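
(To make the state notation in the code listings in this thread
concrete: state n means that the top n stack items are held in
registers. An editorial sketch for +, with the first variant taken
from the listing above and the second inferred:)

\ + 2->1 : both operands cached (e.g. r13 = TOS, r15 = second):
\          add r13,r15   leaves the result as the one cached item
\ + 1->1 : only TOS cached; the second operand is first loaded
\          from the memory stack, then added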

    For extending this to multiple stacks, I see several alternatives:

    * Use a greedy algorithm instead of an optimal shortest-path
    algorithm. The difference is probably non-existent in most cases.

    * Manage the stack cache using register allocation techniques instead
    of representing it as an abstract state. This would often produce
    similar results as the greedy technique, but it can also handle
    stack manipulation words cheaply without having an explosion of
    stack states and the related complexity in the generator that
    generates the states and the tables for the state-handling.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

  • From albert@spenarnc.xs4all.nl@21:1/5 to Anton Ertl on Wed Feb 26 12:04:10 2025
    In article <2025Feb25.233542@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    zbigniew2011@gmail.com (LIT) writes:
    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms

    Too much OOS thinking? Try

    V1 @ V2 @ V1 ! V2 !

    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    The question is how often you use these new words in applications.

    These words might make sense connected to a sorting application. 1]
    Define those words there and don't clobber the global name space.


    - anton

    1] After testing of course.

    Groetjes Albert

  • From minforth@21:1/5 to LIT on Wed Feb 26 11:23:03 2025
    On Wed, 26 Feb 2025 9:08:19 +0000, LIT wrote:

I remain skeptical of such optimizations. Not even twice the
performance, and one has to hope it represents a bottleneck in
order to realize that gain.

I've got a feeling it would have more
significance in the 8088 era, say IBM 5150
or XTs. A 486 is probably already "too good"
to see as much as a 50% gain.
I've got a working XT board - if I manage to
get at least an FDD interface for it (no,
not today... it'll take some time) I'll
do some more testing.

Save yourself the time: use an emulator, e.g. PCem, DOSBox(-X), or QEMU.

  • From mhx@21:1/5 to All on Wed Feb 26 12:35:48 2025
    Results for iForth64.

    The runtime of test3 is remarkable. I think not much can be done
    about it, given the context.

    -marcel

    ---

    VARIABLE V1 7 V1 !
    VARIABLE V2 8 V2 !
    VARIABLE V3 9 V3 !

    : :=: ( a b -- ) \ exchange values among two variables
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : ++ ( a -- ) \ increment variable by one
    1 SWAP +! ;

    : +> ( a b c -- ) \ add two variables then store result into third one
    -ROT @ SWAP @ + SWAP ! ;

    : t1a V1 @ V2 @ + V3 ! ; : t1b V1 V2 V3 +> ;
    : t2a 1 V1 +! ; : t2b V1 ++ ;
    : t3a V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! ; : t3b V1 V2 :=: ;

    : TESTa S" TIMER-RESET #100000 0 DO #10000 0 DO " EVALUATE ; IMMEDIATE
    : TESTb S" LOOP LOOP 3 SPACES .ELAPSED " EVALUATE ; IMMEDIATE

    : test1 CR ." \ TEST1 : " TESTa t1a TESTb TESTa t1b TESTb ;
    : test2 CR ." \ TEST2 : " TESTa t2a TESTb TESTa t2b TESTb ;
    : test3 CR ." \ TEST3 : " TESTa t3a TESTb TESTa t3b TESTb ;

    : TESTS test1 test2 test3 ;

    TESTS
    \ TEST1 : 1.646 seconds elapsed. 1.661 seconds elapsed.
    \ TEST2 : 1.778 seconds elapsed. 1.728 seconds elapsed.
    \ TEST3 :
  • From Anton Ertl@21:1/5 to mhx on Wed Feb 26 14:32:50 2025
    mhx@iae.nl (mhx) writes:
    : :=: ( a b -- ) \ exchange values among two variables
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    <https://www.complang.tuwien.ac.at/forth/programs/sort.fs> contains:

    : exchange ( addr1 addr2 -- )
    over @ over @ >r swap ! r> swap ! ;

    Let's see if Gforth produces better code for one of them:

see-code :=:
$7FBD6B6A06A8 over 1->2
7FBD6B26B3B0: mov r15,$08[r10]
$7FBD6B6A06B0 @ 2->2
7FBD6B26B3B4: mov r15,[r15]
$7FBD6B6A06B8 >r 2->1
7FBD6B26B3B7: mov -$08[r14],r15
7FBD6B26B3BB: sub r14,$08
$7FBD6B6A06C0 dup 1->2
7FBD6B26B3BF: mov r15,r13
$7FBD6B6A06C8 @ 2->2
7FBD6B26B3C2: mov r15,[r15]
$7FBD6B6A06D0 rot 2->3
7FBD6B26B3C5: mov r9,$08[r10]
7FBD6B26B3C9: add r10,$08
$7FBD6B6A06D8 ! 3->1
7FBD6B26B3CD: mov [r9],r15
$7FBD6B6A06E0 r> 1->2
7FBD6B26B3D0: mov r15,[r14]
7FBD6B26B3D3: add r14,$08
$7FBD6B6A06E8 swap 2->3
7FBD6B26B3D7: add r10,$08
7FBD6B26B3DB: mov r9,r13
7FBD6B26B3DE: mov r13,[r10]
$7FBD6B6A06F0 ! 3->1
7FBD6B26B3E1: mov [r9],r15
$7FBD6B6A06F8 ;s 1->1
7FBD6B26B3E4: mov rbx,[r14]
7FBD6B26B3E7: add r14,$08
7FBD6B26B3EB: mov rax,[rbx]
7FBD6B26B3EE: jmp eax

see-code exchange
$7FBD6B6A0728 over 1->2
7FBD6B26B3F0: mov r15,$08[r10]
$7FBD6B6A0730 @ 2->2
7FBD6B26B3F4: mov r15,[r15]
$7FBD6B6A0738 over 2->3
7FBD6B26B3F7: mov r9,r13
$7FBD6B6A0740 @ 3->3
7FBD6B26B3FA: mov r9,[r9]
$7FBD6B6A0748 >r 3->2
7FBD6B26B3FD: mov -$08[r14],r9
7FBD6B26B401: sub r14,$08
$7FBD6B6A0750 swap 2->3
7FBD6B26B405: add r10,$08
7FBD6B26B409: mov r9,r13
7FBD6B26B40C: mov r13,[r10]
$7FBD6B6A0758 ! 3->1
7FBD6B26B40F: mov [r9],r15
$7FBD6B6A0760 r> 1->2
7FBD6B26B412: mov r15,[r14]
7FBD6B26B415: add r14,$08
$7FBD6B6A0768 swap 2->3
7FBD6B26B419: add r10,$08
7FBD6B26B41D: mov r9,r13
7FBD6B26B420: mov r13,[r10]
$7FBD6B6A0770 ! 3->1
7FBD6B26B423: mov [r9],r15
$7FBD6B6A0778 ;s 1->1
7FBD6B26B426: mov rbx,[r14]
7FBD6B26B429: add r14,$08
7FBD6B26B42D: mov rax,[rbx]
7FBD6B26B430: jmp eax

    These things are hard to predict:-)

    - anton

  • From minforth@21:1/5 to All on Wed Feb 26 13:05:35 2025
    This 1 billion times test of 3 cache cells is indeed remarkable. ;-)

  • From Anton Ertl@21:1/5 to Anton Ertl on Wed Feb 26 17:46:13 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mhx@iae.nl (mhx) writes:
    : :=: ( a b -- ) \ exchange values among two variables
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    Another variant:

    : exchange ( addr1 addr2 -- )
    dup @ rot !@ swap ! ;

    This uses the primitive

'!@' ( u1 a-addr -- u2 ) gforth-experimental "store-fetch"
load U2 from A_ADDR and store U1 there, as an atomic operation

    I worry that the atomic part will result in it being slower than the
    versions that do not use !@. Let's measure that:

    : exchange ( addr1 addr2 -- )
    over @ swap !@ swap ! ;

    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : bench-exchange ( addr1 addr2 -- )
    100000000 0 do 2dup exchange loop ;

    : bench-:=: ( addr1 addr2 -- )
    100000000 0 do 2dup :=: loop ;

    variable v1
    variable v2

    1 v1 !
    2 v2 !

    Measurement with
    perf stat -e cycles -e instructions gforth-fast xxxx.fs -e "v1 v2 bench-exchange bye"
    perf stat -e cycles -e instructions gforth-fast xxxx.fs -e "v1 v2 bench-:=: bye"

    Results on a Zen4:

    exchange :=:
    877_054_156 812_761_422 cycles
    3_708_692_329 3_908_642_117 instructions

So the !@ variant is indeed slower, but only a little (0.65 cycles
per execution of these words); however, I would have expected either
a big slowdown (from latency when dealing with the memory subsystem,
broadcasting to other cores, etc.) or none at all.

    And here's the code:
see-code exchange
$7EFDC12A06A8 over 1->2
7EFDC0DEA3B0: mov r15,$08[r10]
$7EFDC12A06B0 @ 2->2
7EFDC0DEA3B4: mov r15,[r15]
$7EFDC12A06B8 swap 2->1
7EFDC0DEA3B7: mov [r10],r15
7EFDC0DEA3BA: sub r10,$08
$7EFDC12A06C0 !@ 1->1
7EFDC0DEA3BE: mov rax,$08[r10]
7EFDC0DEA3C2: add r10,$08
7EFDC0DEA3C6: xchg $00[r13],rax
7EFDC0DEA3CA: mov r13,rax
$7EFDC12A06C8 swap 1->2
7EFDC0DEA3CD: mov r15,$08[r10]
7EFDC0DEA3D1: add r10,$08
$7EFDC12A06D0 ! 2->0
7EFDC0DEA3D5: mov [r15],r13
$7EFDC12A06D8 ;s 0->1
7EFDC0DEA3D8: mov r13,$08[r10]
7EFDC0DEA3DC: add r10,$08
7EFDC0DEA3E0: mov rbx,[r14]
7EFDC0DEA3E3: add r14,$08
7EFDC0DEA3E7: mov rax,[rbx]
7EFDC0DEA3EA: jmp eax

see-code :=:
$7FBD6B6A06A8 over 1->2
7FBD6B26B3B0: mov r15,$08[r10]
$7FBD6B6A06B0 @ 2->2
7FBD6B26B3B4: mov r15,[r15]
$7FBD6B6A06B8 >r 2->1
7FBD6B26B3B7: mov -$08[r14],r15
7FBD6B26B3BB: sub r14,$08
$7FBD6B6A06C0 dup 1->2
7FBD6B26B3BF: mov r15,r13
$7FBD6B6A06C8 @ 2->2
7FBD6B26B3C2: mov r15,[r15]
$7FBD6B6A06D0 rot 2->3
7FBD6B26B3C5: mov r9,$08[r10]
7FBD6B26B3C9: add r10,$08
$7FBD6B6A06D8 ! 3->1
7FBD6B26B3CD: mov [r9],r15
$7FBD6B6A06E0 r> 1->2
7FBD6B26B3D0: mov r15,[r14]
7FBD6B26B3D3: add r14,$08
$7FBD6B6A06E8 swap 2->3
7FBD6B26B3D7: add r10,$08
7FBD6B26B3DB: mov r9,r13
7FBD6B26B3DE: mov r13,[r10]
$7FBD6B6A06F0 ! 3->1
7FBD6B26B3E1: mov [r9],r15
$7FBD6B6A06F8 ;s 1->1
7FBD6B26B3E4: mov rbx,[r14]
7FBD6B26B3E7: add r14,$08
7FBD6B26B3EB: mov rax,[rbx]
7FBD6B26B3EE: jmp eax

    The difference looks bigger than it is: There are lines for 4
    additional primitives (no influence on performance) and 2 additional instructions, resulting in a 6-line difference.

    - anton

  • From Paul Rubin@21:1/5 to Anton Ertl on Wed Feb 26 11:44:15 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : ex ( a1 a2 -- ) 2>r 2r@ @ swap @ r> ! r> ! ;

    looks a little simpler.

The effort of implementing special native words for this, though, is
probably better spent on locals.

    : ex {: x y -- :} x @ y @ x ! y ! ;

  • From Anton Ertl@21:1/5 to Paul Rubin on Thu Feb 27 07:29:44 2025
    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : ex ( a1 a2 -- ) 2>r 2r@ @ swap @ r> ! r> ! ;

    looks a little simpler.

    This inspires another one:

    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    With some other versions this results in the following benchmark
    program:

    [defined] !@ [if]
    : exchange ( addr1 addr2 -- )
    over @ swap !@ swap ! ;
    [then]

    \ Paul Rubin <875xkwo5io.fsf@nightsong.com>
    : ex ( addr1 addr2 -- )
    2>r 2r@ @ swap @ r> ! r> ! ;

    : ex-locals {: x y -- :} x @ y @ x ! y ! ;

    \ Anton Ertl
    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    \ Marcel Hendrix
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    variable v1
    variable v2

    1 v1 !
    2 v2 !

    : bench ( "name" -- )
    v1 v2
    :noname ]] 100000000 0 do 2dup [[ parse-name evaluate ]] loop ; [[
    execute ;
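
(Usage, per the stack comment: BENCH parses the name that follows it,
so for example

bench exchange2

builds and runs a 100000000-iteration loop around EXCHANGE2, with the
two variable addresses supplied by BENCH itself.)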

    Results (on Zen4):

    gforth-fast (development):
    :=: exchange ex ex-locals exchange2
814_881_277 879_389_133 928_825_521 875_574_895 808_543_975 cyc.
3_908_874_164 3_708_891_336 4_508_966_770 4_209_778_557 3_708_865_505 inst.

    vfx64 5.43:
    :=: ex ex-locals exchange2
335_298_202 432_614_804 928_542_678 336_134_513 cyc.
1_166_400_242 1_366_264_943 2_866_547_067 1_166_280_641 inst.

    And here's the code produced by gforth-fast:

    :=: ex ex-locals exchange2
    over 1->2 2>r 1->0 l 1->1 dup >r 1->1
    mov r15,$08[r10] add r10,$08 mov rax,rbp >r 1->1
    @ 2->2 mov r15,r13 add r10,$08 mov -$8[r14],r13
    mov r15,[r15] mov r13,[r10] lea rbp,-$8[rbp] sub r14,$08
>r 2->1 mov -$8[r14],r13 mov -$8[rax],r13 @ 1->1
    mov -$8[r14],r15 sub r14,$10 mov r13,[r10] mov r13,$00[r13]
    sub r14,$08 mov [r14],r15 >l @local0 1->1 over 1->2
    dup 1->2 2r@ 0->2 @local0 1->1 mov r15,$08[r10]
    mov r15,r13 mov r13,$08[r14] mov rax,rbp @ 2->2
    @ 2->2 mov r15,[r14] lea rbp,-$8[rbp] mov r15,[r15]
    mov r15,[r15] @ 2->2 mov -$8[rax],r13 r> 2->3
    rot 2->3 mov r15,[r15] @ 1->1 mov r9,[r14]
    mov r9,$08[r10] swap 2->2 mov r13,$00[r13] add r14,$08
    add r10,$08 mov rax,r13 @local1 1->2 ! 3->1
    ! 3->1 mov r13,r15 mov r15,$08[rbp] mov [r9],r15
    mov [r9],r15 mov r15,rax @ 2->2 swap 1->2
r> 1->2 @ 2->2 mov r15,[r15] mov r15,$08[r10]
    mov r15,[r14] mov r15,[r15] @local0 2->3 add r10,$08
    add r14,$08 r> 2->3 mov r9,$00[rbp] ! 2->0
    swap 2->3 mov r9,[r14] ! 3->1 mov [r15],r13
    add r10,$08 add r14,$08 mov [r9],r15 ;s 0->1
    mov r9,r13 ! 3->1 @local1 1->2 mov r13,$08[r10]
    mov r13,[r10] mov [r9],r15 mov r15,$08[rbp] add r10,$08
    ! 3->1 r> 1->2 ! 2->0 mov rbx,[r14]
    mov [r9],r15 mov r15,[r14] mov [r15],r13 add r14,$08
    ;s 1->1 add r14,$08 lp+2 0->1 mov rax,[rbx]
    mov rbx,[r14] ! 2->0 mov r13,$08[r10] jmp eax
    add r14,$08 mov [r15],r13 add r10,$08
    mov rax,[rbx] ;s 0->1 add rbp,$10
    jmp eax mov r13,$08[r10] ;s 1->1
    add r10,$08 mov rbx,[r14]
    mov rbx,[r14] add r14,$08
    add r14,$08 mov rax,[rbx]
    mov rax,[rbx] jmp eax
    jmp eax

    - anton

  • From minforth@21:1/5 to All on Wed Feb 26 21:02:52 2025
    Really? ;-)

    NT/FORTH (C) 2005 Peter Fälth Version 1.6-983-824 Compiled on
    2017-12-03
    Running on Windows NT 6.2 Build 9200
    Current directory is e:\Develop\Forth\lxf
    : ex {: x y -- :} x @ y @ x ! y ! ; ok
    see ex
    A49E58 40917C 23 C80000 5 normal EX

    40917C 8B4500 mov eax , [ebp]
    40917F 8B00 mov eax , [eax]
    409181 8BCB mov ecx , ebx
    409183 8B09 mov ecx , [ecx]
    409185 8B5500 mov edx , [ebp]
    409188 890A mov [edx] , ecx
    40918A 8903 mov [ebx] , eax
    40918C 8B5D04 mov ebx , [ebp+4h]
    40918F 8D6D08 lea ebp , [ebp+8h]
    409192 C3 ret near
    ok
    : :=: OVER @ >R DUP @ ROT ! R> SWAP ! ; ok
    see :=:
    A49E6C 409193 23 C80000 5 normal :=:

    409193 8B4500 mov eax , [ebp]
    409196 8B00 mov eax , [eax]
    409198 8BCB mov ecx , ebx
    40919A 8B09 mov ecx , [ecx]
    40919C 8B5500 mov edx , [ebp]
    40919F 890A mov [edx] , ecx
    4091A1 8903 mov [ebx] , eax
    4091A3 8B5D04 mov ebx , [ebp+4h]
    4091A6 8D6D08 lea ebp , [ebp+8h]
    4091A9 C3 ret near
    ok

  • From mhx@21:1/5 to All on Thu Feb 27 17:47:08 2025
    An even weirder result for TEST3, although it probably has more to do
    with my aging DO LOOP construct.

    -marcel

    ---
    ANEW -oos

    VARIABLE V1 7 V1 !
    VARIABLE V2 8 V2 !
    VARIABLE V3 9 V3 !

    : :=: ( a b -- ) \ exchange values among two variables
    PARAMS| a b | a @ b @ swap b ! a ! ;

    : ++ ( a -- ) \ increment variable by one
    1 SWAP +! ;

    : +> ( a b c -- ) \ add two variables then store result into third one
    PARAMS| a b c | a @ b @ + c ! ;

    : t1a V1 @ V2 @ + V3 ! ; : t1b V1 V2 V3 +> ;
    : t2a 1 V1 +! ; : t2b V1 ++ ;
    : t3a V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! ; : t3b V1 V2 :=: ;

    : TESTS
    CR ." \ TEST1 : " TIMER-RESET #1000000000 0 DO t1a t1a t1a t1a t1a
    t1a t1a t1a t1a t1a LOOP .ELAPSED
    3 SPACES TIMER-RESET #1000000000 0 DO t1b t1b t1b t1b t1b
    t1b t1b t1b t1b t1b LOOP .ELAPSED
    CR ." \ TEST2 : " TIMER-RESET #1000000000 0 DO t2a t2a t2a t2a t2a
    t2a t2a t2a t2a t2a LOOP .ELAPSED
    3 SPACES TIMER-RESET #1000000000 0 DO t2b t2b t2b t2b t2b
    t2b t2b t2b t2b t2b LOOP .ELAPSED
    CR ." \ TEST3 : " TIMER-RESET #1000000000 0 DO t3a t3a t3a
  • From Paul Rubin@21:1/5 to Anton Ertl on Thu Feb 27 12:23:43 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Results (on Zen4):
    gforth-fast (development): ...

    It's interesting how little difference there is with gforth-fast. Could
    you also do gforth-itc? exchange2 is a big win with VFX, suggesting its optimizer could do better with some of the other versions.

  • From Anton Ertl@21:1/5 to Gerry Jackson on Thu Feb 27 22:53:47 2025
    Gerry Jackson <do-not-use@swldwa.uk> writes:
    On 27/02/2025 07:29, Anton Ertl wrote:
    \ Anton Ertl
    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;
    ...
    Results (on Zen4):

    gforth-fast (development):
    :=: exchange ex ex-locals exchange2
814_881_277 879_389_133 928_825_521 875_574_895 808_543_975 cyc.
3_908_874_164 3_708_891_336 4_508_966_770 4_209_778_557 3_708_865_505 inst.
...
    How does a crude definition not involving the R stack compare:
    : ex3 over @ over @ 3 pick ! over ! 2drop ;

    exchange2 ex3
    dup >r 1->1 over 1->1
>r 1->1 mov [r10],r13
    mov -$08[r14],r13 sub r10,$08
    sub r14,$08 mov r13,$10[r10]
    @ 1->1 @ 1->1
    mov r13,$00[r13] mov r13,$00[r13]
    over 1->2 over 1->2
    mov r15,$08[r10] mov r15,$08[r10]
    @ 2->2 @ 2->2
    mov r15,[r15] mov r15,[r15]
r> 2->3 fourth 2->3
    mov r9,[r14] mov r9,$10[r10]
    add r14,$08 ! 3->1
    ! 3->1 mov [r9],r15
    mov [r9],r15 over 1->2
    swap 1->2 mov r15,$08[r10]
    mov r15,$08[r10] ! 2->0
    add r10,$08 mov [r15],r13
    ! 2->0 2drop 0->0
    mov [r15],r13 add r10,$10
    ;s 0->1 ;s 0->1
    mov r13,$08[r10] mov r13,$08[r10]
    add r10,$08 add r10,$08
    mov rbx,[r14] mov rbx,[r14]
    add r14,$08 add r14,$08
    mov rax,[rbx] mov rax,[rbx]
    jmp eax jmp eax

EX3 plays to Gforth's strengths: copying words (e.g., OVER) instead
of shuffling words (e.g., SWAP), and removing superfluous stuff with
2DROP.

It also plays to VFX's strengths: being analytic about the data
stack. EXCHANGE2 was the fastest version (together with :=:) before;
here's that compared to EX3:

    exchange2 ex3
    334_718_398 273_592_214 cycles
    1_167_276_392 967_258_380 instructions

    EXCHANGE2 EX3
    PUSH RBX MOV RDX, [RBP]
    MOV RDX, [RBP] MOV RDX, 0 [RDX]
    MOV RDX, 0 [RDX] MOV RCX, 0 [RBX]
    POP RCX MOV RAX, [RBP]
    MOV RBX, 0 [RBX] MOV 0 [RAX], RCX
    MOV 0 [RCX], RDX MOV 0 [RBX], RDX
    MOV RDX, [RBP] MOV RBX, [RBP+08]
    MOV 0 [RDX], RBX LEA RBP, [RBP+10]
    MOV RBX, [RBP+08] RET/NEXT
    LEA RBP, [RBP+10] ( 29 bytes, 9 instructions )
    RET/NEXT
    ( 31 bytes, 11 instructions )

    - anton

  • From Gerry Jackson@21:1/5 to Anton Ertl on Thu Feb 27 22:05:09 2025
    On 27/02/2025 07:29, Anton Ertl wrote:
[..]
\ Anton Ertl
: exchange2 ( addr1 addr2 -- )
dup >r @ over @ r> ! swap ! ;
[..]


    How does a crude definition not involving the R stack compare:
    : ex3 over @ over @ 3 pick ! over ! 2drop ;

    --
    Gerry

  • From Anton Ertl@21:1/5 to Paul Rubin on Thu Feb 27 22:03:55 2025
    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Results (on Zen4):
    gforth-fast (development): ...

    It's interesting how little difference there is with gforth-fast. Could
    you also do gforth-itc?

    gforth-itc (development):
    :=: exchange ex ex-locals exchange2
7_527_256_553 5_224_615_325 6_825_283_178 9_238_357_501 7_036_128_309 c.
13_127_503_990 9_326_561_471 12_927_054_153 16_927_820_825 12_027_146_677 i.

    For comparison: gforth-fast (development):
    :=: exchange ex ex-locals exchange2
814_881_277 879_389_133 928_825_521 875_574_895 808_543_975 cyc.
3_908_874_164 3_708_891_336 4_508_966_770 4_209_778_557 3_708_865_505 inst.

    exchange2 is a big win with VFX, suggesting its
    optimizer could do better with some of the other versions.

On VFX, EXCHANGE2 achieves the same speed and the same number of
instructions as :=:. EX is slower because VFX does not analyse the
return stack, unlike the data stack. EX-LOCALS is slow because VFX's
locals implementation is not particularly good.

    To see what a better analysis can do, let's look at lxf:

    :=: ex ex-locals exchange2
    502_740_029 502_189_567 502_134_842 502_043_217 cycles
    1_701_663_782 1_701_657_866 1_701_677_273 1_701_684_186 instructions

    The cycles and instructions are worse (except for ex-locals) than with
    VFX, but that's due to inlining (which VFX does and lxf does not).

    E.g., here's lxf's code for EX-LOCALS:

    869204C 804FCE2 23 88C8000 5 normal EX-LOCALS

    804FCE2 8B4500 mov eax , [ebp]
    804FCE5 8B00 mov eax , [eax]
    804FCE7 8BCB mov ecx , ebx
    804FCE9 8B09 mov ecx , [ecx]
    804FCEB 8B5500 mov edx , [ebp]
    804FCEE 890A mov [edx] , ecx
    804FCF0 8903 mov [ebx] , eax
    804FCF2 8B5D04 mov ebx , [ebp+4h]
    804FCF5 8D6D08 lea ebp , [ebp+8h]
    804FCF8 C3 ret near

    It's the same code as lxf produces for :=:.

    The code lxf produces for EX and EXCHANGE2 is:

    804FCF9 8BC3 mov eax , ebx
    804FCFB 8B00 mov eax , [eax]
    804FCFD 8B4D00 mov ecx , [ebp]
    804FD00 8B09 mov ecx , [ecx]
    804FD02 890B mov [ebx] , ecx
    804FD04 8B5D00 mov ebx , [ebp]
    804FD07 8903 mov [ebx] , eax
    804FD09 8B5D04 mov ebx , [ebp+4h]
    804FD0C 8D6D08 lea ebp , [ebp+8h]
    804FD0F C3 ret near

    - anton

  • From Anton Ertl@21:1/5 to Anton Ertl on Fri Feb 28 21:55:05 2025
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Another variant:

    : exchange ( addr1 addr2 -- )
    dup @ rot !@ swap ! ;

    This uses the primitive

    '!@' ( u1 a-addr -- u2 ) gforth-experimental "store-fetch"
    load U2 from A_ADDR, and store U1 there, as atomic operation

    I worry that the atomic part will result in it being slower than the
    versions that do not use !@.

It's barely noticeable on Zen4, but it makes a big difference on the
Cortex-A55. Therefore we decided to also have a nonatomic !@. We
renamed the atomic one to ATOMIC!@, and !@ is now the nonatomic
version.
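
(A quick editorial illustration of the nonatomic semantics at the
listener:)

variable v   5 v !
7 v !@ .   \ prints the old value 5; v now contains 7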

    How do they perform?

    On Zen4:
    !@ atomic!@
    821_538_216 880_459_702 cycles
    3_815_202_629 3_710_937_849 instructions

    On Cortex-A55:
    !@ atomic!@
    3355427045 5856496676 cycles
    3115589778 4318749543 instructions

    - anton

  • From Anton Ertl@21:1/5 to Paul Rubin on Sat Mar 1 07:32:09 2025
    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    !@ is now the nonatomic version.

    Is the nonatomic one useful often?

    Some numbers of uses in the Gforth image:

    11 !@
    3 atomic!@
    66 +!

    We've done without it all this time.

    Sure, you can replace it with DUP @ >R ! R>. Having a word for that
    relieves the programmer of producing such a sequence (possibly with a
    bug) and the reader of having to analyse what's going on here.
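
(Spelled out as a definition, the replacement mentioned above; a
sketch for systems without the primitive, and of course non-atomic:)

: !@ ( x addr -- x' )  dup @ >r ! r> ;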

    I have now added stack-state variants for !@, resulting in better
    performance in some cases. Is !@ used often enough to merit the extra
    build time of Gforth? That's not clear, but the benefit I see is that
    I want to provide a system where the programmer does not have to
    wonder whether he should avoid !@ for better performance.

    I also tried out another variant that uses !@:

    : exchange4 ( addr1 addr2 -- )
    dup @ rot !@ swap ! ;

    The resulting code for EXCHANGE, EXCHANGE4, and EXCHANGE2 (the latter
    without !@):

    see-code exchange see-code exchange4 see-code exchange2
    over 1->2 dup 1->2 dup >r 1->1
    mov r15,$08[r12] mov r15,r8 >r 1->1
    @ 2->2 @ 2->2 mov -$08[r13],r8
    mov r15,[r15] mov r15,[r15] sub r13,$08
    swap 2->3 rot 2->3 @ 1->1
    add r12,$08 mov r9,$08[r12] mov r8,[r8]
    mov r9,r8 add r12,$08 over 1->2
    mov r8,[r12] !@ 3->2 mov r15,$08[r12]
    !@ 3->2 mov rax,r15 @ 2->2
    mov rax,r15 mov r15,[r9] mov r15,[r15]
    mov r15,[r9] mov [r9],rax r> 2->3
    mov [r9],rax swap 2->3 mov r9,$00[r13]
    swap 2->3 add r12,$08 add r13,$08
    add r12,$08 mov r9,r8 ! 3->1
    mov r9,r8 mov r8,[r12] mov [r9],r15
    mov r8,[r12] ! 3->1 swap 1->2
    ! 3->1 mov [r9],r15 mov r15,$08[r12]
    mov [r9],r15 ;s 1->1 add r12,$08
    ;s 1->1 mov rbx,$00[r13] ! 2->0
    mov rbx,$00[r13] add r13,$08 mov [r15],r8
    add r13,$08 mov rax,[rbx] ;s 0->1
    mov rax,[rbx] jmp eax mov r8,$08[r12]
    jmp eax add r12,$08
    mov rbx,$00[r13]
    add r13,$08
    mov rax,[rbx]
    jmp eax

EXCHANGE performs one instruction fewer than EXCHANGE2, and EXCHANGE4
performs two instructions fewer than EXCHANGE2; both contain three
fewer primitives.

    Performance on Zen4:
    exchange exchange4 exchange2
    748_033_428 699_870_875 809_204_577 cycles
    3_610_871_416 3_510_578_833 3_710_662_751 instructions

    - anton

  • From Anton Ertl@21:1/5 to Anton Ertl on Sat Mar 1 11:47:54 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Paul Rubin <no.email@nospam.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    !@ is now the nonatomic version.

    Is the nonatomic one useful often?

    Some numbers of uses in the Gforth image:

    11 !@
    3 atomic!@
    66 +!

    We've done without it all this time.

    Sure, you can replace it with DUP @ >R ! R>. Having a word for that
    relieves the programmer of producing such a sequence (possibly with a
    bug) and the reader of having to analyse what's going on here.

Another point: These 11 uses of non-atomic !@ used to be uses of the
slow atomic !@. So even the slow atomic !@ was preferred by the
programmer over doing it with @, ! and stack manipulation. In that
situation the non-atomic !@ provides the wanted capability without
incurring the cost of atomicity.

    - anton

  • From Anton Ertl@21:1/5 to Hans Bezemer on Sat Mar 1 17:22:45 2025
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    On 01-03-2025 12:47, Anton Ertl wrote:
    11 !@
    3 atomic!@
    66 +!

    We've done without it all this time.

    Sure, you can replace it with DUP @ >R ! R>. Having a word for that
    relieves the programmer of producing such a sequence (possibly with a
    bug) and the reader of having to analyse what's going on here.

    I found the sequence exactly twice in my code

Yes, you can replace !@ with that sequence, but not every case where
one fetches one value from an address and stores another value to
that address is expressed by this sequence. E.g., another equivalent
sequence is: DUP >R @ SWAP R> !; and another: DUP @ -ROT !. And you
can also use that word profitably in cases where some other
functionality is mixed in with the !@-free code. E.g., none of
the !@-free variants of :=: etc. in this thread contains either of
the two sequences; in several of them the ! of the other address is
inserted before the ! of the address that is fetched the second time.
E.g.,

    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    Yet

    : exchange ( addr1 addr2 -- )
    over @ swap !@ swap ! ;

    is shorter, easier to follow, and (in gforth-fast) faster.
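
(For readers following along, an editorial step-by-step stack trace
of this EXCHANGE:)

\ ( addr1 addr2 )  over  ( addr1 addr2 addr1 )
\                  @     ( addr1 addr2 x1 )
\                  swap  ( addr1 x1 addr2 )
\                  !@    ( addr1 x2 )  \ stores x1 at addr2, fetches old x2
\                  swap  ( x2 addr1 )
\                  !     ( )           \ stores x2 at addr1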

    As mentioned, Bernd Paysan used !@ 11 times in the Gforth image in
    code where atomicity is not needed. Up to yesterday we only had the
    atomic version and I have avoided using !@ because I was worried that
    it would be slow, so there may be some additional opportunity in the
    Gforth image for using it.

However, if it is that rare, there is no point in adding it. Creating
too many superfluous abstractions may even become counterproductive,
in the sense that predefined abstractions are ignored and reinvented.

    In that case they are obviously not superfluous. Yes, reinvention
    happens; it shows that the word is needed. Then at some point
    somebody notices the duplication, decides on a canonical version and
    goes through the code and replaces all uses of the duplicated words
    with the canonical version.

    There is a valid reason to avoid rarely used words that can be
    replaced by a sequence: human memory load. I don't think that !@ is
    such a case, though.

    - anton

  • From mhx@21:1/5 to All on Sat Mar 1 21:35:16 2025
    I can't find `DUP @ >R ! R>` (+ variants with spacings)
    in any of 1667 files.
    However, `DUP @ >R` is found 12 times and `! R>` 29 times.

    `DUP @ -ROT !` gets hit 0 times, `DUP >R @ SWAP R> !` once.

    -marcel

  • From Paul Rubin@21:1/5 to Anton Ertl on Fri Feb 28 14:45:14 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    !@ is now the nonatomic version.

    Is the nonatomic one useful often? We've done without it all this time.

  • From minforth@21:1/5 to All on Mon Feb 24 20:34:25 2025
    An optimising Forth compiler does exactly that.

    NT/FORTH for example:

    : +> rot @ rot @ + swap ! ; ok
    see +>
    A49E6C 409196 21 C80000 5 normal +>

    409196 8B4504 mov eax , [ebp+4h]
    409199 8B00 mov eax , [eax]
    40919B 8B4D00 mov ecx , [ebp]
    40919E 8B09 mov ecx , [ecx]
    4091A0 01C8 add eax , ecx
    4091A2 8903 mov [ebx] , eax
    4091A4 8B5D08 mov ebx , [ebp+8h]
    4091A7 8D6D0C lea ebp , [ebp+Ch]
    4091AA C3 ret near
    ok

  • From Anton Ertl@21:1/5 to LIT on Mon Feb 24 21:50:21 2025
    zbigniew2011@gmail.com (LIT) writes:
I wonder: wouldn't it be useful to have stackless basic
arith operations? I mean, instead of fetching the values
first and putting them on the stack, then doing something,
and in the end storing the result somewhere, wouldn't it
be practical to use the variables directly?

    I don't remember ever doing that, so no, it would not be practical.

    Forth has had values for quite a while, so you could avoid the need to
    write @ and !; you would instead write something like:

    a b + to c

    For global values the code is typically not better than when using
    variables, though.
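
(A minimal sketch with values; the names are illustrative:)

0 value a   0 value b   0 value c
: c=a+b ( -- )  a b + to c ;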

    But after I came up with this idea I realized someone
    surely invented that before - it looks so obvious — yet
    I didn't see it anywhere.

The VAX architecture has instructions with three memory operands,
including ADDL3. That feature makes it pretty hard to implement
efficiently.

    Did anyone of you see something
    like this in any code?

    No.

    If so — actually why somehow
    (probably?) such solution has not become widespread?

    Probably because the case where the two operands of a + are in memory,
    and the result is needed in memory is not that frequent.

    Looks good to me; math can be done completely in ML
    avoiding "Forth machine" engagement, therefore saving many
    cycles.

    Not sure what you mean with "Forth machine engagement"; with good
    Forth compilers these days, a typical stack-to-stack addition is
    faster than the best machine code for a memory-to-memory addition.
    E.g. VFX64 turns

    : dec-u#b ( u1 -- u2 )
    dup #-3689348814741910323 um* nip 3 rshift tuck 10 * - '0' + hold ; ok

    into

    ( 0050A300 48BACDCCCCCCCCCCCCCC ) MOV RDX, # CCCCCCCC:CCCCCCCD
    ( 0050A30A 488BC2 ) MOV RAX, RDX
    ( 0050A30D 48F7E3 ) MUL RBX
    ( 0050A310 48C1EA03 ) SHR RDX, # 03
    ( 0050A314 486BCA0A ) IMUL RCX, RDX, # 0A
    ( 0050A318 482BD9 ) SUB RBX, RCX
    ( 0050A31B 4883C330 ) ADD RBX, # 30
    ( 0050A31F 488D6DF8 ) LEA RBP, [RBP+-08]
    ( 0050A323 48895500 ) MOV [RBP], RDX
    ( 0050A327 E87CA7F1FF ) CALL 00424AA8 HOLD
    ( 0050A32C C3 ) RET/NEXT
    ( 45 bytes, 11 instructions )

    I don't think that it would be faster or shorter to use
    memory-to-memory operations here. That's also why the VAX died: RISCs
    just outperformed it.

    - anton

  • From minforth@21:1/5 to All on Mon Feb 24 21:51:26 2025
    With respect, the more important questions are:
    For what type of machine?
    Desktop or embedded?
    Minimal kernel only or full standard compliant?
    Hobby or professional support/service required?

    But to mention another example:
    https://mecrisp.sourceforge.net/#

  • From Anton Ertl@21:1/5 to LIT on Tue Feb 25 07:26:58 2025
    zbigniew2011@gmail.com (LIT) writes:
    Probably because the case where the two operands
    of a + are in memory, and the result is needed
    in memory is not that frequent.

    One example could be matrix multiplication.
    It's rather trivial but cumbersome operation,
    where usually a few transitional variables are
    used to maintain clarity of the code.

    Earlier you wrote about performance, now you switch to clarity of the
    code. What is the goal?

If we stick with performance, the fastest version in
<http://theforth.net/package/matmul/current-view/matmul.4th> on all
systems that I measured (among those that do not use a primitive
FAXPY) is version 2, and it spends most of its time in:

    : faxpy-nostride ( ra f_x f_y ucount -- )
    \ vy=ra*vx+vy
    dup >r 3 and 0 ?do
    fdup over f@ f* dup f+! float+ swap float+ swap
    loop
    r> 2 rshift 0 ?do
    fdup over f@ f* dup f+! float+ swap float+ swap
    fdup over f@ f* dup f+! float+ swap float+ swap
    fdup over f@ f* dup f+! float+ swap float+ swap
    fdup over f@ f* dup f+! float+ swap float+ swap
    loop
    2drop fdrop ;

    It's not the clearest code, and certainly the version without
    unrolling is clearer (and may be almost as fast in the newer versions
    of SwiftForth and VFX which make counted loops significantly faster):

    : faxpy-nostride ( ra f_x f_y ucount -- )
    \ vy=ra*vx+vy
    0 ?do
    fdup over f@ f* dup f+! float+ swap float+ swap
    loop
    2drop fdrop ;

    Each iteration performs 2 FP loads and 1 FP store. With
    memory-to-memory variants of F* and F+ that would be 4 FP loads and 2
    FP stores, and I don't think it would be any clearer. And if you use memory-to-memory variants of the address computation, things would
    become even slower. And I doubt that they would become clearer.

    Some time later I worked on how SIMD could be integrated into Forth,
    and used matrix multiplication as an example. With the wordset I
    propose this whole loop became

    ( v1 r addr ) v@ f*vs f+v ( v2 )

    Only one memory access is visible here at all; there are some more in
    the implementation of these words, however. You can find the paper
    about that at <http://www.euroforth.org/ef17/papers/ertl.pdf>. A
    further refinement of that work can be found at <https://www.complang.tuwien.ac.at/papers/ertl18manlang.pdf>
    (presented in a Java setting for the audience of the conference, but
    the implementation was in a Forth setting, see <https://github.com/AntonErtl/vectors>). This work eliminates many of
    the memory accesses that the earlier implementation performs,
    demonstrating that the memory accesses are not fundamental in the
    model. In particular, Figure 11 shows code corresponding to

    ( v1 r1 addr1 r2 addr2 ) v@ f*vs v@ f+v v@ f*vs f+v ( v2 )

    i.e., the code above unrolled by a factor of 2; it has 3 SIMD loads
    and 1 SIMD store per SIMD-granule processed (the SIMD granule is 4
    doubles for AVX). Further unrolling results in even fewer loads and
    stores per FLOP (FP multiplication and FP addition).

    Probably "bigger" Forth compilers are indeed
    already "too good" for the difference to be
    (practically) noticeable — still maybe for
    simpler Forths, I mean like the ones for DOS
    or even for 8-bit machines it would make sense?

    Forth was designed for small machines and very simple implementations.
    We have words like "1+" that are beneficial in that setting. We also
    have "+!", which is the closest to what you have in mind. But even in
    those times nobody went for a word like "+> ( addr1 addr2 addr3 -- )",
    because it is not useful often enough.

    - anton

  • From Anton Ertl@21:1/5 to LIT on Tue Feb 25 09:07:19 2025
    zbigniew2011@gmail.com (LIT) writes:
    [Anton Ertl:]
    Earlier you wrote about performance, now you switch to clarity of the
    code. What is the goal?

Both; one isn't contrary to the other.

    Sometimes the clearer code is slower and the faster code is less clear
    (as in the FAXPY-NOSTRIDE example).

    What I have in mind is: by performing OOS operation
    we don't have to employ the whole "Forth machine" to
    do the usual things (I mean, of course, the usual
    steps described by Brad Rodriguez in his "Moving
    Forth" paper).

    What does "OOS" stand for? What do you mean with "the usual steps"; I
    am not going to read the whole paper and guess which of the code shown
    there you have in mind.

    It comes with a cost: usual Forth words, that use
    the stack, are versatile, while such OOS words
    aren't that versatile anymore — yet (at least in
    the case of ITC non-optimizing Forths) they should
    be faster.

    One related thing is the work on "register"-based virtual machines.
    For interpreted implementations the VM registers are in memory, but
    they are accessed by "register number"; these usually correspond to
locals slots on machines like the JavaVM. A well-known example of
    that is the switch of the Lua VM from stack-based to register-based.
    A later example is Android's Dalvik VM for Java, in contrast to the
    stack-based JavaVM.

    There is a paper [shi+08] that provides an academic justification for
    this approach. The gist of it is that, with some additional compiler complexity, the register-based machine can reduce the number of NEXTs
    (in Forth threaded-code terminology); depending on the implementation
    approach and the hardware, the NEXTs could be the major cost at the
time. However, already at the time dynamic superinstructions (an implementation technique for virtual-machine interpreters) reduced the
    number of NEXTs to one per basic block, and VM registers did nothing
    to reduce NEXTs in that case; Shi et al. also showed that with a lot
    of compiler sophistication (data flow analysis etc.) VM registers can
    be as fast as stacks even with dynamic superinstructions.

    However, given that dynamic superinstructions are easier to implement
    and the VM registers do not give a benefit when they are employed, why
    would one go for VM registers? Of course, in the Forth setting one
    could offload the optimization onto the programmer, but even Chuck
    Moore did not go there.

    In any case, here's an example extracted from Figure 6 of the paper:

    Java VM (stack) VM registers
    19 iload_1
    20 bipush #31 iconst #31 -> r1
    21 imul imul r6 r1 -> r3
    22 aload_0
    23 getfield value getfield r0.value -> r5
    24 iload_3
    26 caload caload r5 r7 -> r5
    27 iadd iadd r3 r5 -> r6
    28 istore_1

    So yes, the VM register code contains fewer VM instructions. Is it
    clearer?

    The corresponding Gforth code is the stuff between IF and THEN in the following:

    0
    value: some-field
    value: value
    constant some-struct

    : foo
    {: r0 r1 r3 :}
    if
    r1 31 * r0 value r3 + c@ + to r1
    then
    r1 ;

    The code that Gforth produces for the basic block under consideration
    is:

    $7FC624AA0958 @local1 1->1
    7FC62464A5BA: mov [r10],r13
    7FC62464A5BD: sub r10,$08
    7FC62464A5C1: mov r13,$08[rbp]
    $7FC624AA0960 lit 1->2
    $7FC624AA0968 #31
    7FC62464A5C5: sub rbx,$50
    7FC62464A5C9: mov r15,-$08[rbx]
    $7FC624AA0970 * 2->1
    7FC62464A5CD: imul r13,r15
    $7FC624AA0978 @local0 1->2
    7FC62464A5D1: mov r15,$00[rbp]
    $7FC624AA0980 lit+ 2->2
    $7FC624AA0988 #8
    7FC62464A5D5: add r15,$18[rbx]
    $7FC624AA0990 @ 2->2
    7FC62464A5D9: mov r15,[r15]
    $7FC624AA0998 @local2 2->3
    7FC62464A5DC: mov r9,$10[rbp]
    $7FC624AA09A0 + 3->2
    7FC62464A5E0: add r15,r9
    $7FC624AA09A8 c@ 2->2
    7FC62464A5E3: movzx r15d,byte PTR [r15]
    $7FC624AA09B0 + 2->1
    7FC62464A5E7: add r13,r15
    $7FC624AA09B8 !local1 1->1
    7FC62464A5EA: add r10,$08
    7FC62464A5EE: mov $08[rbp],r13
    7FC62464A5F2: mov r13,[r10]
    7FC62464A5F5: add rbx,$50

    There are 8 loads and 2 stores in that code. If the VM registers are
    held in memory (as they usually are, and as the Gforth locals are),
    the VM register code performs at least 9 loads (7 register accesses,
    the getfield, and the caload) and 5 stores. Of course, in Forth one
    would write the block as:

    : foo1 ( n3 a0 n1 -- n )
    31 * swap value rot + c@ + ;

    and the code for that is (without the ";"):

    $7FC624AA0A10 lit 1->2
    $7FC624AA0A18 #31
    7FC62464A617: mov r15,$08[rbx]
    $7FC624AA0A20 * 2->1
    7FC62464A61B: imul r13,r15
    $7FC624AA0A28 swap 1->2
    7FC62464A61F: mov r15,$08[r10]
    7FC62464A623: add r10,$08
    $7FC624AA0A30 lit+ 2->2
    $7FC624AA0A38 #8
    7FC62464A627: add r15,$28[rbx]
    $7FC624AA0A40 @ 2->2
    7FC62464A62B: mov r15,[r15]
    $7FC624AA0A48 rot 2->3
    7FC62464A62E: mov r9,$08[r10]
    7FC62464A632: add r10,$08
    $7FC624AA0A50 + 3->2
    7FC62464A636: add r15,r9
    $7FC624AA0A58 c@ 2->2
    7FC62464A639: movzx r15d,byte PTR [r15]
    $7FC624AA0A60 + 2->1
    7FC62464A63D: add r13,r15

    6 loads, 0 stores.

    And if we feed the equivalent standard code

    0
    field: some-field
    field: value-addr
    constant some-struct

    : foo1 ( n3 a0 n1 -- n )
    31 * swap value-addr @ rot + c@ + ;

    into other Forth systems, some produce even better code:

    VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
    FOO1
    ( 0050A310 486BDB1F ) IMUL RBX, RBX, # 1F
    ( 0050A314 488B5500 ) MOV RDX, [RBP]
    ( 0050A318 488B4D08 ) MOV RCX, [RBP+08]
    ( 0050A31C 48034A08 ) ADD RCX, [RDX+08]
    ( 0050A320 480FB609 ) MOVZX RCX, Byte 0 [RCX]
    ( 0050A324 4803D9 ) ADD RBX, RCX
    ( 0050A327 488D6D10 ) LEA RBP, [RBP+10]
    ( 0050A32B C3 ) RET/NEXT
    ( 28 bytes, 8 instructions )

    5 loads, 0 stores. And VFX does not do data-flow analysis across
    basic blocks, unlike the Java VM -> VM register compiler that Shi
    used; i.e., VFX is probably simpler than the compiler Shi used.

    @Article{shi+08,
    author = {Yunhe Shi and Kevin Casey and M. Anton Ertl and
    David Gregg},
    title = {Virtual machine showdown: Stack versus registers},
    journal = {ACM Transactions on Architecture and Code
    Optimization (TACO)},
    year = {2008},
    volume = {4},
    number = {4},
    pages = {21:1--21:36},
    month = jan,
    url = {http://doi.acm.org/10.1145/1328195.1328197},
    abstract = {Virtual machines (VMs) enable the distribution of
    programs in an architecture-neutral format, which
    can easily be interpreted or compiled. A
    long-running question in the design of VMs is
    whether a stack architecture or register
    architecture can be implemented more efficiently
    with an interpreter. We extend existing work on
    comparing virtual stack and virtual register
    architectures in three ways. First, our translation
    from stack to register code and optimization are
    much more sophisticated. The result is that we
    eliminate an average of more than 46\% of
    executed VM instructions, with the bytecode size of
    the register machine being only 26\% larger
    than that of the corresponding stack one. Second, we
    present a fully functional virtual-register
    implementation of the Java virtual machine (JVM),
    which supports Intel, AMD64, PowerPC and Alpha
    processors. This register VM supports
    inline-threaded, direct-threaded, token-threaded,
    and switch dispatch. Third, we present experimental
    results on a range of additional optimizations such
    as register allocation and elimination of redundant
    heap loads. On the AMD64 architecture the register
    machine using switch dispatch achieves an average
    speedup of 1.48 over the corresponding stack
    machine. Even using the more efficient
    inline-threaded dispatch, the register VM achieves a
    speedup of 1.15 over the equivalent stack-based VM.}
    }

    Clarity of the code comes as a "bonus" :) Yes, we've
    got VALUEs and I use them when needed, but their use
    still means employing the "Forth machine".

    What do you mean with 'the "Forth machine"', and how does "OOS"
    (whatever that is) avoid it?

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to LIT on Tue Feb 25 11:20:47 2025
    zbigniew2011@gmail.com (LIT) writes:
    I mean the description how the "Forth machine" works:

    "Assume SQUARE is encountered while executing some other Forth word.
    Forth's Interpreter Pointer (IP) will be pointing to a cell in memory
    -- contained within that "other" word -- which contains the address of
    the word SQUARE. (To be precise, that cell contains the address of
    SQUARE's Code Field.) The interpreter fetches that address, and then
    uses it to fetch the contents of SQUARE's Code Field. These contents
    are yet another address -- the address of a machine language
    subroutine which performs the word SQUARE. In pseudo-code, this is:

    (IP) -> W      fetch memory pointed by IP into "W" register
                   ...W now holds address of the Code Field
    IP+2 -> IP     advance IP, just like a program counter
                   (assuming 2-byte addresses in the thread)
    (W) -> X       fetch memory pointed by W into "X" register
                   ...X now holds address of the machine code
    JP (X)         jump to the address in the X register

    This illustrates an important but rarely-elucidated principle: the
    address of the Forth word just entered is kept in W. CODE words don't
    need this information, but all other kinds of Forth words do.

    If SQUARE were written in machine code, this would be the end of the
    story: that bit of machine code would be executed, and then jump back to
    the Forth interpreter -- which, since IP was incremented, is pointing to
    the next word to be executed. This is why the Forth interpreter is
    usually called NEXT.

    But, SQUARE is a high-level "colon" definition... [..]" etc.

    ( https://www.bradrodriguez.com/papers/moving1.htm )

    Many of these steps can, in particular cases, be avoided
    by the use of the proposed OOS words, making the Forth
    program (at least sometimes) faster -- and, as a kind of
    "bonus", the clarity of the code increases.

    What Rodriguez describes above is NEXT. As I mentioned in the earlier
    posting, using a VM with VM registers reduces the number of NEXTs
    executed, but if you go for dynamic superinstructions or native-code
    compilation, the number of NEXTs is reduced even more. And this can
    be done while still working with ordinary Forth code, no OOS needed.
    And these kinds of compilers can be done with relatively little
    effort.
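
    For concreteness: on a 16-bit ITC system like fig-Forth for the
    8086, the whole of NEXT is about three instructions (a minimal
    sketch, assuming SI holds IP and AX serves as W):

    NEXT: lodsw          ; (IP) -> W, IP+2 -> IP
          mov BX,AX
          jmp [BX]       ; (W) -> X, jump to X

    That per-word dispatch cost is what dynamic superinstructions
    and native-code compilation reduce or remove.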

    From what I've already learned here, in the case of an "optimizing
    compiler" the gain may not be too significant; still, in the case
    of simpler compilers -- and maybe especially those created for CPUs
    not that suitable for Forth at all (lacking registers, like the
    8051, for example) -- it may well be advantageous.

    I cannot speak about the 8051, but machine Forth is a simple
    native-code system and it's stack-based.

    By the "Forth machine" I mean that internal work of the
    Forth compiler - see the above quote from Brad's paper
    - and when we don't need to "fetch memory pointed by
    IP into "W" register, advance IP, just like a program
    counter" etc. etc. — replacing the whole process,
    (which is repeated for each subsequent word again and
    again) by a short string of ML instructions — we should
    note significant gain in the processing speed.

    Yes, dynamic superinstructions provide a good speedup for Gforth, and native-code systems also show a good speedup compared to classic
    threaded-code systems. But it's not necessary to eliminate the stack
    for that. Actually dealing with the stack is orthogonal to
    threaded code vs. native code.
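
    A minimal illustration of that orthogonality (hypothetical code,
    not from any particular system): the same stack-based source

    : BUMP ( -- )  V1 @ 1+ V1 ! ;

    can be executed as a thread of code-field addresses (one NEXT per
    word) or compiled to native code along the lines of

          mov AX,[V1]    ; V1 @
          inc AX         ; 1+
          mov [V1],AX    ; V1 !

    The source is pure stack code either way; only the execution
    mechanism differs.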

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to All on Tue Feb 25 11:16:26 2025
    But, SQUARE is a high-level "colon" definition... [..]" etc.

    ( https://www.bradrodriguez.com/papers/moving1.htm )

    Many of these steps can, in particular cases, be avoided
    by the use of the proposed OOS words, making the Forth
    program (at least sometimes) faster -- and, as a kind of
    "bonus", the clarity of the code increases.

    After having avoided premature optimisation, every 'decent'
    Forth programmer will recode a few bottleneck words, e.g.
    in assembler, where necessary. IOW, microbenchmarking SQUARE,
    which can be implemented in a handful of lines of machine code
    or less, does not bring new insights.
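
    For instance, a recoded SQUARE on a 16-bit x86 Forth might be no
    more than this (a sketch only, assuming a fig-Forth-style NEXT
    and the data stack on SP):

    SQUARE:
          pop BX
          mov AX,BX
          imul BX        ; DX:AX = n*n
          push AX        ; keep the low 16 bits
          jmp NEXT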

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to LIT on Tue Feb 25 13:45:09 2025
    In article <591e7bf58ebb1f90bd34fba20c730b83@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    I wonder: wouldn't it be useful to have stackless basic
    arith operations? I mean: instead of fetching the values
    first and putting them on the stack, then doing something,
    and in the end storing the result somewhere, wouldn't it
    be practical to use the variables directly? Like this:

    : +> ( addr1 addr2 addr3 -- )
    rot @ rot @ + swap ! ;

    Of course the above is just an illustration; I mean coding
    such a word directly in ML. It should be significantly
    faster than going through the stack the usual way.

    But after I came up with this idea I realized someone
    surely invented it before - it looks so obvious -- yet
    I haven't seen it anywhere. Did any of you see something
    like this in any code? If so -- why hasn't such a
    solution become widespread?
    It looks good to me; the math can be done completely in ML,
    avoiding "Forth machine" engagement and therefore saving
    many cycles.

    I have done some work on optimisation in ciforth.
    This work has stalled, but on the infamous byte prime benchmark
    it was in the ballpark of swiftforth and mpeforth.
    (Disingenuous, because this was the example I used.)
    See https://home.hccnet.nl/a.w.m.van.der.horst/forthlecture5.html
    This is about folding, a generalisation of constant folding.
    This requires that you know the properties of the Forth words,
    i.e. that you can execute + at compile time if the inputs
    are constant.
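
    A minimal illustration (hypothetical definitions, not from
    ciforth): with folding, these two definitions compile to
    identical code.

    : K1   3 4 + ;   \ + executed at compile time, leaving literal 7
    : K2   7 ;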

    The next step is inlining, which requires transforming control
    structures to jumps. This eliminates all call/return pairs.

    A further step is replacing stack-offset operations with register
    operations. I have succeeded in eliminating the use of the return
    stack in a resulting block of code. Remember, there are no longer
    return addresses on the return stack.

    Then I got stalled. I introduced complicated rules to handle
    pop, push and operators, simplifying by interchanging and
    transforming. E.g. the rule
    movipop-pattern DUP matches? IF ?movipop-replace? ELSE
    tests whether a pattern applies, then executes the replacement.
    This is one of the "no brain" matches, the simplest.

    "
    <! !Q! MOVI|X, !!T 0 {L,} ~!!T !Q! POP|X, !!T !>
    <A Q: POP|X, !TALLY NEXT A>
    { bufv 7 + C@ 7 AND bufc 1 + OR!U }
    optimisation movipop-pattern \ A object

    \ Relying heavily on a smart assembler/disassembler
    \ optimisation is a class name.

    \ :" it is all the same register."
    : movipop-same bufv 1+ C@ bufv 7 + C@ XOR 7 AND 0= ;

    \ Optional replace, leave " was replaced".
    : ?movipop-replace? movipop-same DUP IF replace THEN ;
    REGRESS HERE Q: MOVI|X, BX| 0 IL, Q: POP|X, AX| matches? movipop-same S: TRUE
    "

    This is going nowhere. Instead the technique of replacing
    cells offset from the data stack must be used.
    It has proven to work, totally replacing return-stack
    manipulations by registers.

    I chase a different goal here: code that I can't improve by
    studying the assembler output. Pretty silly, given that i86 is
    a dying architecture.

    The goal can be attained. I remember a 4-page comparison function
    in C, compiled by the Intel C compiler. There was not a single
    thing to improve upon.

    And a general remark: optimise where it counts, replace the
    bottleneck. That is practical. All the rest is sport, like
    Mount Everest or the South Pole.

    --
    Temu exploits Christians: (Disclaimer, only 10 apostles)
    Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
    Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
    And Gifts For Friends Family And Colleagues.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From albert@spenarnc.xs4all.nl@21:1/5 to LIT on Tue Feb 25 13:46:52 2025
    In article <a81ac9ee2ed92686e940a55bed9d4dfb@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    In case of slower ITC non-optimizing Forths - like
    fig-Forth, as the most obvious example - the "boost"
    may be noticeable.
    I'll check that.

    code +> ( x y z -- )
    dx pop cx pop bx pop 0 [bx] ax mov cx bx xchg
    0 [bx] ax add dx bx xchg ax 0 [bx] mov next
    end-code

    Timing (adjusted for loop time):

    var1 @ var2 @ + var3 ! 8019 mS
    var1 var2 var3 +> 5657 mS

    So even in the case of a fast DTC Forth, like DX Forth,
    it's already something worth closer attention,
    I believe.
    I expect an even bigger gain in the case of the older
    fig-Forth model.

    Gain is only to be expected in the context of an application.

    --

    Groetjes Albert
    --
    Temu exploits Christians: (Disclaimer, only 10 apostles)
    Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
    Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
    And Gifts For Friends Family And Colleagues.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@21:1/5 to LIT on Tue Feb 25 13:25:00 2025
    On Tue, 25 Feb 2025 11:40:46 +0000, LIT wrote:

    I agree with you - still, it takes a decent Forth programmer.
    Recall the ones described by Jeff Fox? Those Forth programmers
    who refused to use Machine Forth just because "they were hired
    to program in ANS Forth"? I don't believe they would have been
    able to recode anything in assembler - and note, that was about
    30 years ago. Since then assembler programming has become even
    less popular.

    A bit off-topic: I have been in a similar situation when some of
    our service engineers were very reluctant to modify inner
    software parts of controllers. The guys were not dumb, but with
    such modifications comes responsibility when something unexpected
    happens, like a system crash. So it was more of a legal than a
    technical issue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Gerry Jackson@21:1/5 to LIT on Tue Feb 25 15:32:47 2025
    On 25/02/2025 14:36, LIT wrote:
    A bit off-topic: I have been in a similar situation when some of
    our service engineers were very reluctant to modify inner
    software parts of controllers. The guys were not dumb, but with
    such modifications comes responsibility when something unexpected
    happens, like a system crash. So it was more of a legal than a
    technical issue.

    Yes, I'm aware the reason may be different in this case;
    still, Jeff portrayed that situation in a rather clear way:
    they didn't want to use Machine Forth just because "they
    were paid for ANS Forth programming"; they had signed a kind
    of agreement for that, and therefore "weren't interested" in
    any changes etc.

    Unfortunately we won't have any opportunity anymore to ask
    Jeff for more details.

    --

    Sounds like a management failure; they should have mandated that
    Machine Forth was to be used when the programmers were hired.

    --
    Gerry

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to LIT on Tue Feb 25 18:04:30 2025
    zbigniew2011@gmail.com (LIT) writes:
    I know nothing about Machine Forth.
    BTW: is it available for download anywhere (if not
    commercial/restricted)?

    I think it's the compiler part of colorForth
    <https://colorforth.github.io/cf.htm>. Looking around a bit I find
    <https://colorforth.github.io/forth.html>, which shows how machine
    Forth primitives are compiled.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to LIT on Tue Feb 25 22:35:42 2025
    zbigniew2011@gmail.com (LIT) writes:
    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms

    Too much OOS thinking? Try

    V1 @ V2 @ V1 ! V2 !

    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    The question is how often you use these new words in applications.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to LIT on Wed Feb 26 00:50:52 2025
    LIT <zbigniew2011@gmail.com> wrote:
    So I did some quite basic testing with x86
    fig-Forth for DOS. I devised 4 OOS words:

    :=: (exchange values of two variables)
    pop BX             ; BX = addr2
    pop DI             ; DI = addr1
    mov AX,[BX]
    xchg AX,[DI]
    mov [BX],AX
    jmp NEXT

    ++ (increment variable by one)
    pop BX
    inc WORD PTR [BX]
    jmp NEXT

    -- (similar to the above, just uses dec; not tested, it'll give
    the same result)

    +> (add two variables, then store the result into a third one)
    pop DI             ; DI = addr3
    pop BX
    mov CX,[BX]        ; CX = contents of addr2
    pop BX
    mov AX,[BX]        ; AX = contents of addr1
    add AX,CX
    mov [DI],AX        ; store the sum at addr3
    jmp NEXT

    How the simplistic tests have been done:

    7 VARIABLE V1
    8 VARIABLE V2
    9 VARIABLE V3

    : TOOK ( t1 t2 -- )
    DROP SPLIT TIME@ DROP SPLIT
    ROT SWAP - CR ." It took " U. ." seconds and "
    - 10 * U. ." milliseconds "
    ;

    : TEST1
    1000 0 DO 10000 0 DO
    ...expression...
    LOOP LOOP
    ;

    0 0 TIME! TIME@ TEST1 TOOK

    The results are (for the following expressions):

    V1 @ V2 @ + V3 !                - 25s 430ms
    V1 V2 V3 +>                     - 17s 240ms

    1 V1 +!                         - 14s  60ms
    V1 ++                           - 10s 820ms

    V1 @ V3 ! V2 @ V1 ! V3 @ V2 !   - 40s 150ms
    V1 V2 :=:                       - 15s 260ms

    So there is a noticeable difference indeed.

    If your expected use case is operations on variables, then
    what you gain is merging @ and ! into the operations. Since
    you still have variables, the gain is at most a factor of 2
    (you replace things like V1 @ by plain V1). The cost is the
    need for several extra operations. A potential alternative is
    a pair of operations, say PUSH and POP, and a Forth compiler
    that replaces a pair like V1 @ by PUSH(V1). Note that here
    the address of V1 is intended to be part of PUSH (so it will
    take as much space as separate V1 and @, but is only a
    single primitive).
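
    On a fig-Forth-style 8086 system such a PUSH, with the variable's
    address inline in the thread, might look like this (a sketch of
    the idea, untested):

    (PUSH):
          lodsw                 ; fetch inline address, advance IP
          mov BX,AX
          push WORD PTR [BX]    ; one primitive instead of two
          jmp NEXT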

    More generally, a simple "optimizer" that replaces short
    sequences of Forth primitives by different, shorter sequences
    of primitives is likely to give a similar gain. However, the
    chance of a match decreases with the length of the sequence.
    Above you bet on relatively long sequences (and on the
    programmer writing the alternative sequence). Shorter sequences
    have more chance of matching, so you need a smaller number of
    them for a similar gain.
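
    Such an optimizer can be tiny. A toy version in standard Forth
    (a sketch only; all names hypothetical) with a single rule that
    rewrites the xt pair DUP * into the single xt SQUARE:

    : SQUARE ( n -- n*n )  DUP * ;

    CREATE XTBUF 17 CELLS ALLOT   \ one spare cell for the look-ahead
    VARIABLE XT#                  \ number of xts in XTBUF
    VARIABLE SRC   VARIABLE DST

    : XT@I ( i -- xt )  CELLS XTBUF + @ ;
    : XT!I ( xt i -- )  CELLS XTBUF + ! ;

    : PAIR? ( -- f )              \ is the pair DUP * at SRC ?
      SRC @ 1+ XT# @ <
      SRC @    XT@I ['] DUP =  AND
      SRC @ 1+ XT@I ['] *   =  AND ;

    : PEEP ( -- )                 \ one pass, rewrites XTBUF in place
      0 SRC !  0 DST !
      BEGIN  SRC @ XT# @ <  WHILE
        PAIR? IF  ['] SQUARE DST @ XT!I  2 SRC +!
        ELSE      SRC @ XT@I  DST @ XT!I  1 SRC +!
        THEN  1 DST +!
      REPEAT
      DST @ XT# ! ;

    \ e.g.  ' DUP 0 XT!I  ' * 1 XT!I  2 XT# !  PEEP  ( XT# @ is 1 )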

    One more thing: while simple memory-to-memory operations appear
    with some frequency, the rather typical pattern is expressions
    that produce some value that is immediately used by another
    operation, and the stack is a very good fit for such use. One
    can do better than using the machine stack, namely keeping
    things in registers, but that means generating machine code and
    doing optimization. OTOH, on 64-bit machines machine code is
    very natural: machine instructions are typically smaller than
    machine words (which are the natural unit for threaded code),
    and Forth primitives are likely to produce a very small number
    of instructions.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)