• Generating a random sequence of Forth words

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Sep 30 16:33:50 2025
    From Newsgroup: comp.lang.forth

    I wanted to generate a random sequence consisting of a number, NOOP,
    and DROP. When EVALUATEing the sequence, starting with an empty
    stack, it should end with an empty stack, and should not underflow the
    stack.

    I implemented this twice. The first attempt just generated a random word/number in the normal case, and a specific kind of word/number in
    the boundary cases:

    : genseq {: u -- :}
    0 u 0 do
    case
    dup 0= ?of 1+ ." 9 " endof
    dup i + u u>= ?of 1- ." drop " endof
    dup i + 1+ u u>= ?of ." noop " endof
    rnd 3 um* nip 1- dup case
    -1 of ." drop " endof
    0 of ." noop " endof
    1 of ." 8 " endof
    endcase
    +
    0 endcase
    loop
    drop ;

    When I do 16 GENSEQ with seed 1234, I get the following sequence:

    9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop

    The requirements are distributed across several places in this word,
    which I did not like. I also did not like the non-randomness of the
    sequence of DROPs at the end, and the non-randomness of the 9 whenever
    the stack is empty. I also found that I needed to add a clause for
    when the stack depth is one less than the number of remaining words.

    So I thought about other ways to do it.

    Eventually I came up with the following approach: Starts with counters
    for the different kinds of words/numbers. Whenever one word/number is generated, its counter is reduced. At each step the probability of
    generating a kind of word/number is proportional to the remaining
    counter for that word/number. The same-depth requirement is
    implemented by having the same initial counter for the number and
    DROP. The no-underflow requirement is implemented by skipping the
    output of a DROP if the count of remaining DROPs = the count of the
    remaining numbers.

    Here's the code:

    create counts 3 cells allot

    : initcounts ( u -- )
    dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;

    : .counts ( -- )
    counts 3 cell array>mem mem+do
    i @ 4 .r
    loop ;

    : countsth-- ( u -- )
    -1 counts rot th +! ;

    : select1 ( u -- )
    \ u is the index of the word
    dup case
    0 of ." 9 " countsth-- endof
    1 of ." noop " countsth-- endof
    2 of counts @ counts 2 th @ u< if
    ." drop " countsth--
    else
    drop endif
    endof
    endcase ;

    : sum ( addr u -- u2 )
    \ addr u is an array with u elements, u2 is the sum of the elements
    0 -rot cell array>mem mem+do
    i @ +
    loop ;

    : genseq ( u -- )
    initcounts
    begin
    counts 3 sum dup while
    rnd um* nip ( sum )
    3 0 u+do ( sum1 )
    counts i th @ 2dup u< if
    i select1 then
    - loop
    drop
    repeat
    drop ;

    The code is much longer, but I like it better. The requirements are concentrated in two places.

    16 GENSEQ generates

    9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop

    for seed 123. A disadvantage of this approach is that it does not
    generate a word/number on each iteration through the main loop. I
    have thought about how to achieve this, but the result would be quite
    a bit more complex; it's also only a minor issue. E.g., for
    generating 1000 words/numbers with seed 123, the main loop has 1036
    iterations.

    The second approach takes quite a bit more time; for generating a
    million words/numbers into a string on a Zen 4, 192_876_892 cycles
    compared to 146_644_156 cycles.

    The original goal was to produce a sequence where various parts of the
    text interpreter have branch mispredictions; the less predictable the
    sequence, the better. In this respect the second sequence is
    significantly better. For a sequence of 1000 words/numbers,
    text-interpreted 1000 times, the number of mispredictions is:

    728_322 first approach
    821_813 second approach

    (with 247_125 mispredictions coming from (mainly) Gforth startup).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net to comp.lang.forth on Wed Oct 1 11:20:56 2025
    From Newsgroup: comp.lang.forth

    Am 30.09.2025 um 18:33 schrieb Anton Ertl:
    I wanted to generate a random sequence consisting of a number, NOOP,
    and DROP. When EVALUATEing the sequence, starting with an empty
    stack, it should end with an empty stack, and should not underflow the
    stack.

    I implemented this twice. The first attempt just generated a random word/number in the normal case, and a specific kind of word/number in
    the boundary cases:

    : genseq {: u -- :}
    0 u 0 do
    case
    dup 0= ?of 1+ ." 9 " endof
    dup i + u u>= ?of 1- ." drop " endof
    dup i + 1+ u u>= ?of ." noop " endof
    rnd 3 um* nip 1- dup case
    -1 of ." drop " endof
    0 of ." noop " endof
    1 of ." 8 " endof
    endcase
    +
    0 endcase
    loop
    drop ;

    When I do 16 GENSEQ with seed 1234, I get the following sequence:

    9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop

    The requirements are distributed across several places in this word,
    which I did not like. I also did not like the non-randomness of the
    sequence of DROPs at the end, and the non-randomness of the 9 whenever
    the stack is empty. I also found that I needed to add a clause for
    when the stack depth is one less than the number of remaining words.

    So I thought about other ways to do it.

    Eventually I came up with the following approach: Starts with counters
    for the different kinds of words/numbers. Whenever one word/number is generated, its counter is reduced. At each step the probability of generating a kind of word/number is proportional to the remaining
    counter for that word/number. The same-depth requirement is
    implemented by having the same initial counter for the number and
    DROP. The no-underflow requirement is implemented by skipping the
    output of a DROP if the count of remaining DROPs = the count of the
    remaining numbers.

    Here's the code:

    create counts 3 cells allot

    : initcounts ( u -- )
    dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;

    : .counts ( -- )
    counts 3 cell array>mem mem+do
    i @ 4 .r
    loop ;

    : countsth-- ( u -- )
    -1 counts rot th +! ;

    : select1 ( u -- )
    \ u is the index of the word
    dup case
    0 of ." 9 " countsth-- endof
    1 of ." noop " countsth-- endof
    2 of counts @ counts 2 th @ u< if
    ." drop " countsth--
    else
    drop endif
    endof
    endcase ;

    : sum ( addr u -- u2 )
    \ addr u is an array with u elements, u2 is the sum of the elements
    0 -rot cell array>mem mem+do
    i @ +
    loop ;

    : genseq ( u -- )
    initcounts
    begin
    counts 3 sum dup while
    rnd um* nip ( sum )
    3 0 u+do ( sum1 )
    counts i th @ 2dup u< if
    i select1 then
    - loop
    drop
    repeat
    drop ;

    The code is much longer, but I like it better. The requirements are concentrated in two places.

    16 GENSEQ generates

    9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop

    for seed 123. A disadvantage of this approach is that it does not
    generate a word/number on each iteration through the main loop. I
    have thought about how to achieve this, but the result would be quite
    a bit more complex; it's also only a minor issue. E.g., for
    generating 1000 words/numbers with seed 123, the main loop has 1036 iterations.

    The second approach takes quite a bit more time; for generating a
    million words/numbers into a string on a Zen 4, 192_876_892 cycles
    compared to 146_644_156 cycles.

    The original goal was to produce a sequence where various parts of the
    text interpreter have branch mispredictions; the less predictable the sequence, the better. In this respect the second sequence is
    significantly better. For a sequence of 1000 words/numbers,
    text-interpreted 1000 times, the number of mispredictions is:

    728_322 first approach
    821_813 second approach

    (with 247_125 mispredictions coming from (mainly) Gforth startup).


    How about
    1) quick generation of a suboptimal list
    2) create an array of pointers to the list elements
    3) random shuffle the array e.g. https://en.wikipedia.org/wiki/Fisher-Yates_shuffle

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Wed Oct 1 17:11:21 2025
    From Newsgroup: comp.lang.forth

    I used something similar - but with a whole slew of stack operations.
    Very handy thingy. I use it until this day.

    ( abc -- abcabc) >r over over r@ rot rot r> \ >r 2dup r@ -rot r>

    Hans Bezemer

    On 30-09-2025 18:33, Anton Ertl wrote:
    I wanted to generate a random sequence consisting of a number, NOOP,
    and DROP. When EVALUATEing the sequence, starting with an empty
    stack, it should end with an empty stack, and should not underflow the
    stack.

    I implemented this twice. The first attempt just generated a random word/number in the normal case, and a specific kind of word/number in
    the boundary cases:

    : genseq {: u -- :}
    0 u 0 do
    case
    dup 0= ?of 1+ ." 9 " endof
    dup i + u u>= ?of 1- ." drop " endof
    dup i + 1+ u u>= ?of ." noop " endof
    rnd 3 um* nip 1- dup case
    -1 of ." drop " endof
    0 of ." noop " endof
    1 of ." 8 " endof
    endcase
    +
    0 endcase
    loop
    drop ;

    When I do 16 GENSEQ with seed 1234, I get the following sequence:

    9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop

    The requirements are distributed across several places in this word,
    which I did not like. I also did not like the non-randomness of the
    sequence of DROPs at the end, and the non-randomness of the 9 whenever
    the stack is empty. I also found that I needed to add a clause for
    when the stack depth is one less than the number of remaining words.

    So I thought about other ways to do it.

    Eventually I came up with the following approach: Starts with counters
    for the different kinds of words/numbers. Whenever one word/number is generated, its counter is reduced. At each step the probability of generating a kind of word/number is proportional to the remaining
    counter for that word/number. The same-depth requirement is
    implemented by having the same initial counter for the number and
    DROP. The no-underflow requirement is implemented by skipping the
    output of a DROP if the count of remaining DROPs = the count of the
    remaining numbers.

    Here's the code:

    create counts 3 cells allot

    : initcounts ( u -- )
    dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;

    : .counts ( -- )
    counts 3 cell array>mem mem+do
    i @ 4 .r
    loop ;

    : countsth-- ( u -- )
    -1 counts rot th +! ;

    : select1 ( u -- )
    \ u is the index of the word
    dup case
    0 of ." 9 " countsth-- endof
    1 of ." noop " countsth-- endof
    2 of counts @ counts 2 th @ u< if
    ." drop " countsth--
    else
    drop endif
    endof
    endcase ;

    : sum ( addr u -- u2 )
    \ addr u is an array with u elements, u2 is the sum of the elements
    0 -rot cell array>mem mem+do
    i @ +
    loop ;

    : genseq ( u -- )
    initcounts
    begin
    counts 3 sum dup while
    rnd um* nip ( sum )
    3 0 u+do ( sum1 )
    counts i th @ 2dup u< if
    i select1 then
    - loop
    drop
    repeat
    drop ;

    The code is much longer, but I like it better. The requirements are concentrated in two places.

    16 GENSEQ generates

    9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop

    for seed 123. A disadvantage of this approach is that it does not
    generate a word/number on each iteration through the main loop. I
    have thought about how to achieve this, but the result would be quite
    a bit more complex; it's also only a minor issue. E.g., for
    generating 1000 words/numbers with seed 123, the main loop has 1036 iterations.

    The second approach takes quite a bit more time; for generating a
    million words/numbers into a string on a Zen 4, 192_876_892 cycles
    compared to 146_644_156 cycles.

    The original goal was to produce a sequence where various parts of the
    text interpreter have branch mispredictions; the less predictable the sequence, the better. In this respect the second sequence is
    significantly better. For a sequence of 1000 words/numbers,
    text-interpreted 1000 times, the number of mispredictions is:

    728_322 first approach
    821_813 second approach

    (with 247_125 mispredictions coming from (mainly) Gforth startup).

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Oct 1 17:10:34 2025
    From Newsgroup: comp.lang.forth

    minforth <minforth@gmx.net> writes:
    Am 30.09.2025 um 18:33 schrieb Anton Ertl:
    Eventually I came up with the following approach: Starts with counters
    for the different kinds of words/numbers. Whenever one word/number is
    generated, its counter is reduced. At each step the probability of
    generating a kind of word/number is proportional to the remaining
    counter for that word/number. The same-depth requirement is
    implemented by having the same initial counter for the number and
    DROP. The no-underflow requirement is implemented by skipping the
    output of a DROP if the count of remaining DROPs = the count of the
    remaining numbers.

    Here's the code:

    create counts 3 cells allot

    : initcounts ( u -- )
    dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;

    : .counts ( -- )
    counts 3 cell array>mem mem+do
    i @ 4 .r
    loop ;

    : countsth-- ( u -- )
    -1 counts rot th +! ;

    : select1 ( u -- )
    \ u is the index of the word
    dup case
    0 of ." 9 " countsth-- endof
    1 of ." noop " countsth-- endof
    2 of counts @ counts 2 th @ u< if
    ." drop " countsth--
    else
    drop endif
    endof
    endcase ;

    : sum ( addr u -- u2 )
    \ addr u is an array with u elements, u2 is the sum of the elements
    0 -rot cell array>mem mem+do
    i @ +
    loop ;

    : genseq ( u -- )
    initcounts
    begin
    counts 3 sum dup while
    rnd um* nip ( sum )
    3 0 u+do ( sum1 )
    counts i th @ 2dup u< if
    i select1 then
    - loop
    drop
    repeat
    drop ;
    ...
    How about
    1) quick generation of a suboptimal list
    2) create an array of pointers to the list elements
    3) random shuffle the array e.g. >https://en.wikipedia.org/wiki/Fisher-Yates_shuffle

    I thought about an approach based on this in between, with appropriate
    changes to satisfy the requirements. Then I continued onwards to the
    second approach shown above. The counters are a kind of compressed
    form of "a suboptimal list". Every time you pick one of the elements
    of the suboptimal list, the counter for that kind of element is
    lowered by 1.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net to comp.lang.forth on Wed Oct 1 20:42:37 2025
    From Newsgroup: comp.lang.forth

    Am 01.10.2025 um 17:11 schrieb Hans Bezemer:
    I used something similar - but with a whole slew of stack operations.
    Very handy thingy. I use it until this day.

    ( abc -- abcabc)-a-a >r over over r@ rot rot r> \ >r 2dup r@ -rot r>


    Different and probably not as efficient as your code generator:

    MinForth 3.6
    # : 3DUP { a b c == a b c a b c } ;

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Thu Oct 2 19:49:29 2025
    From Newsgroup: comp.lang.forth

    On 2/10/2025 4:42 am, minforth wrote:
    Am 01.10.2025 um 17:11 schrieb Hans Bezemer:
    I used something similar - but with a whole slew of stack operations. Very handy thingy. I use it until this day.

    ( abc -- abcabc)-a-a >r over over r@ rot rot r> \ >r 2dup r@ -rot r>


    Different and probably not as efficient as your code generator:

    MinForth 3.6
    # : 3DUP { a b c == a b c a b c } ;

    For 3DUP I believe this is the one to beat:

    : 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

    With NTF/LFX the locals version will break even. For others, well, it may
    be better not to look. For a straight-forward example of 'stack juggling', locals handle it rather poorly.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Oct 2 13:07:54 2025
    From Newsgroup: comp.lang.forth

    In article <68de4aaa@news.ausics.net>, dxf <dxforth@gmail.com> wrote:
    On 2/10/2025 4:42 am, minforth wrote:
    Am 01.10.2025 um 17:11 schrieb Hans Bezemer:
    I used something similar - but with a whole slew of stack operations.
    Very handy thingy. I use it until this day.

    ( abc -- abcabc)-a-a >r over over r@ rot rot r> \ >r 2dup r@ -rot r>


    Different and probably not as efficient as your code generator:

    MinForth 3.6
    # : 3DUP { a b c == a b c a b c } ;

    For 3DUP I believe this is the one to beat:

    : 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

    CODE 3DUP
    PUSH RSP[3*CELL_SIZE]
    PUSH RSP[3*CELL_SIZE]
    PUSH RSP[3*CELL_SIZE]
    NEXT,

    How does that look for 5DUP ?


    With NTF/LFX the locals version will break even. For others, well, it may
    be better not to look. For a straight-forward example of 'stack juggling', >locals handle it rather poorly.

    Groetjes Albert
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Oct 2 20:44:40 2025
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    For 3DUP I believe this is the one to beat:

    : 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

    With NTF/LFX the locals version will break even.

    As we already discussed in the thread including <2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
    same (optimal for the calling convention used by NTF/LXF) code for
    3DUP versions using the data stack, return stack, or locals. That's
    because the actual data flow is always the same, and NTF/LXF can see
    this data flow in all three cases.

    For others, well, it may
    be better not to look. For a straight-forward example of 'stack juggling', >locals handle it rather poorly.

    Other Forth systems implement locals poorly. LXF/NTF demonstrates
    that this is not due to some natural law, however.

    There have been some improvements in Gforth since that time. Let's
    see how the versions used in that thread look on today's gforth-fast.
    Here are the versions of 3DUP:

    : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    : 3dup.3 {: a b c :} a b c a b c ;
    : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

    And here's the gforth-fast code on AMD64:

    3dup.1 3dup.2 3dup.3 3dup.4
    r 1->0 third 1->2 >l >l 1->1 dup 1->1
    mov -$08[r14],r13 mov r15,$10[r10] >l 1->1 mov [r10],r13
    sub r14,$08 third 2->3 mov -$08[rbp],r13 sub r10,$08 2dup 0->2 mov r9,$08[r10] mov rdx,$08[r10] 2over 1->3
    mov r13,$10[r10] third 3->1 mov rax,rbp mov r15,$18[r10
    mov r15,$08[r10] mov [r10],r13 add r10,$10 mov r9,$10[r10]
    i 2->3 sub r10,$18 lea rbp,-$10[rbp] rot 3->1
    mov r9,[r14] mov $10[r10],r15 mov -$10[rax],rdx mov [r10],r15 -rot 3->2 mov $08[r10],r9 mov r13,[r10] sub r10,$10
    mov [r10],r9 ;s 1->1 >l @local0 1->1 mov $08[r10],r9
    sub r10,$08 mov rbx,[r14] @local0 1->1 ;s 1->1
    2->1 add r14,$08 mov rax,rbp mov rbx,[r14]
    mov -$08[r10],r15 mov rax,[rbx] lea rbp,-$08[rbp] add r14,$08
    sub r10,$10 jmp eax mov -$08[rax],r13 mov rax,[rbx]
    mov $10[r10],r13 @local1 1->2 jmp eax
    mov r13,[r14] mov r15,$08[rbp]
    add r14,$08 @local2 2->1
    ;s 1->1 mov -$08[r10],r15
    mov rbx,[r14] sub r10,$10
    add r14,$08 mov $10[r10],r13
    mov rax,[rbx] mov r13,$10[rbp]
    jmp eax @local0 1->2
    mov r15,$00[rbp]
    @local1 2->3
    mov r9,$08[rbp]
    @local2 3->1
    mov -$10[r10],r9
    sub r10,$18
    mov $10[r10],r15
    mov $18[r10],r13
    mov r13,$10[rbp]
    lit 1->2
    #24
    mov r15,$50[rbx]
    lp+! 2->1
    add rbp,r15
    ;s 1->1
    mov rbx,[r14]
    add r14,$08
    mov rax,[rbx]
    jmp eax

    Locals-haters, come to Gforth, where locals are implemented
    inefficiently:-). The code for 3DUP.2 is actually optimal for
    Gforth's calling convention.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Fri Oct 3 18:22:37 2025
    From Newsgroup: comp.lang.forth

    On 2/10/2025 9:07 pm, albert@spenarnc.xs4all.nl wrote:
    In article <68de4aaa@news.ausics.net>, dxf <dxforth@gmail.com> wrote:
    On 2/10/2025 4:42 am, minforth wrote:
    Am 01.10.2025 um 17:11 schrieb Hans Bezemer:
    I used something similar - but with a whole slew of stack operations.
    Very handy thingy. I use it until this day.

    ( abc -- abcabc)-a-a >r over over r@ rot rot r> \ >r 2dup r@ -rot r>


    Different and probably not as efficient as your code generator:

    MinForth 3.6
    # : 3DUP { a b c == a b c a b c } ;

    For 3DUP I believe this is the one to beat:

    : 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

    CODE 3DUP
    PUSH RSP[3*CELL_SIZE]
    PUSH RSP[3*CELL_SIZE]
    PUSH RSP[3*CELL_SIZE]
    NEXT,
    ...

    That's equivalent to:

    : 3DUP 2 PICK 2 PICK 2 PICK ;

    I didn't notice but it's how VFX defines it. OTOH DUP 2OVER ROT
    optimizes to the same machine code. IIRC I tested them all on
    DX-Forth and settled on the latter. An 8086 code version would
    beat it. An 8080 code version would be more trouble than worth.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Fri Oct 3 11:02:33 2025
    From Newsgroup: comp.lang.forth

    In article <2025Oct2.224440@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    dxf <dxforth@gmail.com> writes:
    For 3DUP I believe this is the one to beat:

    : 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

    With NTF/LFX the locals version will break even.

    As we already discussed in the thread including ><2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
    same (optimal for the calling convention used by NTF/LXF) code for
    3DUP versions using the data stack, return stack, or locals. That's
    because the actual data flow is always the same, and NTF/LXF can see
    this data flow in all three cases.

    The problem with 3DUP is that it is actually used in context.
    What is the data that is going to 3DUP ped? In my view this
    amounts to double use of data that is in registers (32 in the riscv)
    anyway, after an optimiser does his thing.

    <SNIP>

    - anton
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net to comp.lang.forth on Fri Oct 3 11:09:07 2025
    From Newsgroup: comp.lang.forth

    Am 03.10.2025 um 11:02 schrieb albert@spenarnc.xs4all.nl:
    In article <2025Oct2.224440@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    dxf <dxforth@gmail.com> writes:
    For 3DUP I believe this is the one to beat:

    : 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

    With NTF/LFX the locals version will break even.

    As we already discussed in the thread including
    <2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
    same (optimal for the calling convention used by NTF/LXF) code for
    3DUP versions using the data stack, return stack, or locals. That's
    because the actual data flow is always the same, and NTF/LXF can see
    this data flow in all three cases.

    The problem with 3DUP is that it is actually used in context.
    What is the data that is going to 3DUP ped? In my view this
    amounts to double use of data that is in registers (32 in the riscv)
    anyway, after an optimiser does his thing.


    Code inlining will mend it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Oct 4 08:04:09 2025
    From Newsgroup: comp.lang.forth

    minforth <minforth@gmx.net> writes:
    Am 03.10.2025 um 11:02 schrieb albert@spenarnc.xs4all.nl:
    The problem with 3DUP is that it is actually used in context.
    What is the data that is going to 3DUP ped? In my view this
    amounts to double use of data that is in registers (32 in the riscv)
    anyway, after an optimiser does his thing.


    Code inlining will mend it.

    Inlining is important for Forth, but it does not make what has been
    called an "analytical optimizer" unnecessary; on the contraray,
    inlining increases the benefit we get from the analytical optimizer.
    E.g., let's consider

    : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    : 3dup.3 {: a b c :} a b c a b c ;
    : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

    : foo.1 3dup.1 + ! ;
    : foo.2 3dup.2 + ! ;
    : foo.3 3dup.3 + ! ;
    : foo.4 3dup.4 + ! ;

    The result produced by VFX64 is:

    foo.1 foo.2 foo.3 foo.4
    PUSH EBX MOV EDX, EBX CALL 3DUP.3 MOV EDX, EBX
    MOV EBX, [ESP] ADD EBX, [EBP] ADD EBX, [EBP] ADD EBX, [EBP]
    POP EDX MOV ECX, [EBP+04] MOV EDX, [EBP+04] MOV ECX, [EBP+04]
    ADD EDX, [EBP] MOV 0 [EBX], ECX MOV 0 [EBX], EDX MOV 0 [EBX], ECX
    MOV ECX, [EBP+04] MOV EBX, EDX MOV EBX, [EBP+08] MOV EBX, EDX
    MOV 0 [EDX], ECX NEXT, LEA EBP, [EBP+0C] NEXT, NEXT, NEXT,

    VFX is only analytical about the data stack, and as a consequence, the implementations of 3dup that only use the data stack work best. When
    the return stack is used, as in 3dup.1/foo.1, VFX produces
    instructions (PUSH for >R, MOV ..., [ESP] for R@ and POP for R>) for
    the return-stack operations. When locals are used, VFX actually
    disables inlining and just calls 3DUP.3.

    Other Forth systems make too little use of inlining, and I have to
    resort to macros to simulate it. We cannot use proper macros for
    3dup.3 (the locals-using variant), so I used EVALUATE-based macros;
    this is just for experimental use, not for production, don't do this
    at home:-)

    Let's see what VFX64 produces for FOO.3 with this:

    FOO.3
    ( 080C0C50 8BD4 ) MOV EDX, ESP
    ( 080C0C52 FF7504 ) PUSH [EBP+04]
    ( 080C0C55 FF7500 ) PUSH [EBP]
    ( 080C0C58 53 ) PUSH EBX
    ( 080C0C59 52 ) PUSH EDX
    ( 080C0C5A 57 ) PUSH EDI
    ( 080C0C5B 8BFC ) MOV EDI, ESP
    ( 080C0C5D 81EC00000000 ) SUB ESP, 00000000
    ( 080C0C63 8B5D08 ) MOV EBX, [EBP+08]
    ( 080C0C66 8D6D0C ) LEA EBP, [EBP+0C]
    ( 080C0C69 8B5708 ) MOV EDX, [EDI+08]
    ( 080C0C6C 03570C ) ADD EDX, [EDI+0C]
    ( 080C0C6F 8B4F08 ) MOV ECX, [EDI+08]
    ( 080C0C72 8B470C ) MOV EAX, [EDI+0C]
    ( 080C0C75 8D6DEC ) LEA EBP, [EBP+-14]
    ( 080C0C78 894D04 ) MOV [EBP+04], ECX
    ( 080C0C7B 894508 ) MOV [EBP+08], EAX
    ( 080C0C7E 8B4F10 ) MOV ECX, [EDI+10]
    ( 080C0C81 894D0C ) MOV [EBP+0C], ECX
    ( 080C0C84 895D10 ) MOV [EBP+10], EBX
    ( 080C0C87 8BDA ) MOV EBX, EDX
    ( 080C0C89 8B5710 ) MOV EDX, [EDI+10]
    ( 080C0C8C 895500 ) MOV [EBP], EDX
    ( 080C0C8F 8B5500 ) MOV EDX, [EBP]
    ( 080C0C92 8913 ) MOV 0 [EBX], EDX
    ( 080C0C94 8B5D04 ) MOV EBX, [EBP+04]
    ( 080C0C97 8D6D08 ) LEA EBP, [EBP+08]
    ( 080C0C9A 8B6704 ) MOV ESP, [EDI+04]
    ( 080C0C9D 8B3F ) MOV EDI, 0 [EDI]
    ( 080C0C9F C3 ) NEXT,

    So inlining did not mend that.

    Here's what lxf produces:

    foo.1 foo.2 foo.3 foo.4
    mov eax , ebx mov eax , ebx mov eax , ebx mov eax , ebx
    add eax , [ebp] add eax , [ebp] add eax , [ebp] add eax , [ebp]
    mov ecx , [ebp+4h] mov ecx , [ebp+4h] mov ecx , [ebp+4h] mov ecx , [ebp+4h]
    mov [eax] , ecx mov [eax] , ecx mov [eax] , ecx mov [eax] , ecx
    ret near ret near ret near ret near

    So, because lxf is analytical about the return stack (and, through
    that, about locals), inlining produces the same very good code in all
    these cases.

    You may notice that lxf produces a register-register move less than
    VFX does for FOO.2/FOO.4. That's because VFX decided to modify the
    TOS register (and has to restore it later), whereas lxf decided to
    modify a copy of that register. One would have to make additional
    observations to determine if lxf was just lucky here or if it
    consistently makes the right decision in such cases.

    And here's the code that gforth-fast (which does not have an
    analytical optimizer) produces:

    foo.1 foo.2 foo.3 foo.4
    r 1->0 third 1->1 >l >l 1->1 dup 1->1
    mov -8[r14],r13 mov [r10],r13 >l 1->1 mov [r10],r13
    sub r14,$08 sub r10,$08 mov -$08[rbp],r13 sub r10,$08 2dup 0->2 mov r13,$18[r10] mov rdx,$08[r10] 2over 1->3
    mov r13,$10[r10] third 1->2 mov rax,rbp mov r15,$18[r10]
    mov r15,$08[r10] mov r15,$10[r10] add r10,$10 mov r9,$10[r10]
    i 2->3 third 2->3 lea rbp,-$10[rbp] rot 3->3
    mov r9,[r14] mov r9,$08[r10] mov -$10[rax],rdx mov rax,r13 -rot 3->2 + 3->2 mov r13,[r10] mov r13,r15
    mov [r10],r9 add r15,r9 >l @local0 1->1 mov r15,r9
    sub r10,$08 ! 2->0 @local0 1->1 mov r9,rax
    2->3 mov [r15],r13 mov rax,rbp + 3->2
    mov r9,[r14] ;s 0->1 lea rbp,-$08[rbp] add r15,r9
    add r14,$08 mov r13,$08[r10] mov -$08[rax],r13 ! 2->0
    + 3->2 add r10,$08 @local1 1->2 mov [r15],r13
    add r15,r9 mov rbx,[r14] mov r15,$08[rbp] ;s 0->1
    ! 2->0 add r14,$08 @local2 2->3 mov r13,$08[r10]
    mov [r15],r13 mov rax,[rbx] mov r9,$10[rbp] add r10,$08 ;s 0->1 jmp eax @local0 3->1 mov rbx,[r14]
    mov r13,$08[r10] mov -$10[r10],r9 add r14,$08
    add r10,$08 sub r10,$18 mov rax,[rbx]
    mov rbx,[r14] mov $10[r10],r15 jmp eax
    add r14,$08 mov $18[r10],r13
    mov rax,[rbx] mov r13,$00[rbp]
    jmp eax @local1 1->2
    mov r15,$08[rbp]
    @local2 2->3
    mov r9,$10[rbp]
    + 3->2
    add r15,r9
    ! 2->0
    mov [r15],r13
    lit 0->1
    #24
    mov r13,$60[rbx]
    lp+! 1->1
    add r10,$08
    add rbp,r13
    mov r13,[r10]
    ;s 1->1
    mov rbx,[r14]
    add r14,$08
    mov rax,[rbx]
    jmp eax

    Here inlining helps a little, but the disadvantages of the approach
    are still obvious. With less optimization (e.g., no stack caching),
    inlining would have helped even less.

    And while we are at it, here's SwiftForth:


    foo.1 foo.2 foo.3 foo.4
    RBX PUSH -8 [RBP] RBP LEA -8 [RBP] RBP LEA -8 [RBP] RBP LEA
    0 [RBP] RBX MOV RBX 0 [RBP] MOV RBX 0 [RBP] MOV RBX 0 [RBP] MOV
    8 [RBP] RBP LEA 10 [RBP] RBX MOV 18 # EBX MOV -10 [RBP] RBP LEA -10 [RBP] RBP LEA -8 [RBP] RBP LEA LSPACE CALL RBX 8 [RBP] MOV
    RBX 8 [RBP] MOV RBX 0 [RBP] MOV RBX 10 [R13] MOV 20 [RBP] RAX MOV
    10 [RBP] RAX MOV 10 [RBP] RBX MOV 0 [RBP] RBX MOV RAX 0 [RBP] MOV
    RAX 0 [RBP] MOV -8 [RBP] RBP LEA 8 [RBP] RBP LEA 18 [RBP] RBX MOV
    -8 [RBP] RBP LEA RBX 0 [RBP] MOV RBX 8 [R13] MOV RBX RCX MOV
    RBX 0 [RBP] MOV 10 [RBP] RBX MOV 0 [RBP] RBX MOV 8 [RBP] RBX MOV
    0 [RSP] RBX MOV 0 [RBP] RAX MOV 8 [RBP] RBP LEA 0 [RBP] RAX MOV
    RBX RCX MOV 8 [RBP] RCX MOV RBX 0 [R13] MOV RAX 8 [RBP] MOV
    8 [RBP] RAX MOV RCX 0 [RBX] [RAX] MOV 0 [RBP] RBX MOV RCX 0 [RBP] MOV
    0 [RBP] RBX MOV 10 [RBP] RBX MOV 8 [RBP] RBP LEA 0 [RBP] RAX MOV
    RAX 0 [RBP] MOV 18 [RBP] RBP LEA -8 [RBP] RBP LEA 8 [RBP] RCX MOV
    RCX 8 [RBP] MOV RET RBX 0 [RBP] MOV RCX 0 [RBX] [RAX] MOV
    RAX POP 0 [R13] RBX MOV 10 [RBP] RBX MOV
    RAX RBX ADD -8 [RBP] RBP LEA 18 [RBP] RBP LEA
    0 [RBP] RAX MOV RBX 0 [RBP] MOV RET
    RAX 0 [RBX] MOV 8 [R13] RBX MOV
    8 [RBP] RBX MOV -8 [RBP] RBP LEA
    10 [RBP] RBP LEA RBX 0 [RBP] MOV
    RET 10 [R13] RBX MOV
    -8 [RBP] RBP LEA
    RBX 0 [RBP] MOV
    0 [R13] RBX MOV
    -8 [RBP] RBP LEA
    RBX 0 [RBP] MOV
    8 [R13] RBX MOV
    -8 [RBP] RBP LEA
    RBX 0 [RBP] MOV
    10 [R13] RBX MOV
    0 [RBP] RAX MOV
    8 [RBP] RCX MOV
    RCX 0 [RBX] [RAX] MOV
    10 [RBP] RBX MOV
    18 [RBP] RBP LEA
    RET


    And finally, iForth:

    foo.1/foo.4 foo.2 foo.3
    pop rbx mov rbx, [rsp #16 +] qword pop rbx
    pop rdi mov rcx, rbx lea rsi, [rsi #-16 +] qword pop rax mov rbx, [rsp 8 +] qword mov [esi] dword, rbx
    mov [edi ebx*1]dword,rax push rcx pop rbx
    push rax mov rcx, rbx lea rsi, [rsi #-16 +] qword push rdi mov rbx, [rsp 8 +] qword mov [esi] dword, rbx
    push rbx pop rdi pop rbx
    ; mov [ecx ebx*1] dword, rdi lea rsi, [rsi #-16 +] qword
    ; mov [esi] dword, rbx
    mov rbx, [rsi #16 +] qword
    add rbx, [rsi #32 +] qword
    mov rdi, [rsi] qword
    mov rax, [rsi #16 +] qword
    mov rdx, [rsi #32 +] qword
    mov [ebx] dword, rdi
    push rdi
    push rax
    push rdx
    add rsi, #48 b#
    ;

    It's interesting that Gforth produced the same code for FOO.1 and
    FOO.4, but different code for FOO.2. Both variants are suboptimal
    IMO, because the contain unnecessary pushes.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sun Oct 5 11:29:41 2025
    From Newsgroup: comp.lang.forth

    On 02-10-2025 22:44, Anton Ertl wrote:
    Locals-haters, come to Gforth, where locals are implemented inefficiently:-). The code for 3DUP.2 is actually optimal for
    Gforth's calling convention.

    I think I can beat you at that :)

    1. This is the LOCAL definition

    Addr| Opcode Operand Argument

    35| branch 46 local
    36| r> 0
    37| swap 0
    38| dup 0
    39| >r 0
    40| @ 0
    41| >r 0
    42| execute 0
    43| r> 0
    44| r> 0
    45| ! 0
    46| exit 0

    2. This is the 3DUP with locals


    Addr| Opcode Operand Argument

    54| branch 82 3dup.1
    55| literal 0
    56| to 0 a
    57| literal 0
    58| environ 1
    59| + 0
    60| call 35 local
    61| literal 0
    62| to 1 b
    63| literal 1
    64| environ 1
    65| + 0
    66| call 35 local
    67| literal 0
    68| to 2 c
    69| literal 2
    70| environ 1
    71| + 0
    72| call 35 local
    73| to 2 c
    74| to 1 b
    75| to 0 a
    76| value 0 a
    77| value 1 b
    78| value 2 c
    79| value 0 a
    80| value 1 b
    81| value 2 c
    82| exit 0

    3. This is 3DUP *without* locals

    Addr| Opcode Operand Argument

    83| branch 91 3dup.2
    84| >r 0
    85| over 0
    86| over 0
    87| r@ 0
    88| rot 0
    89| rot 0
    90| r> 0
    91| exit 0

    And this is the sourcecode:

    include lib/anstools.4th
    include 4pp/lib/alocals.4pp

    : clear depth 0 ?do drop loop ;
    : 3dup.1 {: a b c -- a b c a b c :} a b c a b c ;
    : 3dup.2 >r 2dup r@ -rot r> ;

    1 2 3 3dup.1 .s clear
    4 5 6 3dup.2 .s clear

    Yeah, heavy use of the preprocessor. This is the expanded source:

    : 3dup.1
    [UNDEFINED] a [IF] 0 value a [THEN] ['] a >body local
    [UNDEFINED] b [IF] 0 value b [THEN] ['] b >body local
    [UNDEFINED] c [IF] 0 value c [THEN] ['] c >body local
    to c to b to a a b c a b c ;

    Maybe now the decompilation makes sense ;-)

    Hans Bezemer

    dxf <dxforth@gmail.com> writes:
    For 3DUP I believe this is the one to beat:

    : 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

    With NTF/LFX the locals version will break even.

    As we already discussed in the thread including <2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
    same (optimal for the calling convention used by NTF/LXF) code for
    3DUP versions using the data stack, return stack, or locals. That's
    because the actual data flow is always the same, and NTF/LXF can see
    this data flow in all three cases.

    For others, well, it may
    be better not to look. For a straight-forward example of 'stack juggling', >> locals handle it rather poorly.

    Other Forth systems implement locals poorly. LXF/NTF demonstrates
    that this is not due to some natural law, however.

    There have been some improvements in Gforth since that time. Let's
    see how the versions used in that thread look on today's gforth-fast.
    Here are the versions of 3DUP:

    : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    : 3dup.3 {: a b c :} a b c a b c ;
    : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

    And here's the gforth-fast code on AMD64:

    3dup.1 3dup.2 3dup.3 3dup.4
    r 1->0 third 1->2 >l >l 1->1 dup 1->1
    mov -$08[r14],r13 mov r15,$10[r10] >l 1->1 mov [r10],r13
    sub r14,$08 third 2->3 mov -$08[rbp],r13 sub r10,$08 2dup 0->2 mov r9,$08[r10] mov rdx,$08[r10] 2over 1->3
    mov r13,$10[r10] third 3->1 mov rax,rbp mov r15,$18[r10
    mov r15,$08[r10] mov [r10],r13 add r10,$10 mov r9,$10[r10] i 2->3 sub r10,$18 lea rbp,-$10[rbp] rot 3->1
    mov r9,[r14] mov $10[r10],r15 mov -$10[rax],rdx mov [r10],r15 -rot 3->2 mov $08[r10],r9 mov r13,[r10] sub r10,$10
    mov [r10],r9 ;s 1->1 >l @local0 1->1 mov $08[r10],r9
    sub r10,$08 mov rbx,[r14] @local0 1->1 ;s 1->1
    2->1 add r14,$08 mov rax,rbp mov rbx,[r14]
    mov -$08[r10],r15 mov rax,[rbx] lea rbp,-$08[rbp] add r14,$08
    sub r10,$10 jmp eax mov -$08[rax],r13 mov rax,[rbx]
    mov $10[r10],r13 @local1 1->2 jmp eax
    mov r13,[r14] mov r15,$08[rbp]
    add r14,$08 @local2 2->1
    ;s 1->1 mov -$08[r10],r15
    mov rbx,[r14] sub r10,$10
    add r14,$08 mov $10[r10],r13
    mov rax,[rbx] mov r13,$10[rbp]
    jmp eax @local0 1->2
    mov r15,$00[rbp]
    @local1 2->3
    mov r9,$08[rbp]
    @local2 3->1
    mov -$10[r10],r9
    sub r10,$18
    mov $10[r10],r15
    mov $18[r10],r13
    mov r13,$10[rbp]
    lit 1->2
    #24
    mov r15,$50[rbx]
    lp+! 2->1
    add rbp,r15
    ;s 1->1
    mov rbx,[r14]
    add r14,$08
    mov rax,[rbx]
    jmp eax


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.lang.forth on Wed Oct 15 19:19:56 2025
    From Newsgroup: comp.lang.forth

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    I wanted to generate a random sequence consisting of a number, NOOP,
    and DROP. When EVALUATEing the sequence, starting with an empty
    stack, it should end with an empty stack, and should not underflow the
    stack.

    I implemented this twice. The first attempt just generated a random word/number in the normal case, and a specific kind of word/number in
    the boundary cases:

    : genseq {: u -- :}
    0 u 0 do
    case
    dup 0= ?of 1+ ." 9 " endof
    dup i + u u>= ?of 1- ." drop " endof
    dup i + 1+ u u>= ?of ." noop " endof
    rnd 3 um* nip 1- dup case
    -1 of ." drop " endof
    0 of ." noop " endof
    1 of ." 8 " endof
    endcase
    +
    0 endcase
    loop
    drop ;

    When I do 16 GENSEQ with seed 1234, I get the following sequence:

    9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop

    The requirements are distributed across several places in this word,
    which I did not like. I also did not like the non-randomness of the
    sequence of DROPs at the end, and the non-randomness of the 9 whenever
    the stack is empty. I also found that I needed to add a clause for
    when the stack depth is one less than the number of remaining words.

    So I thought about other ways to do it.

    Eventually I came up with the following approach: Starts with counters
    for the different kinds of words/numbers. Whenever one word/number is generated, its counter is reduced. At each step the probability of generating a kind of word/number is proportional to the remaining
    counter for that word/number. The same-depth requirement is
    implemented by having the same initial counter for the number and
    DROP. The no-underflow requirement is implemented by skipping the
    output of a DROP if the count of remaining DROPs = the count of the
    remaining numbers.

    Here's the code:

    create counts 3 cells allot

    : initcounts ( u -- )
    dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;

    : .counts ( -- )
    counts 3 cell array>mem mem+do
    i @ 4 .r
    loop ;

    : countsth-- ( u -- )
    -1 counts rot th +! ;

    : select1 ( u -- )
    \ u is the index of the word
    dup case
    0 of ." 9 " countsth-- endof
    1 of ." noop " countsth-- endof
    2 of counts @ counts 2 th @ u< if
    ." drop " countsth--
    else
    drop endif
    endof
    endcase ;

    : sum ( addr u -- u2 )
    \ addr u is an array with u elements, u2 is the sum of the elements
    0 -rot cell array>mem mem+do
    i @ +
    loop ;

    : genseq ( u -- )
    initcounts
    begin
    counts 3 sum dup while
    rnd um* nip ( sum )
    3 0 u+do ( sum1 )
    counts i th @ 2dup u< if
    i select1 then
    - loop
    drop
    repeat
    drop ;

    The code is much longer, but I like it better. The requirements are concentrated in two places.

    16 GENSEQ generates

    9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop

    for seed 123. A disadvantage of this approach is that it does not
    generate a word/number on each iteration through the main loop. I
    have thought about how to achieve this, but the result would be quite
    a bit more complex; it's also only a minor issue. E.g., for
    generating 1000 words/numbers with seed 123, the main loop has 1036 iterations.

    The second approach takes quite a bit more time; for generating a
    million words/numbers into a string on a Zen 4, 192_876_892 cycles
    compared to 146_644_156 cycles.

    The original goal was to produce a sequence where various parts of the
    text interpreter have branch mispredictions; the less predictable the sequence, the better. In this respect the second sequence is
    significantly better. For a sequence of 1000 words/numbers,
    text-interpreted 1000 times, the number of mispredictions is:

    728_322 first approach
    821_813 second approach

    (with 247_125 mispredictions coming from (mainly) Gforth startup).

    In related problem I used the following technique. Basic part
    is: at each position you need to generate next step with right
    probablity. How to know probabilities? Just count all possible
    continuations. To do this with reasonable efficiency start
    from the end and observe that legality of a continuation depends
    only on number of elements on the stack, but does not depend on
    sequence of operations that put the items on the stack. Once
    you know counts for all contunuations of length k you can compute
    counts for contunuations of length k + 1 (each possible first step
    leads to new state for which count is known).

    This produces uniform distribution, so in this sense gives
    pretty good randomness. OTOH size of table of counts grows
    quadratically with length, so it may be too costly to
    compute for long sequences. But if one wants several sequences
    of modest length (my case) cost of table is amortized over
    many uses.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2