Forum: Too Lazy BBS

Sysop:	Amessyroom
Location:	Fayetteville, NC
Users:	27
Nodes:	6 (0 / 6)
Uptime:	38:00:19
Calls:	631
Calls today:	2
Files:	1,187
D/L today:	22 files (29,767K bytes)
Messages:	173,681

Generating a random sequence of Forth words

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Sep 30 16:33:50 2025

From Newsgroup: comp.lang.forth

I wanted to generate a random sequence consisting of a number, NOOP,
and DROP. When EVALUATEing the sequence, starting with an empty
stack, it should end with an empty stack, and should not underflow the
stack.

I implemented this twice. The first attempt just generated a random word/number in the normal case, and a specific kind of word/number in
the boundary cases:

: genseq {: u -- :}
0 u 0 do
case
dup 0= ?of 1+ ." 9 " endof
dup i + u u>= ?of 1- ." drop " endof
dup i + 1+ u u>= ?of ." noop " endof
rnd 3 um* nip 1- dup case
-1 of ." drop " endof
0 of ." noop " endof
1 of ." 8 " endof
endcase
+
0 endcase
loop
drop ;

When I do 16 GENSEQ with seed 1234, I get the following sequence:

9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop

The requirements are distributed across several places in this word,
which I did not like. I also did not like the non-randomness of the
sequence of DROPs at the end, and the non-randomness of the 9 whenever
the stack is empty. I also found that I needed to add a clause for
when the stack depth is one less than the number of remaining words.

So I thought about other ways to do it.

Eventually I came up with the following approach: Starts with counters
for the different kinds of words/numbers. Whenever one word/number is generated, its counter is reduced. At each step the probability of
generating a kind of word/number is proportional to the remaining
counter for that word/number. The same-depth requirement is
implemented by having the same initial counter for the number and
DROP. The no-underflow requirement is implemented by skipping the
output of a DROP if the count of remaining DROPs = the count of the
remaining numbers.

Here's the code:

create counts 3 cells allot

: initcounts ( u -- )
dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;

: .counts ( -- )
counts 3 cell array>mem mem+do
i @ 4 .r
loop ;

: countsth-- ( u -- )
-1 counts rot th +! ;

: select1 ( u -- )
\ u is the index of the word
dup case
0 of ." 9 " countsth-- endof
1 of ." noop " countsth-- endof
2 of counts @ counts 2 th @ u< if
." drop " countsth--
else
drop endif
endof
endcase ;

: sum ( addr u -- u2 )
\ addr u is an array with u elements, u2 is the sum of the elements
0 -rot cell array>mem mem+do
i @ +
loop ;

: genseq ( u -- )
initcounts
begin
counts 3 sum dup while
rnd um* nip ( sum )
3 0 u+do ( sum1 )
counts i th @ 2dup u< if
i select1 then
- loop
drop
repeat
drop ;

The code is much longer, but I like it better. The requirements are concentrated in two places.

16 GENSEQ generates

9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop

for seed 123. A disadvantage of this approach is that it does not
generate a word/number on each iteration through the main loop. I
have thought about how to achieve this, but the result would be quite
a bit more complex; it's also only a minor issue. E.g., for
generating 1000 words/numbers with seed 123, the main loop has 1036
iterations.

The second approach takes quite a bit more time; for generating a
million words/numbers into a string on a Zen 4, 192_876_892 cycles
compared to 146_644_156 cycles.

The original goal was to produce a sequence where various parts of the
text interpreter have branch mispredictions; the less predictable the
sequence, the better. In this respect the second sequence is
significantly better. For a sequence of 1000 words/numbers,
text-interpreted 1000 times, the number of mispredictions is:

728_322 first approach
821_813 second approach

(with 247_125 mispredictions coming from (mainly) Gforth startup).

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/
--- Synchronet 3.21a-Linux NewsLink 1.2

From minforth@minforth@gmx.net to comp.lang.forth on Wed Oct 1 11:20:56 2025

From Newsgroup: comp.lang.forth

Am 30.09.2025 um 18:33 schrieb Anton Ertl:

I wanted to generate a random sequence consisting of a number, NOOP,
and DROP. When EVALUATEing the sequence, starting with an empty
stack, it should end with an empty stack, and should not underflow the
stack.

I implemented this twice. The first attempt just generated a random word/number in the normal case, and a specific kind of word/number in
the boundary cases:

: genseq {: u -- :}
0 u 0 do
case
dup 0= ?of 1+ ." 9 " endof
dup i + u u>= ?of 1- ." drop " endof
dup i + 1+ u u>= ?of ." noop " endof
rnd 3 um* nip 1- dup case
-1 of ." drop " endof
0 of ." noop " endof
1 of ." 8 " endof
endcase
+
0 endcase
loop
drop ;

When I do 16 GENSEQ with seed 1234, I get the following sequence:

9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop

The requirements are distributed across several places in this word,
which I did not like. I also did not like the non-randomness of the
sequence of DROPs at the end, and the non-randomness of the 9 whenever
the stack is empty. I also found that I needed to add a clause for
when the stack depth is one less than the number of remaining words.

So I thought about other ways to do it.

Eventually I came up with the following approach: Starts with counters
for the different kinds of words/numbers. Whenever one word/number is generated, its counter is reduced. At each step the probability of generating a kind of word/number is proportional to the remaining
counter for that word/number. The same-depth requirement is
implemented by having the same initial counter for the number and
DROP. The no-underflow requirement is implemented by skipping the
output of a DROP if the count of remaining DROPs = the count of the
remaining numbers.

Here's the code:

create counts 3 cells allot

: initcounts ( u -- )
dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;

: .counts ( -- )
counts 3 cell array>mem mem+do
i @ 4 .r
loop ;

: countsth-- ( u -- )
-1 counts rot th +! ;

: select1 ( u -- )
\ u is the index of the word
dup case
0 of ." 9 " countsth-- endof
1 of ." noop " countsth-- endof
2 of counts @ counts 2 th @ u< if
." drop " countsth--
else
drop endif
endof
endcase ;

: sum ( addr u -- u2 )
\ addr u is an array with u elements, u2 is the sum of the elements
0 -rot cell array>mem mem+do
i @ +
loop ;

: genseq ( u -- )
initcounts
begin
counts 3 sum dup while
rnd um* nip ( sum )
3 0 u+do ( sum1 )
counts i th @ 2dup u< if
i select1 then
- loop
drop
repeat
drop ;

The code is much longer, but I like it better. The requirements are concentrated in two places.

16 GENSEQ generates

9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop

for seed 123. A disadvantage of this approach is that it does not
generate a word/number on each iteration through the main loop. I
have thought about how to achieve this, but the result would be quite
a bit more complex; it's also only a minor issue. E.g., for
generating 1000 words/numbers with seed 123, the main loop has 1036 iterations.

The second approach takes quite a bit more time; for generating a
million words/numbers into a string on a Zen 4, 192_876_892 cycles
compared to 146_644_156 cycles.

The original goal was to produce a sequence where various parts of the
text interpreter have branch mispredictions; the less predictable the sequence, the better. In this respect the second sequence is
significantly better. For a sequence of 1000 words/numbers,
text-interpreted 1000 times, the number of mispredictions is:

728_322 first approach
821_813 second approach

(with 247_125 mispredictions coming from (mainly) Gforth startup).

How about
1) quick generation of a suboptimal list
2) create an array of pointers to the list elements
3) random shuffle the array e.g. https://en.wikipedia.org/wiki/Fisher-Yates_shuffle

--- Synchronet 3.21a-Linux NewsLink 1.2

From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Wed Oct 1 17:11:21 2025

From Newsgroup: comp.lang.forth

I used something similar - but with a whole slew of stack operations.
Very handy thingy. I use it until this day.

( abc -- abcabc) >r over over r@ rot rot r> \ >r 2dup r@ -rot r>

Hans Bezemer

On 30-09-2025 18:33, Anton Ertl wrote:

I wanted to generate a random sequence consisting of a number, NOOP,
and DROP. When EVALUATEing the sequence, starting with an empty
stack, it should end with an empty stack, and should not underflow the
stack.

I implemented this twice. The first attempt just generated a random word/number in the normal case, and a specific kind of word/number in
the boundary cases:

: genseq {: u -- :}
0 u 0 do
case
dup 0= ?of 1+ ." 9 " endof
dup i + u u>= ?of 1- ." drop " endof
dup i + 1+ u u>= ?of ." noop " endof
rnd 3 um* nip 1- dup case
-1 of ." drop " endof
0 of ." noop " endof
1 of ." 8 " endof
endcase
+
0 endcase
loop
drop ;

When I do 16 GENSEQ with seed 1234, I get the following sequence:

9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop

The requirements are distributed across several places in this word,
which I did not like. I also did not like the non-randomness of the
sequence of DROPs at the end, and the non-randomness of the 9 whenever
the stack is empty. I also found that I needed to add a clause for
when the stack depth is one less than the number of remaining words.

So I thought about other ways to do it.

Eventually I came up with the following approach: Starts with counters
for the different kinds of words/numbers. Whenever one word/number is generated, its counter is reduced. At each step the probability of generating a kind of word/number is proportional to the remaining
counter for that word/number. The same-depth requirement is
implemented by having the same initial counter for the number and
DROP. The no-underflow requirement is implemented by skipping the
output of a DROP if the count of remaining DROPs = the count of the
remaining numbers.

Here's the code:

create counts 3 cells allot

: initcounts ( u -- )
dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;

: .counts ( -- )
counts 3 cell array>mem mem+do
i @ 4 .r
loop ;

: countsth-- ( u -- )
-1 counts rot th +! ;

: select1 ( u -- )
\ u is the index of the word
dup case
0 of ." 9 " countsth-- endof
1 of ." noop " countsth-- endof
2 of counts @ counts 2 th @ u< if
." drop " countsth--
else
drop endif
endof
endcase ;

: sum ( addr u -- u2 )
\ addr u is an array with u elements, u2 is the sum of the elements
0 -rot cell array>mem mem+do
i @ +
loop ;

: genseq ( u -- )
initcounts
begin
counts 3 sum dup while
rnd um* nip ( sum )
3 0 u+do ( sum1 )
counts i th @ 2dup u< if
i select1 then
- loop
drop
repeat
drop ;

The code is much longer, but I like it better. The requirements are concentrated in two places.

16 GENSEQ generates

9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop

for seed 123. A disadvantage of this approach is that it does not
generate a word/number on each iteration through the main loop. I
have thought about how to achieve this, but the result would be quite
a bit more complex; it's also only a minor issue. E.g., for
generating 1000 words/numbers with seed 123, the main loop has 1036 iterations.

The second approach takes quite a bit more time; for generating a
million words/numbers into a string on a Zen 4, 192_876_892 cycles
compared to 146_644_156 cycles.

The original goal was to produce a sequence where various parts of the
text interpreter have branch mispredictions; the less predictable the sequence, the better. In this respect the second sequence is
significantly better. For a sequence of 1000 words/numbers,
text-interpreted 1000 times, the number of mispredictions is:

728_322 first approach
821_813 second approach

(with 247_125 mispredictions coming from (mainly) Gforth startup).

- anton

--- Synchronet 3.21a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Oct 1 17:10:34 2025

From Newsgroup: comp.lang.forth

minforth <minforth@gmx.net> writes:

Am 30.09.2025 um 18:33 schrieb Anton Ertl:

Eventually I came up with the following approach: Starts with counters
for the different kinds of words/numbers. Whenever one word/number is
generated, its counter is reduced. At each step the probability of
generating a kind of word/number is proportional to the remaining
counter for that word/number. The same-depth requirement is
implemented by having the same initial counter for the number and
DROP. The no-underflow requirement is implemented by skipping the
output of a DROP if the count of remaining DROPs = the count of the
remaining numbers.

Here's the code:

create counts 3 cells allot

: initcounts ( u -- )
dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;

: .counts ( -- )
counts 3 cell array>mem mem+do
i @ 4 .r
loop ;

: countsth-- ( u -- )
-1 counts rot th +! ;

: select1 ( u -- )
\ u is the index of the word
dup case
0 of ." 9 " countsth-- endof
1 of ." noop " countsth-- endof
2 of counts @ counts 2 th @ u< if
." drop " countsth--
else
drop endif
endof
endcase ;

: sum ( addr u -- u2 )
\ addr u is an array with u elements, u2 is the sum of the elements
0 -rot cell array>mem mem+do
i @ +
loop ;

: genseq ( u -- )
initcounts
begin
counts 3 sum dup while
rnd um* nip ( sum )
3 0 u+do ( sum1 )
counts i th @ 2dup u< if
i select1 then
- loop
drop
repeat
drop ;

...

How about
1) quick generation of a suboptimal list
2) create an array of pointers to the list elements
3) random shuffle the array e.g. >https://en.wikipedia.org/wiki/Fisher-Yates_shuffle

I thought about an approach based on this in between, with appropriate
changes to satisfy the requirements. Then I continued onwards to the
second approach shown above. The counters are a kind of compressed
form of "a suboptimal list". Every time you pick one of the elements
of the suboptimal list, the counter for that kind of element is
lowered by 1.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/
--- Synchronet 3.21a-Linux NewsLink 1.2

From minforth@minforth@gmx.net to comp.lang.forth on Wed Oct 1 20:42:37 2025

From Newsgroup: comp.lang.forth

Am 01.10.2025 um 17:11 schrieb Hans Bezemer:

I used something similar - but with a whole slew of stack operations.
Very handy thingy. I use it until this day.

( abc -- abcabc)-a-a >r over over r@ rot rot r> \ >r 2dup r@ -rot r>

Different and probably not as efficient as your code generator:

MinForth 3.6
# : 3DUP { a b c == a b c a b c } ;

--- Synchronet 3.21a-Linux NewsLink 1.2

From dxf@dxforth@gmail.com to comp.lang.forth on Thu Oct 2 19:49:29 2025

From Newsgroup: comp.lang.forth

On 2/10/2025 4:42 am, minforth wrote:

Am 01.10.2025 um 17:11 schrieb Hans Bezemer:

I used something similar - but with a whole slew of stack operations. Very handy thingy. I use it until this day.

( abc -- abcabc)-a-a >r over over r@ rot rot r> \ >r 2dup r@ -rot r>

Different and probably not as efficient as your code generator:

MinForth 3.6
# : 3DUP { a b c == a b c a b c } ;

For 3DUP I believe this is the one to beat:

: 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

With NTF/LFX the locals version will break even. For others, well, it may
be better not to look. For a straight-forward example of 'stack juggling', locals handle it rather poorly.

--- Synchronet 3.21a-Linux NewsLink 1.2

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Oct 2 13:07:54 2025

From Newsgroup: comp.lang.forth

In article <68de4aaa@news.ausics.net>, dxf <dxforth@gmail.com> wrote:

On 2/10/2025 4:42 am, minforth wrote:

Am 01.10.2025 um 17:11 schrieb Hans Bezemer:

I used something similar - but with a whole slew of stack operations.

Very handy thingy. I use it until this day.

( abc -- abcabc)-a-a >r over over r@ rot rot r> \ >r 2dup r@ -rot r>

Different and probably not as efficient as your code generator:

MinForth 3.6
# : 3DUP { a b c == a b c a b c } ;

For 3DUP I believe this is the one to beat:

: 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

CODE 3DUP
PUSH RSP[3*CELL_SIZE]
PUSH RSP[3*CELL_SIZE]
PUSH RSP[3*CELL_SIZE]
NEXT,

How does that look for 5DUP ?

With NTF/LFX the locals version will break even. For others, well, it may
be better not to look. For a straight-forward example of 'stack juggling', >locals handle it rather poorly.

Groetjes Albert
--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.
--- Synchronet 3.21a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Oct 2 20:44:40 2025

From Newsgroup: comp.lang.forth

dxf <dxforth@gmail.com> writes:

For 3DUP I believe this is the one to beat:

: 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

With NTF/LFX the locals version will break even.

As we already discussed in the thread including <2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
same (optimal for the calling convention used by NTF/LXF) code for
3DUP versions using the data stack, return stack, or locals. That's
because the actual data flow is always the same, and NTF/LXF can see
this data flow in all three cases.

For others, well, it may
be better not to look. For a straight-forward example of 'stack juggling', >locals handle it rather poorly.

Other Forth systems implement locals poorly. LXF/NTF demonstrates
that this is not due to some natural law, however.

There have been some improvements in Gforth since that time. Let's
see how the versions used in that thread look on today's gforth-fast.
Here are the versions of 3DUP:

: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.3 {: a b c :} a b c a b c ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

And here's the gforth-fast code on AMD64:

3dup.1 3dup.2 3dup.3 3dup.4

r 1->0 third 1->2 >l >l 1->1 dup 1->1

mov -$08[r14],r13 mov r15,$10[r10] >l 1->1 mov [r10],r13
sub r14,$08 third 2->3 mov -$08[rbp],r13 sub r10,$08 2dup 0->2 mov r9,$08[r10] mov rdx,$08[r10] 2over 1->3
mov r13,$10[r10] third 3->1 mov rax,rbp mov r15,$18[r10
mov r15,$08[r10] mov [r10],r13 add r10,$10 mov r9,$10[r10]
i 2->3 sub r10,$18 lea rbp,-$10[rbp] rot 3->1
mov r9,[r14] mov $10[r10],r15 mov -$10[rax],rdx mov [r10],r15 -rot 3->2 mov $08[r10],r9 mov r13,[r10] sub r10,$10
mov [r10],r9 ;s 1->1 >l @local0 1->1 mov $08[r10],r9
sub r10,$08 mov rbx,[r14] @local0 1->1 ;s 1->1

2->1 add r14,$08 mov rax,rbp mov rbx,[r14]

mov -$08[r10],r15 mov rax,[rbx] lea rbp,-$08[rbp] add r14,$08
sub r10,$10 jmp eax mov -$08[rax],r13 mov rax,[rbx]
mov $10[r10],r13 @local1 1->2 jmp eax
mov r13,[r14] mov r15,$08[rbp]
add r14,$08 @local2 2->1
;s 1->1 mov -$08[r10],r15
mov rbx,[r14] sub r10,$10
add r14,$08 mov $10[r10],r13
mov rax,[rbx] mov r13,$10[rbp]
jmp eax @local0 1->2
mov r15,$00[rbp]
@local1 2->3
mov r9,$08[rbp]
@local2 3->1
mov -$10[r10],r9
sub r10,$18
mov $10[r10],r15
mov $18[r10],r13
mov r13,$10[rbp]
lit 1->2
#24
mov r15,$50[rbx]
lp+! 2->1
add rbp,r15
;s 1->1
mov rbx,[r14]
add r14,$08
mov rax,[rbx]
jmp eax

Locals-haters, come to Gforth, where locals are implemented
inefficiently:-). The code for 3DUP.2 is actually optimal for
Gforth's calling convention.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/
--- Synchronet 3.21a-Linux NewsLink 1.2

From dxf@dxforth@gmail.com to comp.lang.forth on Fri Oct 3 18:22:37 2025

From Newsgroup: comp.lang.forth

On 2/10/2025 9:07 pm, albert@spenarnc.xs4all.nl wrote:

In article <68de4aaa@news.ausics.net>, dxf <dxforth@gmail.com> wrote:

On 2/10/2025 4:42 am, minforth wrote:

Am 01.10.2025 um 17:11 schrieb Hans Bezemer:

I used something similar - but with a whole slew of stack operations.

Very handy thingy. I use it until this day.

( abc -- abcabc)-a-a >r over over r@ rot rot r> \ >r 2dup r@ -rot r>

Different and probably not as efficient as your code generator:

MinForth 3.6
# : 3DUP { a b c == a b c a b c } ;

For 3DUP I believe this is the one to beat:

: 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

CODE 3DUP
PUSH RSP[3*CELL_SIZE]
PUSH RSP[3*CELL_SIZE]
PUSH RSP[3*CELL_SIZE]
NEXT,
...

That's equivalent to:

: 3DUP 2 PICK 2 PICK 2 PICK ;

I didn't notice but it's how VFX defines it. OTOH DUP 2OVER ROT
optimizes to the same machine code. IIRC I tested them all on
DX-Forth and settled on the latter. An 8086 code version would
beat it. An 8080 code version would be more trouble than worth.

--- Synchronet 3.21a-Linux NewsLink 1.2

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Fri Oct 3 11:02:33 2025

From Newsgroup: comp.lang.forth

In article <2025Oct2.224440@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

dxf <dxforth@gmail.com> writes:

For 3DUP I believe this is the one to beat:

: 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

With NTF/LFX the locals version will break even.

As we already discussed in the thread including ><2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
same (optimal for the calling convention used by NTF/LXF) code for
3DUP versions using the data stack, return stack, or locals. That's
because the actual data flow is always the same, and NTF/LXF can see
this data flow in all three cases.

The problem with 3DUP is that it is actually used in context.
What is the data that is going to 3DUP ped? In my view this
amounts to double use of data that is in registers (32 in the riscv)
anyway, after an optimiser does his thing.

<SNIP>

- anton

--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.
--- Synchronet 3.21a-Linux NewsLink 1.2

From minforth@minforth@gmx.net to comp.lang.forth on Fri Oct 3 11:09:07 2025

From Newsgroup: comp.lang.forth

Am 03.10.2025 um 11:02 schrieb albert@spenarnc.xs4all.nl:

In article <2025Oct2.224440@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

dxf <dxforth@gmail.com> writes:

For 3DUP I believe this is the one to beat:

: 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

With NTF/LFX the locals version will break even.

As we already discussed in the thread including
<2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
same (optimal for the calling convention used by NTF/LXF) code for
3DUP versions using the data stack, return stack, or locals. That's
because the actual data flow is always the same, and NTF/LXF can see
this data flow in all three cases.

The problem with 3DUP is that it is actually used in context.
What is the data that is going to 3DUP ped? In my view this
amounts to double use of data that is in registers (32 in the riscv)
anyway, after an optimiser does his thing.

Code inlining will mend it.
--- Synchronet 3.21a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Oct 4 08:04:09 2025

From Newsgroup: comp.lang.forth

minforth <minforth@gmx.net> writes:

Am 03.10.2025 um 11:02 schrieb albert@spenarnc.xs4all.nl:

The problem with 3DUP is that it is actually used in context.
What is the data that is going to 3DUP ped? In my view this
amounts to double use of data that is in registers (32 in the riscv)
anyway, after an optimiser does his thing.

Code inlining will mend it.

Inlining is important for Forth, but it does not make what has been
called an "analytical optimizer" unnecessary; on the contraray,
inlining increases the benefit we get from the analytical optimizer.
E.g., let's consider

: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.3 {: a b c :} a b c a b c ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

: foo.1 3dup.1 + ! ;
: foo.2 3dup.2 + ! ;
: foo.3 3dup.3 + ! ;
: foo.4 3dup.4 + ! ;

The result produced by VFX64 is:

foo.1 foo.2 foo.3 foo.4
PUSH EBX MOV EDX, EBX CALL 3DUP.3 MOV EDX, EBX
MOV EBX, [ESP] ADD EBX, [EBP] ADD EBX, [EBP] ADD EBX, [EBP]
POP EDX MOV ECX, [EBP+04] MOV EDX, [EBP+04] MOV ECX, [EBP+04]
ADD EDX, [EBP] MOV 0 [EBX], ECX MOV 0 [EBX], EDX MOV 0 [EBX], ECX
MOV ECX, [EBP+04] MOV EBX, EDX MOV EBX, [EBP+08] MOV EBX, EDX
MOV 0 [EDX], ECX NEXT, LEA EBP, [EBP+0C] NEXT, NEXT, NEXT,

VFX is only analytical about the data stack, and as a consequence, the implementations of 3dup that only use the data stack work best. When
the return stack is used, as in 3dup.1/foo.1, VFX produces
instructions (PUSH for >R, MOV ..., [ESP] for R@ and POP for R>) for
the return-stack operations. When locals are used, VFX actually
disables inlining and just calls 3DUP.3.

Other Forth systems make too little use of inlining, and I have to
resort to macros to simulate it. We cannot use proper macros for
3dup.3 (the locals-using variant), so I used EVALUATE-based macros;
this is just for experimental use, not for production, don't do this
at home:-)

Let's see what VFX64 produces for FOO.3 with this:

FOO.3
( 080C0C50 8BD4 ) MOV EDX, ESP
( 080C0C52 FF7504 ) PUSH [EBP+04]
( 080C0C55 FF7500 ) PUSH [EBP]
( 080C0C58 53 ) PUSH EBX
( 080C0C59 52 ) PUSH EDX
( 080C0C5A 57 ) PUSH EDI
( 080C0C5B 8BFC ) MOV EDI, ESP
( 080C0C5D 81EC00000000 ) SUB ESP, 00000000
( 080C0C63 8B5D08 ) MOV EBX, [EBP+08]
( 080C0C66 8D6D0C ) LEA EBP, [EBP+0C]
( 080C0C69 8B5708 ) MOV EDX, [EDI+08]
( 080C0C6C 03570C ) ADD EDX, [EDI+0C]
( 080C0C6F 8B4F08 ) MOV ECX, [EDI+08]
( 080C0C72 8B470C ) MOV EAX, [EDI+0C]
( 080C0C75 8D6DEC ) LEA EBP, [EBP+-14]
( 080C0C78 894D04 ) MOV [EBP+04], ECX
( 080C0C7B 894508 ) MOV [EBP+08], EAX
( 080C0C7E 8B4F10 ) MOV ECX, [EDI+10]
( 080C0C81 894D0C ) MOV [EBP+0C], ECX
( 080C0C84 895D10 ) MOV [EBP+10], EBX
( 080C0C87 8BDA ) MOV EBX, EDX
( 080C0C89 8B5710 ) MOV EDX, [EDI+10]
( 080C0C8C 895500 ) MOV [EBP], EDX
( 080C0C8F 8B5500 ) MOV EDX, [EBP]
( 080C0C92 8913 ) MOV 0 [EBX], EDX
( 080C0C94 8B5D04 ) MOV EBX, [EBP+04]
( 080C0C97 8D6D08 ) LEA EBP, [EBP+08]
( 080C0C9A 8B6704 ) MOV ESP, [EDI+04]
( 080C0C9D 8B3F ) MOV EDI, 0 [EDI]
( 080C0C9F C3 ) NEXT,

So inlining did not mend that.

Here's what lxf produces:

foo.1 foo.2 foo.3 foo.4
mov eax , ebx mov eax , ebx mov eax , ebx mov eax , ebx
add eax , [ebp] add eax , [ebp] add eax , [ebp] add eax , [ebp]
mov ecx , [ebp+4h] mov ecx , [ebp+4h] mov ecx , [ebp+4h] mov ecx , [ebp+4h]
mov [eax] , ecx mov [eax] , ecx mov [eax] , ecx mov [eax] , ecx
ret near ret near ret near ret near

So, because lxf is analytical about the return stack (and, through
that, about locals), inlining produces the same very good code in all
these cases.

You may notice that lxf produces a register-register move less than
VFX does for FOO.2/FOO.4. That's because VFX decided to modify the
TOS register (and has to restore it later), whereas lxf decided to
modify a copy of that register. One would have to make additional
observations to determine if lxf was just lucky here or if it
consistently makes the right decision in such cases.

And here's the code that gforth-fast (which does not have an
analytical optimizer) produces:

foo.1 foo.2 foo.3 foo.4

r 1->0 third 1->1 >l >l 1->1 dup 1->1

mov -8[r14],r13 mov [r10],r13 >l 1->1 mov [r10],r13
sub r14,$08 sub r10,$08 mov -$08[rbp],r13 sub r10,$08 2dup 0->2 mov r13,$18[r10] mov rdx,$08[r10] 2over 1->3
mov r13,$10[r10] third 1->2 mov rax,rbp mov r15,$18[r10]
mov r15,$08[r10] mov r15,$10[r10] add r10,$10 mov r9,$10[r10]
i 2->3 third 2->3 lea rbp,-$10[rbp] rot 3->3
mov r9,[r14] mov r9,$08[r10] mov -$10[rax],rdx mov rax,r13 -rot 3->2 + 3->2 mov r13,[r10] mov r13,r15
mov [r10],r9 add r15,r9 >l @local0 1->1 mov r15,r9
sub r10,$08 ! 2->0 @local0 1->1 mov r9,rax

2->3 mov [r15],r13 mov rax,rbp + 3->2

mov r9,[r14] ;s 0->1 lea rbp,-$08[rbp] add r15,r9
add r14,$08 mov r13,$08[r10] mov -$08[rax],r13 ! 2->0
+ 3->2 add r10,$08 @local1 1->2 mov [r15],r13
add r15,r9 mov rbx,[r14] mov r15,$08[rbp] ;s 0->1
! 2->0 add r14,$08 @local2 2->3 mov r13,$08[r10]
mov [r15],r13 mov rax,[rbx] mov r9,$10[rbp] add r10,$08 ;s 0->1 jmp eax @local0 3->1 mov rbx,[r14]
mov r13,$08[r10] mov -$10[r10],r9 add r14,$08
add r10,$08 sub r10,$18 mov rax,[rbx]
mov rbx,[r14] mov $10[r10],r15 jmp eax
add r14,$08 mov $18[r10],r13
mov rax,[rbx] mov r13,$00[rbp]
jmp eax @local1 1->2
mov r15,$08[rbp]
@local2 2->3
mov r9,$10[rbp]
+ 3->2
add r15,r9
! 2->0
mov [r15],r13
lit 0->1
#24
mov r13,$60[rbx]
lp+! 1->1
add r10,$08
add rbp,r13
mov r13,[r10]
;s 1->1
mov rbx,[r14]
add r14,$08
mov rax,[rbx]
jmp eax

Here inlining helps a little, but the disadvantages of the approach
are still obvious. With less optimization (e.g., no stack caching),
inlining would have helped even less.

And while we are at it, here's SwiftForth:

foo.1 foo.2 foo.3 foo.4
RBX PUSH -8 [RBP] RBP LEA -8 [RBP] RBP LEA -8 [RBP] RBP LEA
0 [RBP] RBX MOV RBX 0 [RBP] MOV RBX 0 [RBP] MOV RBX 0 [RBP] MOV
8 [RBP] RBP LEA 10 [RBP] RBX MOV 18 # EBX MOV -10 [RBP] RBP LEA -10 [RBP] RBP LEA -8 [RBP] RBP LEA LSPACE CALL RBX 8 [RBP] MOV
RBX 8 [RBP] MOV RBX 0 [RBP] MOV RBX 10 [R13] MOV 20 [RBP] RAX MOV
10 [RBP] RAX MOV 10 [RBP] RBX MOV 0 [RBP] RBX MOV RAX 0 [RBP] MOV
RAX 0 [RBP] MOV -8 [RBP] RBP LEA 8 [RBP] RBP LEA 18 [RBP] RBX MOV
-8 [RBP] RBP LEA RBX 0 [RBP] MOV RBX 8 [R13] MOV RBX RCX MOV
RBX 0 [RBP] MOV 10 [RBP] RBX MOV 0 [RBP] RBX MOV 8 [RBP] RBX MOV
0 [RSP] RBX MOV 0 [RBP] RAX MOV 8 [RBP] RBP LEA 0 [RBP] RAX MOV
RBX RCX MOV 8 [RBP] RCX MOV RBX 0 [R13] MOV RAX 8 [RBP] MOV
8 [RBP] RAX MOV RCX 0 [RBX] [RAX] MOV 0 [RBP] RBX MOV RCX 0 [RBP] MOV
0 [RBP] RBX MOV 10 [RBP] RBX MOV 8 [RBP] RBP LEA 0 [RBP] RAX MOV
RAX 0 [RBP] MOV 18 [RBP] RBP LEA -8 [RBP] RBP LEA 8 [RBP] RCX MOV
RCX 8 [RBP] MOV RET RBX 0 [RBP] MOV RCX 0 [RBX] [RAX] MOV
RAX POP 0 [R13] RBX MOV 10 [RBP] RBX MOV
RAX RBX ADD -8 [RBP] RBP LEA 18 [RBP] RBP LEA
0 [RBP] RAX MOV RBX 0 [RBP] MOV RET
RAX 0 [RBX] MOV 8 [R13] RBX MOV
8 [RBP] RBX MOV -8 [RBP] RBP LEA
10 [RBP] RBP LEA RBX 0 [RBP] MOV
RET 10 [R13] RBX MOV
-8 [RBP] RBP LEA
RBX 0 [RBP] MOV
0 [R13] RBX MOV
-8 [RBP] RBP LEA
RBX 0 [RBP] MOV
8 [R13] RBX MOV
-8 [RBP] RBP LEA
RBX 0 [RBP] MOV
10 [R13] RBX MOV
0 [RBP] RAX MOV
8 [RBP] RCX MOV
RCX 0 [RBX] [RAX] MOV
10 [RBP] RBX MOV
18 [RBP] RBP LEA
RET

And finally, iForth:

foo.1/foo.4 foo.2 foo.3
pop rbx mov rbx, [rsp #16 +] qword pop rbx
pop rdi mov rcx, rbx lea rsi, [rsi #-16 +] qword pop rax mov rbx, [rsp 8 +] qword mov [esi] dword, rbx
mov [edi ebx*1]dword,rax push rcx pop rbx
push rax mov rcx, rbx lea rsi, [rsi #-16 +] qword push rdi mov rbx, [rsp 8 +] qword mov [esi] dword, rbx
push rbx pop rdi pop rbx
; mov [ecx ebx*1] dword, rdi lea rsi, [rsi #-16 +] qword
; mov [esi] dword, rbx
mov rbx, [rsi #16 +] qword
add rbx, [rsi #32 +] qword
mov rdi, [rsi] qword
mov rax, [rsi #16 +] qword
mov rdx, [rsi #32 +] qword
mov [ebx] dword, rdi
push rdi
push rax
push rdx
add rsi, #48 b#
;

It's interesting that Gforth produced the same code for FOO.1 and
FOO.4, but different code for FOO.2. Both variants are suboptimal
IMO, because the contain unnecessary pushes.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration: https://euro.theforth.net/
--- Synchronet 3.21a-Linux NewsLink 1.2

From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sun Oct 5 11:29:41 2025

From Newsgroup: comp.lang.forth

On 02-10-2025 22:44, Anton Ertl wrote:

Locals-haters, come to Gforth, where locals are implemented inefficiently:-). The code for 3DUP.2 is actually optimal for
Gforth's calling convention.

I think I can beat you at that :)

1. This is the LOCAL definition

Addr| Opcode Operand Argument

35| branch 46 local
36| r> 0
37| swap 0
38| dup 0
39| >r 0
40| @ 0
41| >r 0
42| execute 0
43| r> 0
44| r> 0
45| ! 0
46| exit 0

2. This is the 3DUP with locals

Addr| Opcode Operand Argument

54| branch 82 3dup.1
55| literal 0
56| to 0 a
57| literal 0
58| environ 1
59| + 0
60| call 35 local
61| literal 0
62| to 1 b
63| literal 1
64| environ 1
65| + 0
66| call 35 local
67| literal 0
68| to 2 c
69| literal 2
70| environ 1
71| + 0
72| call 35 local
73| to 2 c
74| to 1 b
75| to 0 a
76| value 0 a
77| value 1 b
78| value 2 c
79| value 0 a
80| value 1 b
81| value 2 c
82| exit 0

3. This is 3DUP *without* locals

Addr| Opcode Operand Argument

83| branch 91 3dup.2
84| >r 0
85| over 0
86| over 0
87| r@ 0
88| rot 0
89| rot 0
90| r> 0
91| exit 0

And this is the sourcecode:

include lib/anstools.4th
include 4pp/lib/alocals.4pp

: clear depth 0 ?do drop loop ;
: 3dup.1 {: a b c -- a b c a b c :} a b c a b c ;
: 3dup.2 >r 2dup r@ -rot r> ;

1 2 3 3dup.1 .s clear
4 5 6 3dup.2 .s clear

Yeah, heavy use of the preprocessor. This is the expanded source:

: 3dup.1
[UNDEFINED] a [IF] 0 value a [THEN] ['] a >body local
[UNDEFINED] b [IF] 0 value b [THEN] ['] b >body local
[UNDEFINED] c [IF] 0 value c [THEN] ['] c >body local
to c to b to a a b c a b c ;

Maybe now the decompilation makes sense ;-)

Hans Bezemer

dxf <dxforth@gmail.com> writes:

For 3DUP I believe this is the one to beat:

: 3DUP ( a b c -- a b c a b c ) dup 2over rot ;

With NTF/LFX the locals version will break even.

As we already discussed in the thread including <2021Sep11.083507@mips.complang.tuwien.ac.at>, NTF/LXF produces the
same (optimal for the calling convention used by NTF/LXF) code for
3DUP versions using the data stack, return stack, or locals. That's
because the actual data flow is always the same, and NTF/LXF can see
this data flow in all three cases.

For others, well, it may
be better not to look. For a straight-forward example of 'stack juggling', >> locals handle it rather poorly.

Other Forth systems implement locals poorly. LXF/NTF demonstrates
that this is not due to some natural law, however.

There have been some improvements in Gforth since that time. Let's
see how the versions used in that thread look on today's gforth-fast.
Here are the versions of 3DUP:

: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.3 {: a b c :} a b c a b c ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

And here's the gforth-fast code on AMD64:

3dup.1 3dup.2 3dup.3 3dup.4

r 1->0 third 1->2 >l >l 1->1 dup 1->1

mov -$08[r14],r13 mov r15,$10[r10] >l 1->1 mov [r10],r13
sub r14,$08 third 2->3 mov -$08[rbp],r13 sub r10,$08 2dup 0->2 mov r9,$08[r10] mov rdx,$08[r10] 2over 1->3
mov r13,$10[r10] third 3->1 mov rax,rbp mov r15,$18[r10
mov r15,$08[r10] mov [r10],r13 add r10,$10 mov r9,$10[r10] i 2->3 sub r10,$18 lea rbp,-$10[rbp] rot 3->1
mov r9,[r14] mov $10[r10],r15 mov -$10[rax],rdx mov [r10],r15 -rot 3->2 mov $08[r10],r9 mov r13,[r10] sub r10,$10
mov [r10],r9 ;s 1->1 >l @local0 1->1 mov $08[r10],r9
sub r10,$08 mov rbx,[r14] @local0 1->1 ;s 1->1

2->1 add r14,$08 mov rax,rbp mov rbx,[r14]

mov -$08[r10],r15 mov rax,[rbx] lea rbp,-$08[rbp] add r14,$08
sub r10,$10 jmp eax mov -$08[rax],r13 mov rax,[rbx]
mov $10[r10],r13 @local1 1->2 jmp eax
mov r13,[r14] mov r15,$08[rbp]
add r14,$08 @local2 2->1
;s 1->1 mov -$08[r10],r15
mov rbx,[r14] sub r10,$10
add r14,$08 mov $10[r10],r13
mov rax,[rbx] mov r13,$10[rbp]
jmp eax @local0 1->2
mov r15,$00[rbp]
@local1 2->3
mov r9,$08[rbp]
@local2 3->1
mov -$10[r10],r9
sub r10,$18
mov $10[r10],r15
mov $18[r10],r13
mov r13,$10[rbp]
lit 1->2
#24
mov r15,$50[rbx]
lp+! 2->1
add rbp,r15
;s 1->1
mov rbx,[r14]
add r14,$08
mov rax,[rbx]
jmp eax

- anton

--- Synchronet 3.21a-Linux NewsLink 1.2

From antispam@antispam@fricas.org (Waldek Hebisch) to comp.lang.forth on Wed Oct 15 19:19:56 2025

From Newsgroup: comp.lang.forth

Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

I wanted to generate a random sequence consisting of a number, NOOP,
and DROP. When EVALUATEing the sequence, starting with an empty
stack, it should end with an empty stack, and should not underflow the
stack.

I implemented this twice. The first attempt just generated a random word/number in the normal case, and a specific kind of word/number in
the boundary cases:

: genseq {: u -- :}
0 u 0 do
case
dup 0= ?of 1+ ." 9 " endof
dup i + u u>= ?of 1- ." drop " endof
dup i + 1+ u u>= ?of ." noop " endof
rnd 3 um* nip 1- dup case
-1 of ." drop " endof
0 of ." noop " endof
1 of ." 8 " endof
endcase
+
0 endcase
loop
drop ;

When I do 16 GENSEQ with seed 1234, I get the following sequence:

9 drop 9 8 8 noop drop drop 8 noop drop noop noop 8 drop drop

The requirements are distributed across several places in this word,
which I did not like. I also did not like the non-randomness of the
sequence of DROPs at the end, and the non-randomness of the 9 whenever
the stack is empty. I also found that I needed to add a clause for
when the stack depth is one less than the number of remaining words.

So I thought about other ways to do it.

Eventually I came up with the following approach: Starts with counters
for the different kinds of words/numbers. Whenever one word/number is generated, its counter is reduced. At each step the probability of generating a kind of word/number is proportional to the remaining
counter for that word/number. The same-depth requirement is
implemented by having the same initial counter for the number and
DROP. The no-underflow requirement is implemented by skipping the
output of a DROP if the count of remaining DROPs = the count of the
remaining numbers.

Here's the code:

create counts 3 cells allot

: initcounts ( u -- )
dup 3 / dup counts ! dup counts 2 th ! 2* - counts cell+ ! ;

: .counts ( -- )
counts 3 cell array>mem mem+do
i @ 4 .r
loop ;

: countsth-- ( u -- )
-1 counts rot th +! ;

: select1 ( u -- )
\ u is the index of the word
dup case
0 of ." 9 " countsth-- endof
1 of ." noop " countsth-- endof
2 of counts @ counts 2 th @ u< if
." drop " countsth--
else
drop endif
endof
endcase ;

: sum ( addr u -- u2 )
\ addr u is an array with u elements, u2 is the sum of the elements
0 -rot cell array>mem mem+do
i @ +
loop ;

: genseq ( u -- )
initcounts
begin
counts 3 sum dup while
rnd um* nip ( sum )
3 0 u+do ( sum1 )
counts i th @ 2dup u< if
i select1 then
- loop
drop
repeat
drop ;

The code is much longer, but I like it better. The requirements are concentrated in two places.

16 GENSEQ generates

9 9 noop noop drop noop 9 drop noop drop noop 9 drop 9 noop drop

for seed 123. A disadvantage of this approach is that it does not
generate a word/number on each iteration through the main loop. I
have thought about how to achieve this, but the result would be quite
a bit more complex; it's also only a minor issue. E.g., for
generating 1000 words/numbers with seed 123, the main loop has 1036 iterations.

The second approach takes quite a bit more time; for generating a
million words/numbers into a string on a Zen 4, 192_876_892 cycles
compared to 146_644_156 cycles.

The original goal was to produce a sequence where various parts of the
text interpreter have branch mispredictions; the less predictable the sequence, the better. In this respect the second sequence is
significantly better. For a sequence of 1000 words/numbers,
text-interpreted 1000 times, the number of mispredictions is:

728_322 first approach
821_813 second approach

(with 247_125 mispredictions coming from (mainly) Gforth startup).

In related problem I used the following technique. Basic part
is: at each position you need to generate next step with right
probablity. How to know probabilities? Just count all possible
continuations. To do this with reasonable efficiency start
from the end and observe that legality of a continuation depends
only on number of elements on the stack, but does not depend on
sequence of operations that put the items on the stack. Once
you know counts for all contunuations of length k you can compute
counts for contunuations of length k + 1 (each possible first step
leads to new state for which count is known).

This produces uniform distribution, so in this sense gives
pretty good randomness. OTOH size of table of counts grows
quadratically with length, so it may be too costly to
compute for long sequences. But if one wants several sequences
of modest length (my case) cost of table is amortized over
many uses.
--
Waldek Hebisch
--- Synchronet 3.21a-Linux NewsLink 1.2

Who's Online

Recent Visitors

System Info

Generating a random sequence of Forth words