In the case of slower ITC, non-optimizing Forths - fig-Forth
being the most obvious example - the "boost" may be
noticeable.
I'll check that.
code +> ( addr1 addr2 addr3 -- )
dx pop cx pop bx pop        \ dx = addr3 (destination), cx = addr2, bx = addr1
0 [bx] ax mov cx bx xchg    \ ax = [addr1], then bx = addr2
0 [bx] ax add dx bx xchg    \ ax = [addr1] + [addr2], then bx = addr3
ax 0 [bx] mov next          \ [addr3] = ax, fall into NEXT
end-code
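For reference, this is the ML version of the equivalent high-level definition
(the same word given as a colon definition elsewhere in the thread):
: +> ( addr1 addr2 addr3 -- )
rot @ rot @ + swap ! ;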
Timing (adjusted for loop time):
var1 @ var2 @ + var3 !    8019 ms
var1 var2 var3 +>         5657 ms
So even in the case of a fast DTC Forth like DX Forth,
it's already something worth closer attention, I believe.
I expect an even bigger gain in the case of the older
fig-Forth model.
--
A potential alternative is
a pair of operations, say PUSH and POP, and a Forth compiler
that replaces a pair like V1 @ by PUSH(V1). Note that here the
address of V1 is intended to be part of PUSH (so it will
take as much space as separate V1 and @, but it is only a
single primitive).
More generally, a simple "optimizer" that replaces short
sequences of Forth primitives by a different, shorter sequence
of primitives is likely to give a similar gain. However, the
chance of a match decreases with the length of the sequence.
Above you bet on relatively long sequences (and on the programmer
writing the alternative sequence). Shorter sequences have more
chance of matching, so you need a smaller number of them
for a similar gain.
One can
do better than using the machine stack, namely keeping things in
registers, but that means generating machine code and doing
optimization.
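As an illustration of the peephole idea above, here is a toy sketch in
standard Forth. It is not hooked into any real compiler; it merely rewrites
a plain cell array of execution tokens, fusing each occurrence of the
two-xt sequence @ + into the single combined word @+ (all names here are
hypothetical):

: @+ ( n addr -- n' )  @ + ;          \ the combined word (a real system would make this a primitive)

: pair@+? ( addr -- f )               \ do addr and addr+cell hold the xts of @ and + ?
   dup @ ['] @ =  swap cell+ @ ['] + =  and ;

: rewrite ( src n dst -- dst n' )     \ copy n xts from src to dst, fusing @ + pairs
   dup >r swap                        \ src dst n  (R: dst)
   begin  dup 0> while
     dup 1 > if 2 pick pair@+? else false then
     if                               \ fuse the pair into one xt
       ['] @+ 2 pick !
       rot 2 cells + rot cell+ rot 2 -
     else                             \ copy a single xt unchanged
       2 pick @ 2 pick !
       rot cell+ rot cell+ rot 1-
     then
   repeat
   drop nip                           \ keep only the final dst pointer
   r> tuck - 1 cells / ;              \ return the start of dst and the new count

\ Example: the xt sequence  dup @ + .  becomes  dup @+ .
create code-in   ' dup , ' @ , ' + , ' . ,
create code-out  4 cells allot
code-in 4 code-out rewrite            \ leaves ( code-out 3 ) on the stack

A longer pattern table would work the same way, but, as noted above,
longer patterns match less often.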
zbigniew2011@gmail.com (LIT) writes:
V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms
Too much OOS thinking? Try
V1 @ V2 @ V1 ! V2 !
V1 V2 :=: - 15s 260ms
So there is a noticeable difference indeed.
The question is how often you use these new words in applications.
- anton
I remain skeptical of such optimizations. Not even twice the
performance, and one has to hope it sits in a bottleneck in order
to realize that gain.
I've got a feeling it would be more significant
in the 8088 era, say on an IBM 5150 or on XTs.
A 486 is probably already "too good" to see
as much as a 50% gain.
I've got a working XT board - if I manage to
get at least an FDD interface for it (no,
not today... it'll take some time) I'll
do some more testing.
: :=: ( a b -- )   \ exchange the values of two variables
  OVER @ >R        \ save the value stored at a on the return stack
  DUP @ ROT !      \ store the value at b into a
  R> SWAP ! ;      \ store the saved value into b
mhx@iae.nl (mhx) writes:
: :=: ( a b -- ) \ exchange values among two variables
OVER @ >R DUP @ ROT ! R> SWAP ! ;
: :=: ( addr1 addr2 -- )
OVER @ >R DUP @ ROT ! R> SWAP ! ;
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
: :=: ( addr1 addr2 -- )
OVER @ >R DUP @ ROT ! R> SWAP ! ;
: ex ( a1 a2 -- ) 2>r 2r@ @ swap @ r> ! r> ! ;
looks a little simpler.
On 27/02/2025 07:29, Anton Ertl wrote:
\ Anton Ertl
: exchange2 ( addr1 addr2 -- )
dup >r @ over @ r> ! swap ! ;
Results (on Zen4):
gforth-fast (development):
:=: exchange ex ex-locals exchange2
814_881_277 879_389_133 928_825_521 875_574_895 808_543_975 cyc.
3_908_874_164 3_708_891_336 4_508_966_770 4_209_778_557 3_708_865_505 inst.
...
How does a crude definition not involving the R stack compare:
: ex3 ( addr1 addr2 -- )
over @ over @ 3 pick ! over ! 2drop ;
Paul Rubin <no.email@nospam.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
: :=: ( addr1 addr2 -- )
OVER @ >R DUP @ ROT ! R> SWAP ! ;
: ex ( a1 a2 -- ) 2>r 2r@ @ swap @ r> ! r> ! ;
looks a little simpler.
This inspires another one:
: exchange2 ( addr1 addr2 -- )
dup >r @ over @ r> ! swap ! ;
With some other versions this results in the following benchmark
program:
[defined] !@ [if]
: exchange ( addr1 addr2 -- )
over @ swap !@ swap ! ;
[then]
\ Paul Rubin <875xkwo5io.fsf@nightsong.com>
: ex ( addr1 addr2 -- )
2>r 2r@ @ swap @ r> ! r> ! ;
: ex-locals {: x y -- :} x @ y @ x ! y ! ;
\ Anton Ertl
: exchange2 ( addr1 addr2 -- )
dup >r @ over @ r> ! swap ! ;
\ Marcel Hendrix
: :=: ( addr1 addr2 -- )
OVER @ >R DUP @ ROT ! R> SWAP ! ;
variable v1
variable v2
1 v1 !
2 v2 !
: bench ( "name" -- )
v1 v2
:noname ]] 100000000 0 do 2dup [[ parse-name evaluate ]] loop ; [[
execute ;
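The word to be measured is then named at the point of use, e.g.: bench :=: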
Results (on Zen4):
gforth-fast (development):
:=: exchange ex ex-locals exchange2
814_881_277 879_389_133 928_825_521 875_574_895 808_543_975 cyc.
3_908_874_164 3_708_891_336 4_508_966_770 4_209_778_557 3_708_865_505 inst.
vfx64 5.43:
:=: ex ex-locals exchange2
335_298_202 432_614_804 928_542_678 336_134_513 cyc.
1_166_400_242 1_366_264_943 2_866_547_067 1_166_280_641 inst.
And here's the code produced by gforth-fast:
:=: ex ex-locals exchange2
over 1->2 2>r 1->0 l 1->1 dup >r 1->1
mov r15,$08[r10] add r10,$08 mov rax,rbp >r 1->1
@ 2->2 mov r15,r13 add r10,$08 mov -$8[r14],r13
mov r15,[r15] mov r13,[r10] lea rbp,-$8[rbp] sub r14,$08
>r 2->1 mov -$8[r14],r13 mov -$8[rax],r13 @ 1->1
mov -$8[r14],r15 sub r14,$10 mov r13,[r10] mov r13,$00[r13]
sub r14,$08 mov [r14],r15 >l @local0 1->1 over 1->2
dup 1->2 2r@ 0->2 @local0 1->1 mov r15,$08[r10]
mov r15,r13 mov r13,$08[r14] mov rax,rbp @ 2->2
@ 2->2 mov r15,[r14] lea rbp,-$8[rbp] mov r15,[r15]
mov r15,[r15] @ 2->2 mov -$8[rax],r13 r> 2->3
rot 2->3 mov r15,[r15] @ 1->1 mov r9,[r14]
mov r9,$08[r10] swap 2->2 mov r13,$00[r13] add r14,$08
add r10,$08 mov rax,r13 @local1 1->2 ! 3->1
! 3->1 mov r13,r15 mov r15,$08[rbp] mov [r9],r15
mov [r9],r15 mov r15,rax @ 2->2 swap 1->2
1->2 @ 2->2 mov r15,[r15] mov r15,$08[r10]
mov r15,[r14] mov r15,[r15] @local0 2->3 add r10,$08
add r14,$08 r> 2->3 mov r9,$00[rbp] ! 2->0
swap 2->3 mov r9,[r14] ! 3->1 mov [r15],r13
add r10,$08 add r14,$08 mov [r9],r15 ;s 0->1
mov r9,r13 ! 3->1 @local1 1->2 mov r13,$08[r10]
mov r13,[r10] mov [r9],r15 mov r15,$08[rbp] add r10,$08
! 3->1 r> 1->2 ! 2->0 mov rbx,[r14]
mov [r9],r15 mov r15,[r14] mov [r15],r13 add r14,$08
;s 1->1 add r14,$08 lp+2 0->1 mov rax,[rbx]
mov rbx,[r14] ! 2->0 mov r13,$08[r10] jmp eax
add r14,$08 mov [r15],r13 add r10,$08
mov rax,[rbx] ;s 0->1 add rbp,$10
jmp eax mov r13,$08[r10] ;s 1->1
add r10,$08 mov rbx,[r14]
mov rbx,[r14] add r14,$08
add r14,$08 mov rax,[rbx]
mov rax,[rbx] jmp eax
jmp eax
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Results (on Zen4):
gforth-fast (development): ...
It's interesting how little difference there is with gforth-fast. Could
you also do gforth-itc?
exchange2 is a big win with VFX, suggesting its
optimizer could do better with some of the other versions.
Another variant:
: exchange ( addr1 addr2 -- )
dup @ rot !@ swap ! ;
This uses the primitive
'!@' ( u1 a-addr -- u2 ) gforth-experimental "store-fetch"
load U2 from A_ADDR, and store U1 there, as atomic operation
I worry that the atomic part will result in it being slower than the
versions that do not use !@.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
!@ is now the nonatomic version.
Is the nonatomic one useful often?
We've done without it all this time.
Paul Rubin <no.email@nospam.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
!@ is now the nonatomic version.
Is the nonatomic one useful often?
Some numbers of uses in the Gforth image:
11 !@
3 atomic!@
66 +!
We've done without it all this time.
Sure, you can replace it with DUP @ >R ! R>. Having a word for that
relieves the programmer of producing such a sequence (possibly with a
bug) and the reader of having to analyse what's going on here.
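For reference, the nonatomic operation is just that sequence wrapped in a
definition (a minimal sketch; the actual primitive is of course implemented
at a lower level):
: !@ ( u1 a-addr -- u2 )  \ fetch the old contents of a-addr, then store u1 there
dup @ >r ! r> ;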
On 01-03-2025 12:47, Anton Ertl wrote:
11 !@
3 atomic!@
66 +!
We've done without it all this time.
Sure, you can replace it with DUP @ >R ! R>. Having a word for that
relieves the programmer of producing such a sequence (possibly with a
bug) and the reader of having to analyse what's going on here.
I found the sequence exactly twice in my code.
However, if it is that rare, there is no point in adding it. Creating too
many superfluous abstractions may even become counterproductive, in the
sense that predefined abstractions are ignored and reinvented.
I wonder: wouldn't it be useful to have stackless basic
arithmetic operations? I mean, instead of fetching the values
first and putting them on the stack, then doing something,
and at the end storing the result somewhere, wouldn't it
be practical to use the variables directly?
But after I came up with this idea I realized someone
surely invented it before - it looks so obvious - yet
I haven't seen it anywhere.
Did any of you see something
like this in any code?
If so - why has such a solution (probably?) somehow
not become widespread?
Looks good to me; the math can be done completely in ML,
avoiding "Forth machine" engagement and therefore saving many
cycles.
Probably because the case where the two operands
of a + are in memory, and the result is needed
in memory, is not that frequent.
One example could be matrix multiplication.
It's a rather trivial but cumbersome operation,
where usually a few transitional variables are
used to maintain the clarity of the code.
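For instance, a running total kept in such a transitional variable could be
updated with the +> word discussed in this thread (a sketch; SUM and TERM
are hypothetical names):
variable sum   variable term
: accumulate ( -- )  term sum sum +> ;  \ sum := sum + term, with no stack traffic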
Probably "bigger" Forth compilers are indeed
already "too good" for the difference to be
(practically) noticeable — still maybe for
simpler Forths, I mean like the ones for DOS
or even for 8-bit machines it would make sense?
Earlier you wrote about performance, now you switch to clarity of the
code. What is the goal?
Both - one isn't contrary to the other.
What I have in mind is: by performing an OOS operation
we don't have to employ the whole "Forth machine" to
do the usual things (I mean, of course, the usual
steps described by Brad Rodriguez in his "Moving
Forth" paper).
It comes at a cost: the usual Forth words, which use
the stack, are versatile, while such OOS words
aren't that versatile anymore - yet (at least in
the case of ITC, non-optimizing Forths) they should
be faster.
Clarity of the code comes as a "bonus" :) Yes, we've
got VALUEs and I use them when needed, but their use
still means employing the "Forth machine".
I mean the description of how the "Forth machine" works:
"Assume SQUARE is encountered while executing some other Forth word.
Forth's Interpreter Pointer (IP) will be pointing to a cell in memory -- contained within that "other" word -- which contains the address of the
word SQUARE. (To be precise, that cell contains the address of SQUARE's
Code Field.) The interpreter fetches that address, and then uses it to
fetch the contents of SQUARE's Code Field. These contents are yet
another address -- the address of a machine language subroutine which performs the word SQUARE. In pseudo-code, this is:
(IP) -> W fetch memory pointed by IP into "W" register
...W now holds address of the Code Field
IP+2 -> IP advance IP, just like a program counter
(assuming 2-byte addresses in the thread)
(W) -> X fetch memory pointed by W into "X" register
...X now holds address of the machine code
JP (X) jump to the address in the X register
This illustrates an important but rarely-elucidated principle: the
address of the Forth word just entered is kept in W. CODE words don't
need this information, but all other kinds of Forth words do.
If SQUARE were written in machine code, this would be the end of the
story: that bit of machine code would be executed, and then jump back to
the Forth interpreter -- which, since IP was incremented, is pointing to
the next word to be executed. This is why the Forth interpreter is
usually called NEXT.
But, SQUARE is a high-level "colon" definition… [..]” etc.
( https://www.bradrodriguez.com/papers/moving1.htm )
Many of these steps can, in particular cases, be avoided
by the use of the proposed OOS words, making the Forth
program (at least sometimes) faster - and, as a kind of
"bonus", the clarity of the code increases.
Probably in the case of an "optimizing compiler" the gain
may not be too significant, from what I have already learned
here; still, in the case of simpler compilers - and maybe
especially the ones created for CPUs not that suitable
for Forth at all (lack of registers, like the 8051, for
example) - it may well be advantageous.
By the "Forth machine" I mean the internal work of the
Forth system - see the above quote from Brad's paper.
When we don't need to "fetch memory pointed by
IP into the W register, advance IP, just like a program
counter" etc. - replacing that whole process
(which is repeated for each subsequent word again and
again) with a short string of ML instructions - we should
see a significant gain in processing speed.
I wonder: wouldn't it be useful to have stackless basic
arithmetic operations? I mean, instead of fetching the values
first and putting them on the stack, then doing something,
and at the end storing the result somewhere, wouldn't it
be practical to use the variables directly? Like this:
: +> ( addr1 addr2 addr3 -- )
rot @ rot @ + swap ! ;
Of course the above is just an illustration; I mean coding
such a word directly in ML. It should be significantly
faster than going through the stack the usual way.
But after I came up with this idea I realized someone
surely invented it before - it looks so obvious - yet
I haven't seen it anywhere. Did any of you see something
like this in any code? If so - why has such a solution
(probably?) somehow not become widespread?
Looks good to me; the math can be done completely in ML,
avoiding "Forth machine" engagement and therefore saving many
cycles.
----
I agree with you - still, it does take a decent Forth programmer.
Recall the ones described by Jeff Fox? Those Forth programmers
who refused to use Machine Forth just because "they were hired
to program in ANS Forth"?
I don't believe they would have been able to recode anything in
assembler - and note, that was about 30 years ago. Since then
assembler programming has become even less popular.
A bit off-topic: I have been in a similar situation when some of
our service engineers were very reluctant to modify the inner
software parts of controllers. The guys were not dumb, but with
such modifications comes responsibility when something unexpected
happens, like a system crash. So it was more of a legal than a
technical issue.
Yes, I'm aware the reason may be different in that
case; still, Jeff portrayed the situation rather clearly:
they didn't want to use Machine Forth just because "they
were paid for ANS Forth programming" - they had signed a kind
of agreement for that, therefore they "weren't interested" in
any changes, etc.
Unfortunately we no longer have any opportunity to ask
Jeff for more details.
--
I know nothing about Machine Forth.
BTW: is it available for download anywhere (if not
commercial/restricted)?
So I did some quite basic testing with x86
fig-Forth for DOS. I devised 4 OOS words:
:=: (exchange values between two variables)
pop BX          ; BX = address of the second variable (TOS)
pop DI          ; DI = address of the first variable
mov AX,[BX]     ; AX = second variable's value
xchg AX,[DI]    ; swap it with the first variable's value
mov [BX],AX     ; the first variable's old value goes into the second
jmp NEXT
++ (increment variable by one)
pop BX
inc WORD PTR [BX]
jmp NEXT
-- (similar to the above, just uses DEC - not tested, it'll give the same result)
+> (add two variables, then store the result into a third one)
pop DI          ; DI = destination address (TOS)
pop BX          ; BX = second operand's address
mov CX,[BX]     ; CX = second operand
pop BX          ; BX = first operand's address
mov AX,[BX]     ; AX = first operand
add AX,CX       ; AX = sum of the two operands
mov [DI],AX     ; store the sum into the destination
jmp NEXT
How the simplistic tests have been done:
7 VARIABLE V1
8 VARIABLE V2
9 VARIABLE V3
: TOOK ( t1 t2 -- )
DROP SPLIT TIME@ DROP SPLIT
ROT SWAP - CR ." It took " U. ." seconds and "
- 10 * U. ." milliseconds "
;
: TEST1
1000 0 DO 10000 0 DO
...expression...
LOOP LOOP
;
0 0 TIME! TIME@ TEST TOOK
The results are (for the following expressions):
V1 @ V2 @ + V3 ! - 25s 430ms
V1 V2 V3 +> - 17s 240ms
1 V1 +! - 14s 60ms
V1 ++ - 10s 820ms
V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms
V1 V2 :=: - 15s 260ms
So there is a noticeable difference indeed.