As I had trouble finding it, perhaps others did too. Here's the link:
http://www.euroforth.org/ef25/papers/
There is no link from the main page.
dxf <dxforth@gmail.com> writes:
As I had trouble finding it, perhaps others did too. Here's the link:
http://www.euroforth.org/ef25/papers/
There is no link from the main page.
Thank you. As it happens, yesterday I created the post-conference
proceedings, which include a late paper and the slides that were
provided by their authors (not that many; apparently many authors are
content with the prospect of their presentation being preserved on
video). I have now updated various links for the post-conference
state (a link from www.euroforth.org to the proceedings, and from the
proceedings to the euro.theforth.net page).
Unfortunately, the videos are not yet available. Gerald Wodni has not
yet had the time to process them. He mentioned something like "after January" or somesuch.
I think that submitting slides has not just the advantage that they
are published earlier in this case (or at all in the 2023 case, where
the audio was so problematic that most videos were not published), but
also that one can read them faster than watch a video; the videos have
the audio track and interactive demos in addition to the text and
graphics of the slides, though.
- anton
On 15-01-2026 13:04, Anton Ertl wrote:...
A few observations concerning the IMHO most interesting paper,
"Code-Copying Compilation in Production":
3. Commercial compilers (partly) using conventional compilers (see TF,
fig. 4.7) - that was new to me;
4. GCC -O1 outperforming GCC -O3 on some benchmarks. That's new to me
too. I might experiment with that one;
5. I added GCC extension support to 4tH in version 3.62.0. At the
time, it improved performance by about 25%. By accident I found out
that was no longer true. switch() based was faster. I didn't know
there had been changes in that regard to GCC.
Hans Bezemer <the.beez.speaks@gmail.com> writes:
5. I added GCC extension support to 4tH in version 3.62.0. At the
time, it improved performance by about 25%. By accident I found out
that was no longer true. switch() based was faster. I didn't know
there had been changes in that regard to GCC.
If you mean the goto *a feature, these days you might try using tail
calls instead. GCC and LLVM both now support a musttail attribute that ensures this optimization, or signals a compile-time error if it can't.
https://lwn.net/Articles/1033373/
The tail call method, however, requires an entirely different
VM. That's a lot of work for about a 10% performance improvement that
may not even last a single GCC update. And it requires two VMs to maintain.
Hans Bezemer <the.beez.speaks@gmail.com> writes:
The tail call method, however, requires an entirely different
VM. That's a lot of work for about a 10% performance improvement that
may not even last a single GCC update. And it requires two VMs to
maintain.
You'd have to change the VM, but on the other hand it's a documented
and supported feature of both GCC and Clang, and other compilers might
get it too. I wouldn't worry about it vanishing with the next GCC update.
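For concreteness, a tail-call-dispatched inner interpreter can be
sketched like this. This is a hypothetical opcode set with illustrative
names, not 4tH's or Gforth's actual VM; the MUSTTAIL macro degrades to a
plain (opportunistic) tail call on compilers without the attribute:

```c
#include <assert.h>

/* Portability shim: use musttail where the compiler supports it,
   otherwise fall back to an ordinary, opportunistic tail call. */
#if defined(__has_attribute)
# if __has_attribute(musttail)
#  define MUSTTAIL __attribute__((musttail))
# endif
#endif
#ifndef MUSTTAIL
# define MUSTTAIL
#endif

/* Each opcode is its own function; the VM state travels in the
   parameters so it can stay in registers across the tail calls.
   The table holds void* and is cast back to op_t on dispatch (a
   common, technically non-standard trick on POSIX platforms). */
typedef long (*op_t)(const unsigned char *ip, void **tbl,
                     long top, long *sp);

/* Fetch the next opcode and tail-call its handler. */
#define DISPATCH() do {                           \
        op_t next = (op_t)tbl[*ip++];             \
        MUSTTAIL return next(ip, tbl, top, sp);   \
    } while (0)

enum { OP_LIT, OP_ADD, OP_HALT };

static long op_lit(const unsigned char *ip, void **tbl, long top, long *sp) {
    *++sp = top;      /* push the old top of stack */
    top = *ip++;      /* the literal is the next bytecode byte */
    DISPATCH();
}

static long op_add(const unsigned char *ip, void **tbl, long top, long *sp) {
    top += *sp--;     /* add the next stack item into top */
    DISPATCH();
}

static long op_halt(const unsigned char *ip, void **tbl, long top, long *sp) {
    (void)ip; (void)tbl; (void)sp;
    return top;       /* plain return: the chain of tail calls unwinds here */
}

long vm_run(const unsigned char *ip) {
    static void *tbl[] = { (void *)op_lit, (void *)op_add, (void *)op_halt };
    long stack[64];
    op_t first = (op_t)tbl[*ip++];
    return first(ip, tbl, 0, stack);  /* not a tail call: stack[] must live on */
}
```

Compared with goto * dispatch, every opcode becomes a separate function
with an identical signature, which is exactly why it amounts to "an
entirely different VM".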
Hans Bezemer <the.beez.speaks@gmail.com> writes:
5. I added GCC extension support to 4tH in version 3.62.0. At the
time, it improved performance by about 25%. By accident I found out
that was no longer true. switch() based was faster. I didn't know
there had been changes in that regard to GCC.
If you mean the goto *a feature, these days you might try using tail
calls instead. GCC and LLVM both now support a musttail attribute that
ensures this optimization, or signals a compile-time error if it can't.
https://lwn.net/Articles/1033373/
On 17-01-2026 08:10, Paul Rubin wrote:
Hans Bezemer <the.beez.speaks@gmail.com> writes:
5. I added GCC extension support to 4tH in version 3.62.0. At the
time, it improved performance by about 25%. By accident I found out
that was no longer true. switch() based was faster. I didn't know
there had been changes in that regard to GCC.
The tail call method, however, requires an entirely different VM.
Hans Bezemer <the.beez.speaks@gmail.com> writes:
5. I added GCC extension support to 4tH in version 3.62.0. At the
time, it improved performance by about 25%. By accident I found out
that was no longer true. switch() based was faster. I didn't know
there had been changes in that regard to GCC.
If you mean the goto *a feature, these days you might try using tail
calls instead. GCC and LLVM both now support a musttail attribute that ensures this optimization, or signals a compile-time error if it can't.
https://lwn.net/Articles/1033373/
for the tail call version it was changed to

  RELOAD()  opcode func = (opcode)tbl[*ip++];
            __attribute__((musttail)) return func(ip, tbl, TOP, FTOP, sp, rp, fp, lp);

and the generated code for the tailcall version is

  mov    rax, qword ptr [r13 + 8*rax]
  rex64  jmp  rax
It also turns out that the musttail attribute is not necessary;
it will generate a tailcall anyway. The difference is that with
musttail it will report an error if it cannot do the tailcall.
Unfortunately GCC does not recognize preserve_none and uses the stack
for some parameters.
// VM8 C variant using tailcalls ...
#define FUNC __attribute__((preserve_none)) void
If you pass an address a as a tail call, is it approximately equal
to coroutines?
peter <peter.noreply@tin.it> writes:
for the tail call version it was changed to
RELOAD() opcode func=(opcode)tbl[*ip++]; __attribute__((musttail))
return func(ip, tbl, TOP, FTOP, sp, rp, fp, lp)
You could possibly use "inline RELOAD() { .... ;}" instead of the macro.
and for the tailcall version
mov rax, qword ptr [r13 + 8*rax]
rex64 jmp rax
I wonder why the tailcall version didn't combine the mov with the jmp
like the other version did.
It also turns out that the musttail attribute is not necessary;
it will generate a tailcall anyway. The difference is that with
musttail it will report an error if it cannot do the tailcall.
Yes, TCO has been present since the beginning but it's been
opportunistic rather than something you can rely on.
Unfortunately GCC does not recognize preserve_none and uses the stack
for some parameters
Oh that's interesting. I half remember there being some other feature
for that, but who knows. Does -fwhole-program help?
albert@spenarnc.xs4all.nl writes:
If you pass an address a as a tail call, is it approximately equal
to coroutines?
No I don't think so. The tail call is just a jump to that address
(changes the program counter). A coroutine jump also has to change the
stack pointer. See the section "Knuth's coroutines" here:
https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html
Some Forths have a CO primitive that I think is similar. There is
something like it on the Greenarrays processor.
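The switch-based trick from that article can be sketched as follows
(illustrative names, not code from the article): the coroutine must
record its resume point, here in a static state variable, which is
exactly the extra state a plain tail call does not carry.

```c
#include <assert.h>

/* Tatham-style stackless coroutine: `state` records the resume point,
   and the switch jumps back into the middle of the loop on re-entry. */
int next_value(void) {
    static int state = 0;
    static int i;               /* loop variable must survive across calls */
    switch (state) {
    case 0:
        for (i = 1; i <= 3; i++) {
            state = 1;
            return i;           /* "yield" i to the caller */
    case 1:;                    /* execution resumes here on the next call */
        }
    }
    state = 0;                  /* reset so the sequence can replay */
    return -1;                  /* sequence exhausted */
}
```

Successive calls return 1, 2, 3, then -1: the function keeps its own
control state, which a tail call (a bare change of program counter)
does not.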
On Fri, 16 Jan 2026 23:10:24 -0800
Paul Rubin <no.email@nospam.invalid> wrote:
The VM was written from the beginning in x64 assembler (13 years ago).
4 years ago I also implemented the VM in C to simplify porting to ARM64.
At that time the asm version was about 10% faster than the generated
C code; today the speed is about the same. C compilers have improved.
Much more important is the __attribute__((preserve_none)) before
each function. This indicates that more registers will be used to pass
parameters. As seen above I pass 8 parameters to each function, and
they need to be in registers to match the assembler-written code.
This is done automatically in the goto version, as everything is in
one function there.
In the end it is more a question of how you like to write your VM: as
one function, or one for each "opcode".
Unfortunately GCC does not recognize preserve_none and uses the stack
for some parameters
peter <peter.noreply@tin.it> writes:
and for the tailcall version
mov rax, qword ptr [r13 + 8*rax]
rex64 jmp rax
I wonder why the tailcall version didn't combine the mov with the jmp
like the other version did.
Yes, TCO has been present since the beginning but it's been
opportunistic rather than something you can rely on.
Unfortunately GCC does not recognize preserve_none and uses the stack
for some parameters
Oh that's interesting. I half remember there being some other feature
for that, but who knows.
Does -fwhole-program help?
Hans Bezemer <the.beez.speaks@gmail.com> writes:
On 15-01-2026 13:04, Anton Ertl wrote:...
A few observations concerning the IMHO most interesting paper,
"Code-Copying Compilation in Production":
3. Commercial compilers (partly) using conventional compilers (see TF,
fig. 4.7) - that was new to me;
All Forth compilers I know work at the text interpretation level as
the "Forth compiler" of Thinking Forth, Figure 4.7.
4. GCC -O1 outperforming GCC -O3 on some benchmarks. That's new to me
too. I might experiment with that one;
I have analyzed it for bubblesort. There the problem is that gcc -O3
auto-vectorizes the pair of loads and the pair of stores (when the two
elements are swapped). As a result, if a pair is stored in one
iteration, the next iteration loads a pair that overlaps the
previously stored pair. This means that the hardware cannot use its
fast path in store-to-load forwarding, and leads to a huge slowdown.
For a benchmark that has been around for over 40 years.
In addition, the code generated by gcc -O3 also executes several
additional instructions per iteration, so I doubt that it would be
faster even if the store-to-load forwarding problem did not exist.
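For reference, the kind of loop in question is the classic C bubble
sort, roughly as below (a sketch, not necessarily the exact benchmark
source). The swap's adjacent load pair and store pair are what -O3
turns into 16-byte vector accesses, and the next iteration's load pair
overlaps the pair just stored:

```c
#include <assert.h>
#include <stddef.h>

/* Classic bubble sort. The two loads and two stores in the swap touch
   adjacent cells a[j] and a[j+1]; when vectorized into 16-byte
   accesses, iteration j+1 loads a pair that overlaps the pair stored
   in iteration j, defeating fast store-to-load forwarding. */
void bubble_sort(long *a, size_t n) {
    for (size_t i = 0; i + 1 < n; i++)
        for (size_t j = 0; j + 1 < n - i; j++)
            if (a[j] > a[j + 1]) {
                long t = a[j];        /* load pair:  a[j], a[j+1] */
                a[j] = a[j + 1];
                a[j + 1] = t;         /* store pair: a[j], a[j+1] */
            }
}
```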
For fib, I have also looked at the generated code, but have not
understood it well enough to see why the code generated by gcc -O3 is
slower.
- anton
I've done my thing: compiled 4tH with optimizations -O3 down to -O0.
I thought, let's make this simple and execute ALL benchmarks I've got.
Some of them have become useless, though, for the simple reason that
hardware has become that much better.
But still, here it is. Overall, the performance consistently
deteriorates as the optimization level goes down, aka -O3 gives the
best performance.
Hans Bezemer <the.beez.speaks@gmail.com> writes:
I've done my thing: compiled 4tH with optimizations -O3 down to -O0.
I thought, let's make this simple and execute ALL benchmarks I've got.
Some of them have become useless, though, for the simple reason that
hardware has become that much better.
But still, here it is. Overall, the performance consistently
deteriorates as the optimization level goes down, aka -O3 gives the
best performance.
Which compiler and which hardware?
For a random program, I would expect higher optimization levels to
produce faster code. For a Forth system and these recent gccs, the
auto-vectorization of adjacent memory accesses may lead to similar
problems as in the C bubble-sort benchmark. In Gforth, this actually
happens unless we disable vectorization (which we normally do), and,
moreover, with the vectorized code, gcc introduces additional
inefficiencies (see below).
Here's the output of ./gforth-fast onebench.fs compiled from the
current development version with gcc-12.2 and running on a Ryzen 5800X
(numbers are times, lower is better):
sieve bubble matrix fib fft gcc options
0.025 0.023 0.013 0.033 0.016 -O2
0.025 0.023 0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)
0.404 0.418 0.377 0.472 0.244 -O3 (with auto vectorization)
0.145 0.122 0.124 0.122 0.073 gforth default, using --no-dynamic
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Hans Bezemer <the.beez.speaks@gmail.com> writes:
I've done my thing: compiled 4tH with optimizations -O3 down to -O0.
I thought, let's make this simple and execute ALL benchmarks I've got.
Some of them have become useless, though, for the simple reason that
hardware has become that much better.
But still, here it is. Overall, the performance consistently
deteriorates as the optimization level goes down, aka -O3 gives the
best performance.
Which compiler and which hardware?
For a random program, I would expect higher optimization levels to
produce faster code. For a Forth system and these recent gccs, the
auto-vectorization of adjacent memory accesses may lead to similar
problems as in the C bubble-sort benchmark. In Gforth, this actually
happens unless we disable vectorization (which we normally do), and,
moreover, with the vectorized code, gcc introduces additional
inefficiencies (see below).
Here's the output of ./gforth-fast onebench.fs compiled from the
current development version with gcc-12.2 and running on a Ryzen 5800X
(numbers are times, lower is better):
sieve bubble matrix fib fft gcc options
0.025 0.023 0.013 0.033 0.016 -O2
0.025 0.023 0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)
0.404 0.418 0.377 0.472 0.244 -O3 (with auto vectorization)
0.145 0.122 0.124 0.122 0.073 gforth default, using --no-dynamic
I have now also tried it with gcc-14.2, and that produces better code.
Results from a Xeon E-2388G (Rocket Lake):
sieve bubble matrix fib fft gcc options
0.032 0.032 0.015 0.037 0.014 -O2
0.035 0.032 0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
0.033 0.034 0.016 0.032 0.014 -O3 (with auto vectorization)
The code for ROT and 2SWAP does not use auto-vectorization, and the
code for 2! uses auto-vectorization in a way that reduces the
instruction count:
-O3 (auto-vectorized) -O3 -fno-tree-vectorize
add $0x8,%rbx add $0x8,%rbx
movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
add $0x18,%r13 mov 0x8(%r13),%rdx
movhps -0x8(%r13),%xmm0 add $0x18,%r13
movups %xmm0,(%r8) mov %rdx,(%r8)
mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
mov (%rbx),%rax mov 0x0(%r13),%r8
jmp *%rax mov (%rbx),%rax
jmp *%rax
And the common tail with all these move instructions is gone.
- anton
On Sat, 24 Jan 2026 16:47:16 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I have now also tried it with gcc-14.2, and that produces better code.
Results from a Xeon E-2388G (Rocket Lake):
sieve bubble matrix fib fft gcc options
0.032 0.032 0.015 0.037 0.014 -O2
0.035 0.032 0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
0.033 0.034 0.016 0.032 0.014 -O3 (with auto vectorization)
The code for ROT and 2SWAP does not use auto-vectorization, and the
code for 2! uses auto-vectorization in a way that reduces the
instruction count:
-O3 (auto-vectorized) -O3 -fno-tree-vectorize
add $0x8,%rbx add $0x8,%rbx
movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
add $0x18,%r13 mov 0x8(%r13),%rdx
movhps -0x8(%r13),%xmm0 add $0x18,%r13
movups %xmm0,(%r8) mov %rdx,(%r8)
mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
mov (%rbx),%rax mov 0x0(%r13),%r8
jmp *%rax mov (%rbx),%rax
jmp *%rax
And the common tail with all these move instructions is gone.
- anton
What does your C code look like? I could not get clang or gcc to
auto-vectorize with my existing code:
UNS64 *tmp64 = (UNS64*)TOP;
tmp64[0] = sp[0];
tmp64[1] = sp[1];
TOP = sp[2];
sp += 3;
typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
On Mon, 26 Jan 2026 19:24:43 GMT...
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
peter <peter.noreply@tin.it> writes:
On Sat, 24 Jan 2026 16:47:16 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
The code for ROT and 2SWAP does not use auto-vectorization, and the
code for 2! uses auto-vectorization in a way that reduces the
instruction count:
-O3 (auto-vectorized) -O3 -fno-tree-vectorize
add $0x8,%rbx add $0x8,%rbx
movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
add $0x18,%r13 mov 0x8(%r13),%rdx
movhps -0x8(%r13),%xmm0 add $0x18,%r13
movups %xmm0,(%r8) mov %rdx,(%r8)
mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
mov (%rbx),%rax mov 0x0(%r13),%r8
jmp *%rax mov (%rbx),%rax
jmp *%rax
UNS64 *tmp64 = (UNS64*)TOP;
UNS64 d0=sp[0];
UNS64 d1=sp[1];
tmp64[0] = d0;
tmp64[1] = d1;
TOP = sp[2];
sp += 3;
made the compiler (clang-21 in this case) generate the expected code
typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
I'll have to remember the aligned attribute for future games with gcc
explicit vectorization.
Without that, it will generate instructions that need 16-byte alignment.
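A small sketch of the point (hypothetical helper name, not code from
either VM): with aligned(8) on the vector type, gcc and clang emit
unaligned 16-byte stores (movups) for the cast access; without it the
type carries 16-byte alignment, and the compiler may emit movaps, which
faults on a cell that is only 8-byte aligned.

```c
#include <assert.h>

typedef unsigned long long u64;

/* aligned(8) relaxes the vector type's alignment requirement so that
   casting a pointer to an 8-byte-aligned cell to v2u64* is legal, and
   the compiled store is an unaligned one (movups on x86-64). */
typedef u64 v2u64 __attribute__((vector_size(16), aligned(8)));

/* Store two cells with a single 16-byte store, as in the 2! primitive. */
void store2(u64 *dst, u64 a, u64 b) {
    v2u64 v = { a, b };
    *(v2u64 *)dst = v;
}
```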
peter <peter.noreply@tin.it> writes:
On Mon, 26 Jan 2026 19:24:43 GMT...
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
peter <peter.noreply@tin.it> writes:
On Sat, 24 Jan 2026 16:47:16 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
The code for ROT and 2SWAP does not use auto-vectorization, and the
code for 2! uses auto-vectorization in a way that reduces the
instruction count:
-O3 (auto-vectorized) -O3 -fno-tree-vectorize
add $0x8,%rbx add $0x8,%rbx
movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
add $0x18,%r13 mov 0x8(%r13),%rdx
movhps -0x8(%r13),%xmm0 add $0x18,%r13
movups %xmm0,(%r8) mov %rdx,(%r8)
mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
mov (%rbx),%rax mov 0x0(%r13),%r8
jmp *%rax mov (%rbx),%rax
jmp *%rax
UNS64 *tmp64 = (UNS64*)TOP;
UNS64 d0=sp[0];
UNS64 d1=sp[1];
tmp64[0] = d0;
tmp64[1] = d1;
TOP = sp[2];
sp += 3;
made the compiler (clang-21 in this case) generate the expected code
The auto-vectorized implementation of 2! above should perform ok,
because it loads each stack item separately, and the wide movups is
only used for the stores. If there is a wide load from the stack
involved, I expect a significant slowdown, because the stack items
usually have been stored recently, and narrow-store-to-wide-load
forwarding is a slow path on recent (and presumably also older) CPU
cores: https://www.complang.tuwien.ac.at/anton/stwlf/
typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
I'll have to remember the aligned attribute for future games with gcc
explicit vectorization.
Without that it will generate the opcodes that needs 16 byte alignment
Yes. Until now I worked around that by using memcpy to a vector
variable, but this approach is much more convenient.
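The memcpy workaround mentioned above can be sketched like this
(hypothetical helper name): because memcpy carries no alignment
assumption, the compiler emits an unaligned store without any aligned
attribute on the vector type, and for a known 16-byte size it compiles
to the same single store rather than an actual library call.

```c
#include <assert.h>
#include <string.h>

typedef unsigned long long u64;
typedef u64 v2u64 __attribute__((vector_size(16)));  /* default 16-byte alignment */

/* memcpy-based workaround: build the value in a vector variable and
   copy it out; no alignment assumption is attached to the destination. */
void store2_memcpy(u64 *dst, u64 a, u64 b) {
    v2u64 v = { a, b };
    memcpy(dst, &v, sizeof v);
}
```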
- anton
I always wonder: is this relevant to the industrial applications of
Gforth, or to Gforth-based programs that are sold commercially?
peter <peter.noreply@tin.it> writes:
On Sat, 24 Jan 2026 16:47:16 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I have now also tried it with gcc-14.2, and that produces better code.
Results from a Xeon E-2388G (Rocket Lake):
sieve bubble matrix fib fft gcc options
0.032 0.032 0.015 0.037 0.014 -O2
0.035 0.032 0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
0.033 0.034 0.016 0.032 0.014 -O3 (with auto vectorization)
The code for ROT and 2SWAP does not use auto-vectorization, and the
code for 2! uses auto-vectorization in a way that reduces the
instruction count:
-O3 (auto-vectorized) -O3 -fno-tree-vectorize
add $0x8,%rbx add $0x8,%rbx
movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
add $0x18,%r13 mov 0x8(%r13),%rdx
movhps -0x8(%r13),%xmm0 add $0x18,%r13
movups %xmm0,(%r8) mov %rdx,(%r8)
mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
mov (%rbx),%rax mov 0x0(%r13),%r8
jmp *%rax mov (%rbx),%rax
jmp *%rax
And the common tail with all these move instructions is gone.
- anton
What does your C code look like? I could not get clang or gcc to
auto-vectorize with my existing code:
UNS64 *tmp64 = (UNS64*)TOP;
tmp64[0] = sp[0];
tmp64[1] = sp[1];
TOP = sp[2];
sp += 3;
Gforth's source code for 2! is:
2! ( w1 w2 a_addr -- ) core two_store
""Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell.""
a_addr[0] = w2;
a_addr[1] = w1;
A generator produces the following from that, which is passed to gcc:
LABEL(two_store) /* 2! ( w1 w2 a_addr -- ) S1 -- S1 */
/* Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell. */
NAME("2!")
ip += 1;
LABEL1(two_store)
{
DEF_CA
MAYBE_UNUSED Cell w1;
MAYBE_UNUSED Cell w2;
MAYBE_UNUSED Cell * a_addr;
NEXT_P0;
vm_Cell2w(sp[2],w1);
vm_Cell2w(sp[1],w2);
vm_Cell2a_(spTOS,a_addr);
#ifdef VM_DEBUG
if (vm_debug) {
fputs(" w1=", vm_out); printarg_w(w1);
fputs(" w2=", vm_out); printarg_w(w2);
fputs(" a_addr=", vm_out); printarg_a_(a_addr);
}
#endif
sp += 3;
{
#line 1815 "prim"
a_addr[0] = w2;
a_addr[1] = w1;
#line 10136 "prim-fast.i"
}
#ifdef VM_DEBUG
if (vm_debug) {
fputs(" -- ", vm_out); fputc('\n', vm_out);
}
#endif
NEXT_P1;
spTOS = sp[0];
LABEL2(two_store)
NAME1("l2-two_store")
NEXT_P1_5;
LABEL3(two_store)
NAME1("l3-two_store")
DO_GOTO;
}
There are a lot of macros in this code, and I fear that expanding them
makes the code even less readable, but the essence for the
auto-vectorized part is something like:
w1 = sp[2];
w2 = sp[1];
a_addr = spTOS;
sp += 3;
a_addr[0] = w2;
a_addr[1] = w1;
spTOS = sp[0];
My guess is that in your code the compiler expected that sp[1] might
alias with tmp64[0], and therefore did not vectorize the loads and the stores, whereas in the Gforth code, the loads both happen first, and
then the two stores, and gcc can vectorize that. I doubt that there
is a big benefit from that, though.
typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));
I'll have to remember the aligned attribute for future games with gcc explicit vectorization.
- anton