• EuroForth 2025 preliminary proceedings

    From dxf@dxforth@gmail.com to comp.lang.forth on Thu Jan 15 17:41:04 2026
    From Newsgroup: comp.lang.forth

As I had trouble finding it, perhaps others did too. Here's the link:

    http://www.euroforth.org/ef25/papers/

    There is no link from the main page.

    Someone had referenced Nick Nelson's 'Forth 2025' paper and I was curious
    to read it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jan 15 12:04:13 2026
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    As I had trouble finding it, perhaps others too. Here's the link:

    http://www.euroforth.org/ef25/papers/

    There is no link from the main page.

    Thank you. As it happens, yesterday I created the post-conference
    proceedings that includes a late paper and the slides that were
    provided by their authors (not that many; apparently many authors are
    content with the prospect of their presentation being preserved on
    video). I have now updated various links for the post-conference
    state (link from www.euroforth.org to proceedings, and from the
    proceedings to the euro.theforth.net page).

    Unfortunately, the videos are not yet available. Gerald Wodni has not
    yet had the time to process them. He mentioned something like "after
    January" or somesuch.

    I think that submitting slides has not just the advantage that they
    are published earlier in this case (or at all in the 2023 case, where
    the audio was so problematic that most videos were not published), but
    also that one can read them faster than watch a video; the videos have
    the audio track and interactive demos in addition to the text and
    graphics of the slides, though.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Fri Jan 16 15:25:22 2026
    From Newsgroup: comp.lang.forth

    On 15-01-2026 13:04, Anton Ertl wrote:

    A few observations concerning the IMHO most interesting paper,
    "Code-Copying Compilation in Production":

    1. Code copying indeed makes a big difference, overall I estimate about
    twice as fast;
    2. The performance of VFX Forth continues to impress me, keeping up
    nicely, even with C compiled code;
    3. Commercial compilers (partly) using conventional compilers (see TF,
    fig. 4.7) - that was new to me;
    4. GCC -O1 outperforming GCC -O3 on some benchmarks. That's new to me
    too. I might experiment with that one;
5. I added GCC extension support to 4tH in version 3.62.0. At the time,
it improved performance by about 25%. By accident I found out that this
was no longer true: the switch()-based version was faster. I didn't know
GCC had changed in that regard.

    Hans Bezemer

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jan 16 17:38:03 2026
    From Newsgroup: comp.lang.forth

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    On 15-01-2026 13:04, Anton Ertl wrote:

    A few observations concerning the IMHO most interesting paper,
    "Code-Copying Compilation in Production":
    ...
    3. Commercial compilers (partly) using conventional compilers (see TF,
    fig. 4.7) - that was new to me;

    All Forth compilers I know work at the text interpretation level as
    the "Forth compiler" of Thinking Forth, Figure 4.7.

    4. GCC -O1 outperforming GCC -O3 on some benchmarks. That's new to me
    too. I might experiment with that one;

    I have analyzed it for bubblesort. There the problem is that gcc -O3 auto-vectorizes the pair of loads and the pair of stores (when the two
    elements are swapped). As a result, if a pair is stored in one
    iteration, the next iteration loads a pair that overlaps the
    previously stored pair. This means that the hardware cannot use its
    fast path in store-to-load forwarding, and leads to a huge slowdown.
    For a benchmark that has been around for over 40 years.

    In addition, the code generated by gcc -O3 also executes several
additional instructions per iteration, so I doubt that it would be
    faster even if the store-to-load forwarding problem did not exist.

    For fib, I have also looked at the generated code, but have not
    understood it well enough to see why the code generated by gcc -O3 is
    slower.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Fri Jan 16 23:10:24 2026
    From Newsgroup: comp.lang.forth

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    5. I added GCC extension support to 4tH in version 3.62.0. At the
    time, it improved performance by about 25%. By accident I found out
    that was no longer true. switch() based was faster. I didn't know
    there had been changes in that regard to GCC.

    If you mean the goto *a feature, these days you might try using tail
    calls instead. GCC and LLVM both now support a musttail attribute that
    ensures this optimization, or signals a compile-time error if it can't.

    https://lwn.net/Articles/1033373/
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sat Jan 17 16:58:24 2026
    From Newsgroup: comp.lang.forth

    On 17-01-2026 08:10, Paul Rubin wrote:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    5. I added GCC extension support to 4tH in version 3.62.0. At the
    time, it improved performance by about 25%. By accident I found out
    that was no longer true. switch() based was faster. I didn't know
    there had been changes in that regard to GCC.

    If you mean the goto *a feature, these days you might try using tail
    calls instead. GCC and LLVM both now support a musttail attribute that ensures this optimization, or signals a compile-time error if it can't.

    https://lwn.net/Articles/1033373/

Thanks for the article! But in contrast to the Python interpreter, you
could (thanks to some preprocessor magic) select how 4tH's VM would be
compiled with NO changes to the source code whatsoever. That's why it
could be reversed so easily by accident.

The tail call method, however, requires an entirely different VM. That's
a lot of work for about a 10% performance improvement - one that may not
even survive a single GCC update. And it requires two VMs to maintain.

    So, I have to contemplate this carefully before putting work in it. But
    it's nice to know that I was not crazy noticing this ;-) And learning
    about a new GCC technique. :)

    Hans Bezemer

  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Sat Jan 17 20:21:28 2026
    From Newsgroup: comp.lang.forth

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    The tail call method however, requires an entirely different
    VM. That's a lot of work for about 10% performance improvement - that
    may not even last for a single GCC update. And requires two VM's to maintain..

    You'd have to change the VM but on the other hand, it's a documented and supported feature of both GCC and Clang, and other compilers might get
    it too. I wouldn't worry about it vanishing with the next GCC update.
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sun Jan 18 15:26:23 2026
    From Newsgroup: comp.lang.forth

    On 18-01-2026 05:21, Paul Rubin wrote:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    The tail call method however, requires an entirely different
    VM. That's a lot of work for about 10% performance improvement - that
    may not even last for a single GCC update. And requires two VM's to
    maintain..

    You'd have to change the VM but on the other hand, it's a documented and supported feature of both GCC and Clang, and other compilers might get
    it too. I wouldn't worry about it vanishing with the next GCC update.

Well, the "goto" feature hasn't disappeared either. It's just been
nullified. Rendered useless. That's what I mean. And again? 10%? Really?

    Hans Bezemer
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Jan 18 16:34:50 2026
    From Newsgroup: comp.lang.forth

    In article <87wm1gpvdr.fsf@nightsong.com>,
    Paul Rubin <no.email@nospam.invalid> wrote:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    5. I added GCC extension support to 4tH in version 3.62.0. At the
    time, it improved performance by about 25%. By accident I found out
    that was no longer true. switch() based was faster. I didn't know
    there had been changes in that regard to GCC.

    If you mean the goto *a feature, these days you might try using tail
calls instead. GCC and LLVM both now support a musttail attribute that
ensures this optimization, or signals a compile-time error if it can't.

    https://lwn.net/Articles/1033373/

If you pass an address a as a tail call, is it approximately equal
to coroutines:

    : HEX: R> BASE @ >R >R HEX CO R> BASE ! ;

    Used for example as

    : .H HEX: . ;

    In this case the tail call is `` R> BASE ! '' to restore the base?

    Groetjes Albert
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Jan 18 22:17:45 2026
    From Newsgroup: comp.lang.forth

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    On 17-01-2026 08:10, Paul Rubin wrote:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    5. I added GCC extension support to 4tH in version 3.62.0. At the
    time, it improved performance by about 25%. By accident I found out
    that was no longer true. switch() based was faster. I didn't know
    there had been changes in that regard to GCC.

    You would have to look at the generated code. Which gcc version did
    you use? Certainly in my results on <http://www.complang.tuwien.ac.at/forth/threading/> switch usually is
    slower than direct or indirect threaded code.

    The tail call method however, requires an entirely different VM.

    It's also just a question of defining some macros appropriately.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
  • From peter@peter.noreply@tin.it to comp.lang.forth on Mon Jan 19 23:26:35 2026
    From Newsgroup: comp.lang.forth

    On Fri, 16 Jan 2026 23:10:24 -0800
    Paul Rubin <no.email@nospam.invalid> wrote:

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    5. I added GCC extension support to 4tH in version 3.62.0. At the
    time, it improved performance by about 25%. By accident I found out
    that was no longer true. switch() based was faster. I didn't know
    there had been changes in that regard to GCC.

    If you mean the goto *a feature, these days you might try using tail
    calls instead. GCC and LLVM both now support a musttail attribute that ensures this optimization, or signals a compile-time error if it can't.

    https://lwn.net/Articles/1033373/

I got interested in understanding how tail calls could improve on
computed gotos. So I took the first five "opcodes" from the VM in
NTF64/LXF64 to compare the generated asm.
The VM was written from the beginning in X64 assembler (13 years ago).
4 years ago I also implemented the VM in C to simplify porting to ARM64.
At that time the asm version was about 10% faster than the generated
C code; today the speed is about the same. C compilers have improved.
It was implemented using computed gotos, using the following macro
as the nesting code ending each "opcode":

    #define RELOAD() code=*ip++; goto *jmp_table[code]

    for the tail call version it was changed to

#define RELOAD() opcode func=(opcode)tbl[*ip++]; __attribute__((musttail))
    return func(ip, tbl, TOP, FTOP, sp, rp, fp, lp)

(line broken to be readable)

The noop "opcode" has just the nesting and produces the following code

    movzx r9d, byte ptr [rcx]
    inc rcx
    jmp qword ptr [rax + 8*r9]

    and for the tailcall version

    movzx eax, byte ptr [r12]
    inc r12
    mov rax, qword ptr [r13 + 8*rax]
    rex64 jmp rax

    both compiled with
    clang -S -Wall -O2 -masm=intel -o vm8test3.asm vm8tail.c

    As I suspected the code is practically identical!

It also turns out that the musttail attribute is not necessary.
It will generate a tailcall anyway. The difference is that with
musttail it will report an error if it cannot do the tailcall.

Much more important is the __attribute__((preserve_none)) before
each function. This indicates that more registers will be used to pass
parameters. As seen above I pass 8 parameters to each function and
they need to be in registers to match the assembler-written code.
    This is done automatically in the goto version as everything is in
    one function there.

    In the end it is more how you like to write your VM, as one function
    or one for each "opcode".

    Unfortunately GCC does not recognize preserve_none and uses the stack
    for some parameters

    Here is my test code

    // VM8 C variant using computed goto

    #include <stdint.h>

    #define UNS8 unsigned char
    #define INT64 long long int
    #define UNS64 unsigned long long int

    #define RELOAD() code=*ip++; goto *jmp_table[code]

    void VM8(UNS8 *ip, UNS64 *sp, UNS64 *rp, double *fp, UNS64 *lp ) {

    const static void* jmp_table[] = {
    &&noop,
    &&swap,
    &&rot,
    &&eqzero,
    &&negate,
    };

    UNS8 code=*ip;
    UNS64 tmp;
    UNS64 TOP=*sp++;
    // double FTOP=*fp++;

    RELOAD();


    noop: // do nothing
    RELOAD();
    swap: // swap
    tmp=sp[0];
    sp[0]=TOP;
    TOP=tmp;
    RELOAD();
    rot: // rot
    tmp=TOP;
    TOP=sp[1];
    sp[1]=sp[0];
    sp[0]=tmp;
    RELOAD();
    eqzero: // 0=
    TOP=-(TOP==0);
    RELOAD();
    negate: // negate
    TOP=-TOP;
    RELOAD();


    } //vm8


    And here is the tail call version. Sorry for the long lines!

    // VM8 C variant using tailcalls

    #include <stdint.h>

    #define UNS8 unsigned char
    #define INT64 long long int
    #define UNS64 unsigned long long int


    typedef __attribute__((preserve_none)) void (*opcode) (UNS8*, UNS64*, UNS64, double, UNS64*, UNS64*, double*, UNS64*);

    #define RELOAD() opcode func=(opcode)tbl[*ip++]; __attribute__((musttail)) return func(ip, tbl, TOP, FTOP, sp, rp, fp, lp)

    #define FUNC __attribute__((preserve_none)) void

    FUNC noop(UNS8 *ip, UNS64 *tbl, UNS64 TOP, double FTOP, UNS64 *sp, UNS64 *rp, double *fp, UNS64 *lp ) // do nothing
    {
    RELOAD();
    }

    FUNC swap(UNS8 *ip, UNS64 *tbl, UNS64 TOP, double FTOP, UNS64 *sp, UNS64 *rp, double *fp, UNS64 *lp ) // swap
    {UNS64 tmp;
    tmp=sp[0];
    sp[0]=TOP;
    TOP=tmp;
    RELOAD();}

    FUNC rot(UNS8 *ip, UNS64 *tbl, UNS64 TOP, double FTOP, UNS64 *sp, UNS64 *rp, double *fp, UNS64 *lp ) // rot
    {UNS64 tmp=TOP;
    TOP=sp[1];
    sp[1]=sp[0];
    sp[0]=tmp;
    RELOAD();}

    FUNC eqzero(UNS8 *ip, UNS64 *tbl, UNS64 TOP, double FTOP, UNS64 *sp, UNS64 *rp, double *fp, UNS64 *lp ) // 0=
    {TOP=-(TOP==0);
    RELOAD();}

    FUNC negate(UNS8 *ip, UNS64 *tbl, UNS64 TOP, double FTOP, UNS64 *sp, UNS64 *rp, double *fp, UNS64 *lp ) // negate
    {TOP=-TOP;
    RELOAD();}

    opcode jmp_table[]={
    noop,
    swap,
    rot,
    eqzero,
    negate,
    };



    void VM8(UNS8 *ip, UNS64 *sp, UNS64 *rp, double *fp, UNS64 *lp ) {


    UNS64 *tbl=(UNS64*)&jmp_table;
    UNS64 TOP=*sp++;
    double FTOP=*fp++;


    opcode func=(opcode)tbl[*ip++];
    func( ip, tbl, TOP, FTOP, sp, rp, fp, lp);

    }

    //vm8

    BR
    Peter




  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Mon Jan 19 15:22:07 2026
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    for the tail call version it was changed to

    RELOAD() opcode func=(opcode)tbl[*ip++]; __attribute__((musttail))
    return func(ip, tbl, TOP, FTOP, sp, rp, fp, lp)


    You could possibly use "inline RELOAD() { .... ;}" instead of the macro.

    and for the tailcall version
    mov rax, qword ptr [r13 + 8*rax]
    rex64 jmp rax

    I wonder why the tailcall version didn't combine the mov with the jmp
    like the other version did.

It also turns out that the musttail attribute is not necessary.
It will generate a tailcall anyway. The difference is that with
musttail it will report an error if it cannot do the tailcall.

    Yes, TCO has been present since the beginning but it's been
    opportunistic rather than something you can rely on.

    Unfortunately GCC does not recognize preserve_none and uses the stack
    for some parameters

    Oh that's interesting. I half remember there being some other feature
    for that, but who knows. Does -fwhole-program help?
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Tue Jan 20 00:33:00 2026
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    // VM8 C variant using tailcalls ...
    #define FUNC __attribute__((preserve_none)) void

    Can you add static to that? It stops the symbol from being exported, so
    the compiler can omit the function call sequence when appropriate.

    I think -fwhole-program isn't likely to work so it wasn't a helpful
    suggestion, sorry.
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Tue Jan 20 00:35:35 2026
    From Newsgroup: comp.lang.forth

    albert@spenarnc.xs4all.nl writes:
    If you pass an address a as a tail call is it approximately equal
    to coroutines:

    No I don't think so. The tail call is just a jump to that address
    (changes the program counter). A coroutine jump also has to change the
    stack pointer. See the section "Knuth's coroutines" here:

    https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

    Some Forths have a CO primitive that I think is similar. There is
    something like it on the Greenarrays processor.
  • From peter@peter.noreply@tin.it to comp.lang.forth on Tue Jan 20 10:44:40 2026
    From Newsgroup: comp.lang.forth

    On Mon, 19 Jan 2026 15:22:07 -0800
    Paul Rubin <no.email@nospam.invalid> wrote:

    peter <peter.noreply@tin.it> writes:
    for the tail call version it was changed to

    RELOAD() opcode func=(opcode)tbl[*ip++]; __attribute__((musttail))
    return func(ip, tbl, TOP, FTOP, sp, rp, fp, lp)


    You could possibly use "inline RELOAD() { .... ;}" instead of the macro.

No, that did not work. I also do not want the compiler to mess with this.
The pre-processor expansion does exactly what I want.
Musttail requires the parameters to match exactly on the incoming and
outgoing calls, as if it were a recursive call.



    and for the tailcall version
    mov rax, qword ptr [r13 + 8*rax]
    rex64 jmp rax

    I wonder why the tailcall version didn't combine the mov with the jmp
    like the other version did.

I also wonder about that! From my previous testing it does not make a
difference speed-wise.


It also turns out that the musttail attribute is not necessary.
It will generate a tailcall anyway. The difference is that with
musttail it will report an error if it cannot do the tailcall.

    Yes, TCO has been present since the beginning but it's been
    opportunistic rather than something you can rely on.

    Unfortunately GCC does not recognize preserve_none and uses the stack
    for some parameters

It looks like it recognizes it but chooses to ignore it.
That is what the warning messages say.


    Oh that's interesting. I half remember there being some other feature
    for that, but who knows. Does -fwhole-program help?

I will for sure continue to use the computed goto in the future.
The complete VM8 function containing 157 opcodes is about 1200
lines of code; 255 of those are for the function array. That leaves
945 lines for 157 opcodes, about 6 lines per opcode!

    BR
    Peter

  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Jan 20 12:12:44 2026
    From Newsgroup: comp.lang.forth

    In article <87bjioptpk.fsf@nightsong.com>,
    Paul Rubin <no.email@nospam.invalid> wrote:
    albert@spenarnc.xs4all.nl writes:
    If you pass an address a as a tail call is it approximately equal
    to coroutines:

    No I don't think so. The tail call is just a jump to that address
    (changes the program counter). A coroutine jump also has to change the
    stack pointer. See the section "Knuth's coroutines" here:

    Which stack pointer do you mean? The data stack pointer or the
    return stack pointer?
Where is the program to continue after performing the tail call?
Probably the same place as if the tail call were not present.

Pushing an address on the return stack, then continuing to interpret,
is tantamount to a jump.


    https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

    Some Forths have a CO primitive that I think is similar. There is
    something like it on the Greenarrays processor.

You can see the CO primitive used in the example. The CO name is
originally mine; Chuck Moore uses ;: - which is not to imply they are
exactly the same.

    Groetjes Albert
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Jan 20 22:17:45 2026
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
The VM was written from the beginning in X64 assembler (13 years ago).
4 years ago I also implemented the VM in C to simplify porting to ARM64.
At that time the asm version was about 10% faster than the generated
C code; today the speed is about the same. C compilers have improved.

    My impression is that they have not really improved in decades, for
    the kinds of code in Gforth (and, I guess NTF64/LXF64). Except that
    we can now use one or two registers more on AMD64 if we do things
    right.

Much more important is the __attribute__((preserve_none)) before
each function. This indicates that more registers will be used to pass
parameters. As seen above I pass 8 parameters to each function and
they need to be in registers to match the assembler-written code.
    This is done automatically in the goto version as everything is in
    one function there.

    In the end it is more how you like to write your VM, as one function
    or one for each "opcode".

    Unfortunately GCC does not recognize preserve_none and uses the stack
    for some parameters

    With gcc you can use explicit register variables instead.

In any case, the tail-calling technique is not as portable as I would
like. Depending on the compiler and ABI/architecture, one wants to
pass more or fewer VM registers as parameters, and maybe deal with the
rest with explicit register variables. Sure, we can find ways to
parameterize this stuff so that the main body of the code does not see
the difference, but on a new architecture it will, out of the box,
either not work or require a lot of work.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Jan 20 22:36:05 2026
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    peter <peter.noreply@tin.it> writes:
    and for the tailcall version
    mov rax, qword ptr [r13 + 8*rax]
    rex64 jmp rax

    I wonder why the tailcall version didn't combine the mov with the jmp
    like the other version did.

    I always wonder about that when I see the code generated by gcc for
    goto *. With any gcc since 3.0, I see direct-threaded dispatch
    compiled to code like:

    add $0x8,%rbx
    mov (%rbx),%rax
    jmp *%rax

GCC-2.95 and earlier knew how to combine the last two instructions into

    jmp (%rbx)

    Yes, TCO has been present since the beginning but it's been
    opportunistic rather than something you can rely on.

    I tried to use tail-call optimization for threaded-code dispatch in
    gcc in 1995, and even described it as theoretical possibility
    [ertl95pldi], but gcc of that time did not tail-call optimize code
    like that shown by Peter.

    @InProceedings{ertl95pldi,
    author = "M. Anton Ertl",
    title = "Stack Caching for Interpreters",
    booktitle = "SIGPLAN Conference on Programming Language
    Design and Implementation (PLDI'95)",
    year = "1995",
    crossref = "sigplan95",
    pages = "315--327",
    url = "https://www.complang.tuwien.ac.at/papers/ertl95pldi.ps.gz",
    abstract = "An interpreter can spend a significant part of its
    execution time on arguments of virtual machine
    instructions. This paper explores two methods to
    reduce this overhead for virtual stack machines by
    caching top-of-stack values in (real machine)
    registers. The {\em dynamic method} is based on
    having, for every possible state of the cache, one
    specialized version of the whole interpreter; the
    execution of an instruction usually changes the
    state of the cache and the next instruction is
    executed in the version corresponding to the new
    state. In the {\em static method} a state machine
    that keeps track of the cache state is added to the
    compiler. Common instructions exist in specialized
    versions for several states, but it is not necessary
    to have a version of every instruction for every
    cache state. Stack manipulation instructions are
    optimized away."
    }

    Unfortunately GCC does not recognize preserve_none and uses the stack
    for some parameters

    Oh that's interesting. I half remember there being some other feature
    for that, but who knows.

    Explicit register variables.

    Does -fwhole-program help?

    Unlikely. How should it?

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Thu Jan 22 16:51:13 2026
    From Newsgroup: comp.lang.forth

    On 16-01-2026 18:38, Anton Ertl wrote:

    On 17-01-2026 16:58, Hans Bezemer wrote:

I've done my thing and compiled 4tH with optimization levels from -O3
down to -O0. I thought, let's make this simple and execute ALL the
benchmarks I have. Some of them have become useless, though, for the
simple reason that hardware has become that much better.

But still, here it is. Overall, performance deteriorates consistently
as the optimization level drops, i.e. -O3 gives the best performance.
There are a few minor glitches, some due to random benchmark data.

    For those curious, this is a European CSV with all the data. BTW, you
    can find all benchmarks here: https://sourceforge.net/p/forth-4th/code/HEAD/tree/trunk/4th.src/bench/

    Hans Bezemer

    ---8<---
    Benchmark;-O3;-O2;-O1;-O0
    bench.4th;6.79;6.36;6.68;6.33
    benchm.4th;1.21;1.66;1.86;2.8
    benchxls.4th;0.06;0.08;0.08;0.12
    bubble.4th;0.69;0.95;0.96;1.72
    bytesiev.4th;0.01;0.01;0.01;0.02
    countbit.4th;3.52;4.76;5.02;8.01
    cowell.4th;15.15;20.2;18.91;31.29
    fib.4th;0.79;1.02;1.02;1.72
    isortest.4th;0.23;0.33;0.31;0.56
    matrix.4th;0.22;0.31;0.3;0.51
    misty.4th;0.58;0.84;1.01;1.59
    pforth.4th;10.47;13.55;14.42;22.68
    prims.4th;5.96;8;8.59;14.28
    simple.4th;0.5;0.7;0.82;1.21
    sortest.4th;140.96;163.68;150.17;270.87
    thread.4th;0.35;0.41;0.49;0.7
    ---8<---

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    On 15-01-2026 13:04, Anton Ertl wrote:

    A few observations concerning the IMHO most interesting paper,
    "Code-Copying Compilation in Production":
    ...
    3. Commercial compilers (partly) using conventional compilers (see TF,
    fig. 4.7) - that was new to me;

    All Forth compilers I know work at the text interpretation level as
    the "Forth compiler" of Thinking Forth, Figure 4.7.

    4. GCC -O1 outperforming GCC -O3 on some benchmarks. That's new to me
    too. I might experiment with that one;

    I have analyzed it for bubble sort. There the problem is that gcc -O3
    auto-vectorizes the pair of loads and the pair of stores (when the two
    elements are swapped). As a result, if a pair is stored in one
    iteration, the next iteration loads a pair that overlaps the
    previously stored pair. This means that the hardware cannot use its
    fast path in store-to-load forwarding, which leads to a huge slowdown,
    and that for a benchmark that has been around for over 40 years.

    In addition, the code generated by gcc -O3 also executes several
    additional instructions per iteration, so I doubt that it would be
    faster even if the store-to-load forwarding problem did not exist.

    For fib, I have also looked at the generated code, but have not
    understood it well enough to see why the code generated by gcc -O3 is
    slower.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Jan 24 11:28:30 2026
    From Newsgroup: comp.lang.forth

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    I've done my thing, compiled 4tH with optimizations -O3 till -O0.
    I thought, let's make this simple and execute ALL benchmarks I got. Some
    of them have become useless, though for the simple reason hardware has
    become that much better.

    But still, here it is. Overall, the performance consistently
    deteriorates, aka -O3 gives the best performance.

    Which compiler and which hardware?

    For a random program, I would expect higher optimization levels to
    produce faster code. For a Forth system and these recent gccs, the auto-vectorization of adjacent memory accesses may lead to similar
    problems as in the C bubble-sort benchmark. In Gforth, this actually
    happens unless we disable vectorization (which we normally do), and,
    moreover, with the vectorized code, gcc introduces additional
    inefficiencies (see below).

    Here's the output of ./gforth-fast onebench.fs compiled from the
    current development version with gcc-12.2 and running on a Ryzen 5800X
    (numbers are times, lower is better):

    sieve bubble matrix fib fft gcc options
    0.025 0.023 0.013 0.033 0.016 -O2
    0.025 0.023 0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)
    0.404 0.418 0.377 0.472 0.244 -O3 (with auto vectorization)
    0.145 0.122 0.124 0.122 0.073 gforth default, using --no-dynamic

    So how is the code different? Here's the code for ROT:


    -O3 (auto-vectorized) -O3 -fno-tree-vec... -O2
    add $0x8,%rbx add $0x8,%rbx add $0x8,%rbx
    movq 0x8(%r10),%xmm1 mov 0x8(%r10),%rdx mov 0x8(%r10),%rdx
    mov 0x10(%r10),%rcx mov 0x10(%r10),%rax mov 0x10(%r10),%rax
    punpcklqdq %xmm1,%xmm1 mov %r13,0x8(%r10) mov %r13,0x8(%r10)
    punpckhqdq %xmm1,%xmm0 mov %rdx,0x10(%r10) mov %rdx,0x10(%r10)
    movups %xmm0,0x8(%r10) mov %rax,%r13 mov %rax,%r13
    mov (%rbx),%rax mov (%rbx),%rax mov (%rbx),%rax
    mov %r14,0x8(%rsp) jmp *%rax jmp *%rax
    mov %rax,%r11
    mov %r15,%r9
    mov %rcx,0x10(%rsp)
    jmp 0x55bff2a58a99

    So in this case -O3 without auto-vectorization generates the same code
    as -O2. Auto-vectorization, OTOH, replaces

    mov 0x8(%r10),%rdx
    mov 0x10(%r10),%rax

    with

    movq 0x8(%r10),%xmm1

    and then performs the rotation with the punpck instructions, finally
    storing two cells into memory with movups. For some reason it also
    separately loads 0x10(%r10) into %rcx (instead of extracting it from
    %xmm1), and eventually stores it to 0x10(%rsp), which seems to be one
    of the locations of the TOS.

    I expect that gcc's auto-vectorization will do similar things to
    primitives like ROT, 2!, and 2SWAP (all of which are hit in Gforth) in
    other Forth systems with a C substrate, because they all tend to
    access two (or more) adjacent cells.

    But the big hit with the auto-vectorized code is not these changes,
    but what happens at the end of the primitive: without
    auto-vectorization there is the indirect jump of the threaded-code
    dispatch, but with auto-vectorization it jumps to 0x55bff2a58a99:

    0x000055bff2a58a99 <gforth_engine2+153>: movq 0x8(%rsp),%xmm0
    0x000055bff2a58a9f <gforth_engine2+159>: movq %r9,%xmm1
    0x000055bff2a58aa4 <gforth_engine2+164>: movhps 0x8(%rsp),%xmm1
    0x000055bff2a58aa9 <gforth_engine2+169>: movhps 0x10(%rsp),%xmm0
    0x000055bff2a58aae <gforth_engine2+174>: movhlps %xmm0,%xmm5
    0x000055bff2a58ab1 <gforth_engine2+177>: movq %xmm0,%r14
    0x000055bff2a58ab6 <gforth_engine2+182>: movq %xmm1,%r15
    0x000055bff2a58abb <gforth_engine2+187>: movhps %xmm1,0x18(%rsp)
    0x000055bff2a58ac0 <gforth_engine2+192>: movq %xmm5,%r8
    0x000055bff2a58ac5 <gforth_engine2+197>: mov %r15,%rdi
    0x000055bff2a58ac8 <gforth_engine2+200>: mov %r14,%rsi
    0x000055bff2a58acb <gforth_engine2+203>: mov %r8,%rcx
    0x000055bff2a58ace <gforth_engine2+206>: jmp *%r11

    We can see here that, among other things, 0x10(%rsp) (the TOS) is
    loaded into %xmm0 and then moved through %xmm5 into %r8 and then %rcx,
    as well as through %r14 into %rsi, so at the end the TOS resides in all
    of those places. And I see that other primitives expect the TOS in some
    of those places, e.g. 1+:

    -O3 (auto-vectorized) -O3 -fno-tree-vec...
    add $0x8,%rbx add $0x8,%rbx
    lea 0x1(%r8),%rcx add $0x1,%r13
    mov (%rbx),%rax mov (%rbx),%rax
    mov %r14,0x8(%rsp) jmp *%rax
    mov %rax,%r11
    mov %r15,%r9
    mov %rcx,0x10(%rsp)
    jmp 0x55bff2a58a99

    Jumping to 0x55bff2a58a99 instead of performing an indirect jump
    disables dynamic native code generation in Gforth and all the
    optimizations that are based on it. You can see in the --no-dynamic
    line how much that costs. The remaining factor of 3 is probably due
    to the large number of additional instructions that are performed in
    the auto-vectorized engine.

    What is the 4tH code for ROT with -O2 and -O3?

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Jan 24 16:47:16 2026
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    I've done my thing, compiled 4tH with optimizations -O3 till -O0.
    I thought, let's make this simple and execute ALL benchmarks I got. Some
    of them have become useless, though for the simple reason hardware has
    become that much better.

    But still, here it is. Overall, the performance consistently
    deteriorates, aka -O3 gives the best performance.

    Which compiler and which hardware?

    For a random program, I would expect higher optimization levels to
    produce faster code. For a Forth system and these recent gccs, the
    auto-vectorization of adjacent memory accesses may lead to similar
    problems as in the C bubble-sort benchmark. In Gforth, this actually
    happens unless we disable vectorization (which we normally do), and,
    moreover, with the vectorized code, gcc introduces additional
    inefficiencies (see below).

    Here's the output of ./gforth-fast onebench.fs compiled from the
    current development version with gcc-12.2 and running on a Ryzen 5800X
    (numbers are times, lower is better):

    sieve bubble matrix fib fft gcc options
    0.025 0.023 0.013 0.033 0.016 -O2
    0.025 0.023 0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)
    0.404 0.418 0.377 0.472 0.244 -O3 (with auto vectorization)
    0.145 0.122 0.124 0.122 0.073 gforth default, using --no-dynamic

    I have now also tried it with gcc-14.2, and that produces better code.
    Results from a Xeon E-2388G (Rocket Lake):

    sieve bubble matrix fib fft gcc options
    0.032 0.032 0.015 0.037 0.014 -O2
    0.035 0.032 0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
    0.033 0.034 0.016 0.032 0.014 -O3 (with auto vectorization)

    The code for ROT and 2SWAP does not use auto-vectorization, and the
    code for 2! uses auto-vectorization in a way that reduces the
    instruction count:

    -O3 (auto-vectorized) -O3 -fno-tree-vectorize
    add $0x8,%rbx add $0x8,%rbx
    movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
    add $0x18,%r13 mov 0x8(%r13),%rdx
    movhps -0x8(%r13),%xmm0 add $0x18,%r13
    movups %xmm0,(%r8) mov %rdx,(%r8)
    mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
    mov (%rbx),%rax mov 0x0(%r13),%r8
    jmp *%rax mov (%rbx),%rax
    jmp *%rax

    And the common tail with all these move instructions is gone.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From peter@peter.noreply@tin.it to comp.lang.forth on Sun Jan 25 23:31:10 2026
    From Newsgroup: comp.lang.forth

    On Sat, 24 Jan 2026 16:47:16 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    I've done my thing, compiled 4tH with optimizations -O3 till -O0.
    I thought, let's make this simple and execute ALL benchmarks I got. Some
    of them have become useless, though for the simple reason hardware has
    become that much better.

    But still, here it is. Overall, the performance consistently
    deteriorates, aka -O3 gives the best performance.

    Which compiler and which hardware?

    For a random program, I would expect higher optimization levels to
    produce faster code. For a Forth system and these recent gccs, the
    auto-vectorization of adjacent memory accesses may lead to similar
    problems as in the C bubble-sort benchmark. In Gforth, this actually
    happens unless we disable vectorization (which we normally do), and,
    moreover, with the vectorized code, gcc introduces additional
    inefficiencies (see below).

    Here's the output of ./gforth-fast onebench.fs compiled from the
    current development version with gcc-12.2 and running on a Ryzen 5800X
    (numbers are times, lower is better):

    sieve bubble matrix fib fft gcc options
    0.025 0.023 0.013 0.033 0.016 -O2
    0.025 0.023 0.013 0.037 0.016 -O3 -fno-tree-vectorize (gforth default)
    0.404 0.418 0.377 0.472 0.244 -O3 (with auto vectorization)
    0.145 0.122 0.124 0.122 0.073 gforth default, using --no-dynamic

    I have now also tried it with gcc-14.2, and that produces better code.
    Results from a Xeon E-2388G (Rocket Lake):

    sieve bubble matrix fib fft gcc options
    0.032 0.032 0.015 0.037 0.014 -O2
    0.035 0.032 0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
    0.033 0.034 0.016 0.032 0.014 -O3 (with auto vectorization)

    The code for ROT and 2SWAP does not use auto-vectorization, and the
    code for 2! uses auto-vectorization in a way that reduces the
    instruction count:

    -O3 (auto-vectorized) -O3 -fno-tree-vectorize
    add $0x8,%rbx add $0x8,%rbx
    movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
    add $0x18,%r13 mov 0x8(%r13),%rdx
    movhps -0x8(%r13),%xmm0 add $0x18,%r13
    movups %xmm0,(%r8) mov %rdx,(%r8)
    mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
    mov (%rbx),%rax mov 0x0(%r13),%r8
    jmp *%rax mov (%rbx),%rax
    jmp *%rax

    And the common tail with all these move instructions is gone.

    - anton

    What does your C code look like? I could not get clang or gcc to auto-vectorize
    with my existing code:

    UNS64 *tmp64 = (UNS64*)TOP;
    tmp64[0] = sp[0];
    tmp64[1] = sp[1];
    TOP = sp[2];
    sp += 3;


    In the end I changed my code to tell the compiler that it is a vector with

    typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));

    and
    *(v2u64*)TOP = *(v2u64*)sp;
    TOP=sp[2];
    sp=sp+3;

    this will produce

    vmovups xmm0, xmmword ptr [rdx]
    vmovups xmmword ptr [r8], xmm0
    mov r8, qword ptr [rdx + 16]
    add rdx, 24

    movzx r9d, byte ptr [rcx] // nesting code
    inc rcx
    jmp qword ptr [rax + 8*r9]

    But also using memcpy((UNS64*)TOP, (UNS64*)sp,16); gives the same code!

    Looks like it also works on ARM64.
    BR
    Peter

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Jan 26 19:24:43 2026
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    On Sat, 24 Jan 2026 16:47:16 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    I have now also tried it with gcc-14.2, and that produces better code.
    Results from a Xeon E-2388G (Rocket Lake):

    sieve bubble matrix fib fft gcc options
    0.032 0.032 0.015 0.037 0.014 -O2
    0.035 0.032 0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
    0.033 0.034 0.016 0.032 0.014 -O3 (with auto vectorization)

    The code for ROT and 2SWAP does not use auto-vectorization, and the
    code for 2! uses auto-vectorization in a way that reduces the
    instruction count:

    -O3 (auto-vectorized) -O3 -fno-tree-vectorize
    add $0x8,%rbx add $0x8,%rbx
    movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
    add $0x18,%r13 mov 0x8(%r13),%rdx
    movhps -0x8(%r13),%xmm0 add $0x18,%r13
    movups %xmm0,(%r8) mov %rdx,(%r8)
    mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
    mov (%rbx),%rax mov 0x0(%r13),%r8
    jmp *%rax mov (%rbx),%rax
    jmp *%rax

    And the common tail with all these move instructions is gone.

    - anton

    What does your C code look like? I could not get clang or gcc to auto-vectorize
    with my existing code:

    UNS64 *tmp64 = (UNS64*)TOP;
    tmp64[0] = sp[0];
    tmp64[1] = sp[1];
    TOP = sp[2];
    sp += 3;

    Gforth's source code for 2! is:

    2! ( w1 w2 a_addr -- ) core two_store
    ""Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell."" a_addr[0] = w2;
    a_addr[1] = w1;

    A generator produces the following from that, which is passed to gcc:

    LABEL(two_store) /* 2! ( w1 w2 a_addr -- ) S1 -- S1 */
    /* Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell. */
    NAME("2!")
    ip += 1;
    LABEL1(two_store)
    {
    DEF_CA
    MAYBE_UNUSED Cell w1;
    MAYBE_UNUSED Cell w2;
    MAYBE_UNUSED Cell * a_addr;
    NEXT_P0;
    vm_Cell2w(sp[2],w1);
    vm_Cell2w(sp[1],w2);
    vm_Cell2a_(spTOS,a_addr);
    #ifdef VM_DEBUG
    if (vm_debug) {
    fputs(" w1=", vm_out); printarg_w(w1);
    fputs(" w2=", vm_out); printarg_w(w2);
    fputs(" a_addr=", vm_out); printarg_a_(a_addr);
    }
    #endif
    sp += 3;
    {
    #line 1815 "prim"
    a_addr[0] = w2;
    a_addr[1] = w1;
    #line 10136 "prim-fast.i"
    }

    #ifdef VM_DEBUG
    if (vm_debug) {
    fputs(" -- ", vm_out); fputc('\n', vm_out);
    }
    #endif
    NEXT_P1;
    spTOS = sp[0];
    LABEL2(two_store)
    NAME1("l2-two_store")
    NEXT_P1_5;
    LABEL3(two_store)
    NAME1("l3-two_store")
    DO_GOTO;
    }

    There are a lot of macros in this code, and I fear that expanding them
    makes the code even less readable, but the essence for the
    auto-vectorized part is something like:

    w1 = sp[2];
    w2 = sp[1];
    a_addr = spTOS;
    sp += 3;
    a_addr[0] = w2;
    a_addr[1] = w1;
    spTOS = sp[0];

    My guess is that in your code the compiler expected that sp[1] might
    alias with tmp64[0], and therefore did not vectorize the loads and the
    stores, whereas in the Gforth code, the loads both happen first, and
    then the two stores, and gcc can vectorize that. I doubt that there
    is a big benefit from that, though.

    typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));

    I'll have to remember the aligned attribute for future games with gcc
    explicit vectorization.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jan 29 18:27:12 2026
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    On Mon, 26 Jan 2026 19:24:43 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    peter <peter.noreply@tin.it> writes:
    On Sat, 24 Jan 2026 16:47:16 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    The code for ROT and 2SWAP does not use auto-vectorization, and the
    code for 2! uses auto-vectorization in a way that reduces the
    instruction count:

    -O3 (auto-vectorized) -O3 -fno-tree-vectorize
    add $0x8,%rbx add $0x8,%rbx
    movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
    add $0x18,%r13 mov 0x8(%r13),%rdx
    movhps -0x8(%r13),%xmm0 add $0x18,%r13
    movups %xmm0,(%r8) mov %rdx,(%r8)
    mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
    mov (%rbx),%rax mov 0x0(%r13),%r8
    jmp *%rax mov (%rbx),%rax
    jmp *%rax
    ...
    UNS64 *tmp64 = (UNS64*)TOP;
    UNS64 d0=sp[0];
    UNS64 d1=sp[1];
    tmp64[0] = d0;
    tmp64[1] = d1;
    TOP = sp[2];
    sp += 3;

    made the compiler (clang-21 in this case) generate the expected code

    The auto-vectorized implementation of 2! above should perform ok,
    because it loads each stack item separately, and the wide movups is
    only used for the stores. If there is a wide load from the stack
    involved, I expect a significant slowdown, because the stack items
    usually have been stored recently, and narrow-store-to-wide-load
    forwarding is a slow path on recent (and presumably also older) CPU
    cores: https://www.complang.tuwien.ac.at/anton/stwlf/

    typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));

    I'll have to remember the aligned attribute for future games with gcc
    explicit vectorization.

    Without that it will generate opcodes that need 16-byte alignment.

    Yes. Until now I worked around that by using memcpy to a vector
    variable, but this approach is much more convenient.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Fri Jan 30 13:20:35 2026
    From Newsgroup: comp.lang.forth

    In article <2026Jan29.192712@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    peter <peter.noreply@tin.it> writes:
    On Mon, 26 Jan 2026 19:24:43 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    peter <peter.noreply@tin.it> writes:
    On Sat, 24 Jan 2026 16:47:16 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    The code for ROT and 2SWAP does not use auto-vectorization, and the
    code for 2! uses auto-vectorization in a way that reduces the
    instruction count:

    -O3 (auto-vectorized) -O3 -fno-tree-vectorize
    add $0x8,%rbx add $0x8,%rbx
    movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
    add $0x18,%r13 mov 0x8(%r13),%rdx
    movhps -0x8(%r13),%xmm0 add $0x18,%r13
    movups %xmm0,(%r8) mov %rdx,(%r8)
    mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
    mov (%rbx),%rax mov 0x0(%r13),%r8
    jmp *%rax mov (%rbx),%rax
    jmp *%rax
    ...
    UNS64 *tmp64 = (UNS64*)TOP;
    UNS64 d0=sp[0];
    UNS64 d1=sp[1];
    tmp64[0] = d0;
    tmp64[1] = d1;
    TOP = sp[2];
    sp += 3;

    made the compiler (clang-21 in this case) generate the expected code

    The auto-vectorized implementation of 2! above should perform ok,
    because it loads each stack item separately, and the wide movups is
    only used for the stores. If there is a wide load from the stack
    involved, I expect a significant slowdown, because the stack items
    usually have been stored recently, and narrow-store-to-wide-load
    forwarding is a slow path on recent (and presumably also older) CPU
    cores: https://www.complang.tuwien.ac.at/anton/stwlf/

    typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));

    I'll have to remember the aligned attribute for future games with gcc
    explicit vectorization.

    Without that it will generate opcodes that need 16-byte alignment.

    Yes. Until now I worked around that by using memcpy to a vector
    variable, but this approach is much more convenient.

    I always wonder, is this relevant to the industrial applications of
    gforth or gforth based programs that are sold commercially?


    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jan 30 18:00:31 2026
    From Newsgroup: comp.lang.forth

    albert@spenarnc.xs4all.nl writes:
    I always wonder, is this relevant to the industrial applications of
    gforth or gforth based programs that are sold commercially?

    Are there any Gforth-based programs that are sold commercially?
    Concerning industrial applications, the only ones I know about have to
    do with Open Firmware, and I doubt that those care much about the
    performance of Gforth. But there are probably industrial applications
    (maybe even commercial programs) that use Gforth that I do not know
    about. If one of the IBM users had not contacted us when he left the
    group, I would not know about the application of Gforth within IBM.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From peter@peter.noreply@tin.it to comp.lang.forth on Tue Jan 27 15:44:55 2026
    From Newsgroup: comp.lang.forth

    On Mon, 26 Jan 2026 19:24:43 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    peter <peter.noreply@tin.it> writes:
    On Sat, 24 Jan 2026 16:47:16 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    I have now also tried it with gcc-14.2, and that produces better code.
    Results from a Xeon E-2388G (Rocket Lake):

    sieve bubble matrix fib fft gcc options
    0.032 0.032 0.015 0.037 0.014 -O2
    0.035 0.032 0.015 0.037 0.014 -O3 -fno-tree-vectorize (gforth default)
    0.033 0.034 0.016 0.032 0.014 -O3 (with auto vectorization)

    The code for ROT and 2SWAP does not use auto-vectorization, and the
    code for 2! uses auto-vectorization in a way that reduces the
    instruction count:

    -O3 (auto-vectorized) -O3 -fno-tree-vectorize
    add $0x8,%rbx add $0x8,%rbx
    movq 0x8(%r13),%xmm0 mov 0x10(%r13),%rax
    add $0x18,%r13 mov 0x8(%r13),%rdx
    movhps -0x8(%r13),%xmm0 add $0x18,%r13
    movups %xmm0,(%r8) mov %rdx,(%r8)
    mov 0x0(%r13),%r8 mov %rax,0x8(%r8)
    mov (%rbx),%rax mov 0x0(%r13),%r8
    jmp *%rax mov (%rbx),%rax
    jmp *%rax

    And the common tail with all these move instructions is gone.

    - anton

    What does your C code look like? I could not get clang or gcc to auto-vectorize
    with my existing code:

    UNS64 *tmp64 = (UNS64*)TOP;
    tmp64[0] = sp[0];
    tmp64[1] = sp[1];
    TOP = sp[2];
    sp += 3;

    Gforth's source code for 2! is:

    2! ( w1 w2 a_addr -- ) core two_store
    ""Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell."" a_addr[0] = w2;
    a_addr[1] = w1;

    A generator produces the following from that, which is passed to gcc:

    LABEL(two_store) /* 2! ( w1 w2 a_addr -- ) S1 -- S1 */
    /* Store @i{w2} into the cell at @i{c-addr} and @i{w1} into the next cell. */
    NAME("2!")
    ip += 1;
    LABEL1(two_store)
    {
    DEF_CA
    MAYBE_UNUSED Cell w1;
    MAYBE_UNUSED Cell w2;
    MAYBE_UNUSED Cell * a_addr;
    NEXT_P0;
    vm_Cell2w(sp[2],w1);
    vm_Cell2w(sp[1],w2);
    vm_Cell2a_(spTOS,a_addr);
    #ifdef VM_DEBUG
    if (vm_debug) {
    fputs(" w1=", vm_out); printarg_w(w1);
    fputs(" w2=", vm_out); printarg_w(w2);
    fputs(" a_addr=", vm_out); printarg_a_(a_addr);
    }
    #endif
    sp += 3;
    {
    #line 1815 "prim"
    a_addr[0] = w2;
    a_addr[1] = w1;
    #line 10136 "prim-fast.i"
    }

    #ifdef VM_DEBUG
    if (vm_debug) {
    fputs(" -- ", vm_out); fputc('\n', vm_out);
    }
    #endif
    NEXT_P1;
    spTOS = sp[0];
    LABEL2(two_store)
    NAME1("l2-two_store")
    NEXT_P1_5;
    LABEL3(two_store)
    NAME1("l3-two_store")
    DO_GOTO;
    }

    There are a lot of macros in this code, and I fear that expanding them
    makes the code even less readable, but the essence for the
    auto-vectorized part is something like:

    w1 = sp[2];
    w2 = sp[1];
    a_addr = spTOS;
    sp += 3;
    a_addr[0] = w2;
    a_addr[1] = w1;
    spTOS = sp[0];

    My guess is that in your code the compiler expected that sp[1] might
    alias with tmp64[0], and therefore did not vectorize the loads and the stores, whereas in the Gforth code, the loads both happen first, and
    then the two stores, and gcc can vectorize that. I doubt that there
    is a big benefit from that, though.

    Yes, that was it. Changing to:

    UNS64 *tmp64 = (UNS64*)TOP;
    UNS64 d0=sp[0];
    UNS64 d1=sp[1];
    tmp64[0] = d0;
    tmp64[1] = d1;
    TOP = sp[2];
    sp += 3;

    made the compiler (clang-21 in this case) generate the expected code



    typedef UNS64 v2u64 __attribute__((vector_size(16))) __attribute__((aligned(8)));

    I'll have to remember the aligned attribute for future games with gcc explicit vectorization.

    Without that it will generate opcodes that need 16-byte alignment.

    BR
    Peter



    --- Synchronet 3.21b-Linux NewsLink 1.2