• Re: Parsing timestamps?

    From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Sun Jul 6 07:50:13 2025
    From Newsgroup: comp.lang.forth

    Recently someone told me about Christianity - how it wasn't meant to be
    easy - how it was supposed to be, among other things, a denial of the senses.

    "...I wished to be quite fair then, and I wish to be quite fair now;
    and I did not conclude that the attack on Christianity was all wrong.
    I only concluded that if Christianity was wrong, it was very
    wrong indeed. Such hostile horrors might be combined in one thing,
    but that thing must be very strange and solitary. There are men
    who are misers, and also spendthrifts; but they are rare. There are
    men sensual and also ascetic; but they are rare. But if this mass
    of mad contradictions really existed, quakerish and bloodthirsty,
    too gorgeous and too thread-bare, austere, yet pandering preposterously
    to the lust of the eye, the enemy of women and their foolish refuge,
    a solemn pessimist and a silly optimist, if this evil existed,
    then there was in this evil something quite supreme and unique.
    For I found in my rationalist teachers no explanation of such
    exceptional corruption. Christianity (theoretically speaking)
    was in their eyes only one of the ordinary myths and errors of mortals.
    THEY gave me no key to this twisted and unnatural badness.
    Such a paradox of evil rose to the stature of the supernatural.
    It was, indeed, almost as supernatural as the infallibility of the Pope.
    An historic institution, which never went right, is really quite
    as much of a miracle as an institution that cannot go wrong.
    The only explanation which immediately occurred to my mind was that Christianity did not come from heaven, but from hell. Really, if Jesus
    of Nazareth was not Christ, He must have been Antichrist.

    And then in a quiet hour a strange thought struck me like a still thunderbolt. There had suddenly come into my mind another explanation.
    Suppose we heard an unknown man spoken of by many men. Suppose we
    were puzzled to hear that some men said he was too tall and some
    too short; some objected to his fatness, some lamented his leanness;
    some thought him too dark, and some too fair. One explanation (as
    has been already admitted) would be that he might be an odd shape.
    But there is another explanation. He might be the right shape.
    Outrageously tall men might feel him to be short. Very short men
    might feel him to be tall. Old bucks who are growing stout might
    consider him insufficiently filled out; old beaux who were growing
    thin might feel that he expanded beyond the narrow lines of elegance.
    Perhaps Swedes (who have pale hair like tow) called him a dark man,
    while negroes considered him distinctly blonde. Perhaps (in short)
    this extraordinary thing is really the ordinary thing; at least
    the normal thing, the centre. Perhaps, after all, it is Christianity
    that is sane and all its critics that are mad -- in various ways.
    I tested this idea by asking myself whether there was about any
    of the accusers anything morbid that might explain the accusation.
    I was startled to find that this key fitted a lock. For instance,
    it was certainly odd that the modern world charged Christianity
    at once with bodily austerity and with artistic pomp. But then
    it was also odd, very odd, that the modern world itself combined
    extreme bodily luxury with an extreme absence of artistic pomp.
    The modern man thought Becket's robes too rich and his meals too poor.
    But then the modern man was really exceptional in history; no man before
    ever ate such elaborate dinners in such ugly clothes. The modern man
    found the church too simple exactly where modern life is too complex;
    he found the church too gorgeous exactly where modern life is too dingy.
    The man who disliked the plain fasts and feasts was mad on entrees.
    The man who disliked vestments wore a pair of preposterous trousers.
    And surely if there was any insanity involved in the matter at all it
    was in the trousers, not in the simply falling robe. If there was any
    insanity at all, it was in the extravagant entrees, not in the bread
    and wine.

    I went over all the cases, and I found the key fitted so far."

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Mon Jul 7 00:05:14 2025
    From Newsgroup: comp.lang.forth

    On Sat, 5 Jul 2025 14:24:37 +0000, minforth wrote:

    On 05.07.2025 at 14:41, minforth wrote:
    On 05.07.2025 at 14:21, albert@spenarnc.xs4all.nl wrote:
    I investigated the instruction set, and I found no way to detect
    if the 8 registers stack is full.
    This would offer the possibility to spill registers to memory only
    if it is needed.


    IIRC, signaling and handling FP-stack overflow is not an easy task.
    At worst, the computer would crash.
    IOW, spilling makes sense.

    A deep dive into the manual

    ... the C1 condition code flag is used for a variety of functions.
    When both the IE and SF flags in the x87 FPU status word are set,
    indicating a stack overflow or underflow exception (#IS), the C1
    flag distinguishes between overflow (C1=1) and underflow (C1=0).

    This definitely does not work (I tried it). That manual is fabulating.

    iForth has its FP stack in memory. However, inside colon definitions
    the compiler tracks the hardware stack. Only when inlining is not
    possible, or would lead to excessive size, HW stack items are
    flushed/reloaded to/from memory. Anyway, a software stack is necessary
    when calling C libraries or the OS.

    The mental cost of writing the FP compiler was insane, but I
    found it justified.

    BTW, the transputer had the same problem (and solution).

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Fri Jul 11 08:57:09 2025
    From Newsgroup: comp.lang.forth

    On Fri, 11 Jul 2025 7:55:43 +0000, Paul Rubin wrote:

    dxf <dxforth@gmail.com> writes:
    But was it the case by the mid/late 70's - or did certain individuals see
    an opportunity to influence the burgeoning microprocessor market? Notions
    of single and double precision already existed in software floating point -

    Hardware floating point also had single and double precision. The
    really awful 1960s systems were gone by the mid 70s. But there were a
    lot of competing formats, ranging from bad to mostly-ok. VAX floating
    point was mostly ok, DEC wanted IEEE to adopt it, Kahan was ok with
    that, but Intel thought "go for the best possible". Kahan's
    retrospectives on this stuff are good reading:

    What is there not to like about the FPU? It provides 80 bits, which
    is in itself a useful additional format, and should never have problems
    with single- and double-precision edge cases. Plus it does all the
    trigonometric and transcendental stuff with reasonable precision
    out of the box. The instruction set is very regular and quite Forth-like.
    The only problem is that some languages and companies find it necessary
    to boycott FPU use.

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jul 11 10:22:54 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    What is there not to like about the FPU? It provides 80 bits, which
    is in itself a useful additional format, and should never have problems
    with single- and double-precision edge cases.

    If you want to do double precision, using the 387 stack has the
    double-rounding problem <https://en.wikipedia.org/wiki/Rounding#Double_rounding>. Even if you
    limit the mantissa to 53 bits, you still get double rounding when you
    deal with numbers that are denormal numbers in binary64
    representation. Java wanted to give the same results, bit for bit, on
    all hardware, and ran afoul of this until they could switch to SSE2.
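
    A decimal analogue of double rounding: rounding 9.46 to one decimal
    first gives 9.5, and rounding 9.5 to an integer (round-half-to-even)
    gives 10, whereas rounding 9.46 directly to an integer gives 9. The
    387's wider intermediate format can produce the same kind of
    discrepancy in binary.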

    The only problem is that some languages and companies find it necessary
    to boycott FPU use.

    The rest of the industry has standardized on binary64 and binary32,
    and they prefer bit-equivalent results for ease of testing. So as
    soon as SSE2 gave that to them, they flocked to SSE2.

    Another nudge towards binary64 (and binary32) is autovectorization.
    You don't want to get different results depending on whether the
    compiler manages to auto-vectorize a program (and use SSE2 parallel
    (rather than scalar) instructions, AVX, or AVX-512) or not. So you
    also use SSE2 when it fails to auto-vectorize.

    OTOH, e.g., on gcc you can ask for -mfpmath=387, for -mfpmath=sse, or
    for -mfpmath=both; or if you define a variable as "long double", it
    will store an 80-bit FP value, and computations involving this
    variable will be done on the 387.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Mon Jul 14 07:21:45 2025
    From Newsgroup: comp.lang.forth

    On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:

    [..] if your implementation performs the same
    bit-exact operations for computing a transcendental function on two
    IEEE 754 compliant platforms, the result will be bit-identical (if it
    is a number). So just use the same implementations of transcendental
    functions, and your results will be bit-identical; concerning the
    NaNs, if you find a difference, check if the involved values are NaNs.

    When e.g. summing the elements of a DP vector, it is hard to see why
    that couldn't be done on the FPU stack (with 80 bits) before (possibly)
    storing the result to a DP variable in memory. I am not sure that Forth
    users would be able to resist that approach.
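
    A minimal sketch of that tempting version (hypothetical word name, not
    taken from iForth or any other system): the running sum stays on the
    FP stack and is only rounded to DP when it is finally stored.

    : fpu-sum ( addr u -- ) ( F: -- r )
       0e                               \ running sum lives on the FP stack
       floats over + swap               \ limit start
       ?DO  I f@ f+  1 floats +LOOP ;

    Whether the additions really happen in 80 bits depends, of course, on
    F+ being implemented with the x87 rather than with scalar SSE2.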

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Jul 14 07:50:04 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:

    [..] if your implementation performs the same
    bit-exact operations for computing a transcendental function on two
    IEEE 754 compliant platforms, the result will be bit-identical (if it
    is a number). So just use the same implementations of transcendental
    functions, and your results will be bit-identical; concerning the
    NaNs, if you find a difference, check if the involved values are NaNs.

    When e.g. summing the elements of a DP vector, it is hard to see why
    that couldn't be done on the FPU stack (with 80 bits) before (possibly)
    storing the result to a DP variable in memory. I am not sure that Forth
    users would be able to resist that approach.

    The question is: What properties do you want your computation to have?

    1) Bit-identical result to a naively-coded IEEE 754 DP computation?

    2) A more accurate result? How much more accuracy?

    3) More performance?

    If you want 1), there is little alternative to actually performing the operations sequentially, using scalar SSE2 operations.

    If you can live without 1), there's a wide range of options:

    A) Perform the naive summation, but using 80-bit addition. This will
    produce higher accuracy, but limit performance to typically 4
    cycles or so per addition (as does the naive SSE2 approach),
    because the latency of the floating-point addition is 4 cycles or
    so (depending on the actual processor).

    B) Perform vectorized summation using SIMD instructions (e.g.,
    AVX-512), with enough parallel additions (beyond the vector size)
    that either the load unit throughput, the FPU throughput, or the
    instruction issue rate will limit the performance. Reduce the n
    intermediate results to one intermediate result in the end. If I
    give the naive loop to gcc -O3 and allow it to pretend that
    floating-point addition is associative, it produces such a
    computation automatically. The result will typically be a little
    more accurate than the result of 1), because the length of the
    addition chains is length(vector)/lanes+ld(lanes) rather than
    length(vector).

    C) Perform tree addition

    a) Using 80-bit addition. This will be faster than sequential
    addition because in many cases several additions can run in
    parallel. It will also be quite accurate because it uses 80-bit
    addition, and because the addition chains are reduced to
    ld(length(vector)).

    b) Using DP addition. This allows the use of SIMD instructions for
    increased performance (except near the root of the tree), but the
    accuracy is not as good as with 80-bit addition. It is still
    good because the length of the addition chains is only
    ld(length(vector)).

    D) Use Kahan summation (you must not allow the compiler to pretend
    that FP addition is associative, or this will not work) or one of
    its enhancements. This provides very high accuracy, but (in the case
    of the original Kahan summation) requires four FP operations for
    each summand, and each operation depends on the previous one. So
    you get the latency of 4 FP additions per iteration for a version
    that goes across the array sequentially. You can apply
    vectorization to eliminate the effect of these latencies, but you
    will still see the increased resource consumption. If the vector
    resides in a distant cache or in main memory, memory bandwidth may
    limit performance more than the lack of FPU resources, however.
    (A minimal Forth sketch of the original Kahan loop follows below.)

    E) Sort the vector, then start with the element closest to 0. At
    every step, add the element of the sign other than the current
    intermediate sum that is closest to 0. If there is no such element
    left, add the remaining elements in order, starting with the one
    closest to 0. This is pretty accurate, but slower than naive
    addition. At the current relative costs of sorting and FP
    operations, Kahan summation probably dominates this approach.


    So, as you can see, depending on your objectives there may be more
    attractive ways to add a vector than what you suggested. Your
    suggestion actually looks pretty unattractive, except if your
    objectives are "ease of implementation" and "more accuracy than the
    naive approach".
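
    For reference, a minimal Forth sketch of the original Kahan loop from
    option D (hypothetical word names, an FVARIABLE-based formulation, not
    tuned for speed); it shows the four FP operations per summand:

    fvariable ksum   fvariable kerr
    : kahan-sum ( addr u -- ) ( F: -- r )
       0e ksum f!  0e kerr f!
       floats over + swap ?DO
          I f@ kerr f@ f-                  \ y   = x[i] - err
          fdup ksum f@ f+                  \ t   = sum + y
          fdup ksum f@ f- frot f- kerr f!  \ err = (t - sum) - y
          ksum f!                          \ sum = t
       1 floats +LOOP
       ksum f@ ;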

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Mon Jul 14 18:13:34 2025
    From Newsgroup: comp.lang.forth

    On Mon, 14 Jul 2025 7:50:04 +0000, Anton Ertl wrote:

    mhx@iae.nl (mhx) writes:
    On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:
    [..]
    The question is: What properties do you want your computation to have?
    [..]
    2) A more accurate result? How much more accuracy?

    3) More performance?

    3) + 2). If the result is more accurate, the condition number of
    matrices should be better, resulting in fewer LU decomposition
    iterations. However, solving the system matrix normally takes
    less than 20% of the total runtime.

    I've never seen *anybody* worry about the numerical accuracy of
    final simulation results.

    [..]
    C) Perform tree addition

    a) Using 80-bit addition. This will be faster than sequential
    addition because in many cases several additions can run in
    parallel. It will also be quite accurate because it uses 80-bit
    addition, and because the addition chains are reduced to
    ld(length(vector)).

    This looks very interesting. I can find Kahan and Neumaier, but
    "tree addition" didn't turn up (there is a suspicious-looking
    reliability paper about the approach, which is surely not what
    you meant). Or is pairwise addition what I should look for?

    So, as you can see, depending on your objectives there may be more
    attractive ways to add a vector than what you suggested. Your
    suggestion actually looks pretty unattractive, except if your
    objectives are "ease of implementation" and "more accuracy than the
    naive approach".

    Sure, "ease of implementation" is high on my list too. Life is too
    short.

    Thank you for your wonderful and very useful suggestions.

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Jul 16 11:25:04 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    On Mon, 14 Jul 2025 7:50:04 +0000, Anton Ertl wrote:
    C) Perform tree addition

    a) Using 80-bit addition. This will be faster than sequential
    addition because in many cases several additions can run in
    parallel. It will also be quite accurate because it uses 80-bit
    addition, and because the addition chains are reduced to
    ld(length(vector)).

    This looks very interesting. I can find Kahan and Neumaier, but
    "tree addition" didn't turn up (there is a suspicious-looking
    reliability paper about the approach, which is surely not what
    you meant). Or is pairwise addition what I should look for?

    Yes, "tree addition" is not a common term, and Wikipedia calls it
    pairwise addition. Except that unlike suggeseted in <https://en.wikipedia.org/wiki/Pairwise_summation> I would not switch to
    a sequential approach for small n, for both accuracy and performance.
    In any case the idea is to turn the evaluation tree from a degenerate
    tree into a balanced tree. E.g., if you add up a, b, c, and d, then
    the naive evaluation

    a b f+ c f+ d f+

    has the evaluation tree

        a   b
         \ /
          f+  c
           \  /
            f+  d
             \  /
              f+

    with the three F+ each depending on the previous one, and also
    increasing the rounding errors. If you balance the tree

        a   b   c   d
         \ /     \ /
          f+      f+
            \    /
             \  /
              f+

    corresponding to

    a b f+ c d f+ f+

    the first two f+ can run in parallel (increasing performance), and the
    rounding errors tend to be less.

    So how to implement this for an arbitrary N? We had an extensive
    discussion of a similar problem in the thread on the subject "balanced
    REDUCE: a challenge for the brave", and you can find that discussion
    at <https://comp.lang.forth.narkive.com/GIg9V9HK/balanced-reduce-a-challenge-for-the-brave>

    But I decided to use a recursive approach (recursive-sum, REC) that
    uses the largest 2^k<n as the left child and the rest as the right
    child, and as base cases for the recursion use a straight-line
    balanced-tree evaluation for 2^k with k<=7 (and combine these for n
    that are not 2^k). For systems with tiny FP stacks, I added the
    option to save intermediate results on a software stack in the
    recursive word. Concerning the straight-line code, it turned out that
    the highest k I could use on sf64 and vfx64 is 5 (corresponding to 6
    FP stack items); it's not clear to me why; on lxf I can use k=7 (and
    it uses the 387 stack, too).
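
    To make the recursive idea concrete, here is a minimal sketch; this is
    not the code in pairwise-sum.4th (the word name, the halving split,
    and the trivial base case are simplifications):

    : pairwise-sum ( addr u -- ) ( F: -- r )
       dup 2 u< IF                      \ 0 or 1 element: trivial case
          IF f@ ELSE drop 0e THEN EXIT
       THEN
       dup 2/ >r                        \ left child: u/2 elements
       over r@ recurse                  \ F: left-sum
       swap r@ floats + swap r> -       \ right child: addr + u/2 floats, u - u/2
       recurse f+ ;                     \ F: left-sum + right-sum

    Note that the left sum stays on the FP stack across the second
    RECURSE, so the FP depth grows like ld(n); that is exactly what the
    2^k straight-line base cases and the optional software stack are
    there to avoid.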

    I also coded the shift-reduce-sum algorithm (shift-reduce-sum, SR)
    described in <https://en.wikipedia.org/wiki/Pairwise_summation> in
    Forth, because it can make use of Forth's features (such as the FP
    stack) where the C code has to hand-code it. It uses the FP stack
    beyond 8 elements if there are more than 128 elements in the array, so
    it does not work for the benchmark (with 100_000 elements in the
    array) on lxf, sf64, and vfx64. As you will see, this is no loss.

    I also coded the naive, sequential approach (naive-sum, NAI).

    One might argue that the straight-line stuff in REC puts REC at an
    advantage, so I also produced an unrolled version of the naive code
    (unrolled-sum, UNR) that uses straight-line sequences for adding up to
    2^7 elements to the intermediate result.

    You can find a file containing all these versions, compatibility
    configurations for various Forth systems, and testing and benchmarking
    code and data, on

    https://www.complang.tuwien.ac.at/forth/programs/pairwise-sum.4th

    I did not do any accuracy measurements, but I did performance
    measurements on a Ryzen 5800X:

    cycles:u
        gforth-fast          iforth             lxf      SwiftForth             VFX
      3_057_979_501   6_482_017_334   6_087_130_593   6_021_777_424   6_034_560_441  NAI
      6_601_284_920   6_452_716_125   7_001_806_497   6_606_674_147   6_713_703_069  UNR
      3_787_327_724   2_949_273_264   1_641_710_689   7_437_654_901   1_298_257_315  REC
      9_150_679_812  14_634_786_781                                                  SR

    cycles:u
        gforth-fast          iforth             lxf      SwiftForth             VFX
     13_113_842_702   6_264_132_870   9_011_308_923  11_011_828_048   8_072_637_768  NAI
      6_802_702_884   2_553_418_501   4_238_099_417  11_277_658_203   3_244_590_981  UNR
      9_370_432_755   4_489_562_792   4_955_679_285  12_283_918_226   3_915_367_813  REC
     51_113_853_111  29_264_267_850                                                  SR

    The versions used are:
    Gforth 0.7.9_20250625
    iForth 5.1-mini
    lxf 1.7-172-983
    SwiftForth x64-Linux 4.0.0-RC89
    VFX Forth 64 5.43 [build 0199] 2023-11-09

    The ":u" means that I measured what happened at the user-level, not at
    the kernel-level.

    Each benchmark run performs 1G f@ and f+ operations, and the naive
    approach performs 1G iterations of the loop.

    The NAIve and UNRolled results show that performance in both is
    limited by the latency of the F+: 3 cycles for the DP SSE2 operation
    in Gforth-fast, 6 cycles for the 80-bit 387 fadd on the other systems.
    It's unclear to me why UNR is much slower on gforth-fast compared to
    NAI.
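
    As a back-of-the-envelope check: 1G dependent F+ at 6 cycles of
    latency comes to about 6G cycles, which is what the four 387-based
    systems show for NAI; at 3 cycles for the scalar SSE2 addition it
    comes to about 3G cycles, matching gforth-fast's 3.06G.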

    The RECursive balanced-tree sum is faster on iForth, lxf and VFX than
    the NAIve and UNRolled versions. It is slower on Gforth: My guess is
    that, despite all hardware advances, the lack of multi-state stack
    caching in Gforth means that the hardware of the Ryzen 5800X does not
    just see the real data flow, but a lot of additional dependences; or
    it may be related to whatever causes the slowdown for UNRolled.

    The SR (shift-reduce) sum looks cute, but performs so many additional
    instructions, even on iForth, that it is uncompetitive. It's unclear
    to me what slows it down so much on iForth, however.

    I expect that vectorized implementations using AVX will be several
    times faster than the fastest scalar stuff we see here.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Jul 16 15:39:26 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I did not do any accuracy measurements, but I did performance
    measurements on a Ryzen 5800X:

    cycles:u
        gforth-fast          iforth             lxf      SwiftForth             VFX
      3_057_979_501   6_482_017_334   6_087_130_593   6_021_777_424   6_034_560_441  NAI
      6_601_284_920   6_452_716_125   7_001_806_497   6_606_674_147   6_713_703_069  UNR
      3_787_327_724   2_949_273_264   1_641_710_689   7_437_654_901   1_298_257_315  REC
      9_150_679_812  14_634_786_781                                                  SR

    cycles:u

    This second table is about instructions:u

        gforth-fast          iforth             lxf      SwiftForth             VFX
     13_113_842_702   6_264_132_870   9_011_308_923  11_011_828_048   8_072_637_768  NAI
      6_802_702_884   2_553_418_501   4_238_099_417  11_277_658_203   3_244_590_981  UNR
      9_370_432_755   4_489_562_792   4_955_679_285  12_283_918_226   3_915_367_813  REC
     51_113_853_111  29_264_267_850                                                  SR

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Jul 16 16:02:41 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    But I decided to use a recursive approach (recursive-sum, REC) that
    uses the largest 2^k<n as the left child and the rest as the right
    child, and as base cases for the recursion use a straight-line
    balanced-tree evaluation for 2^k with k<=7 (and combine these for n
    that are not 2^k). For systems with tiny FP stacks, I added the
    option to save intermediate results on a software stack in the
    recursive word. Concerning the straight-line code, it turned out that
    the highest k I could use on sf64 and vfx64 is 5 (corresponding to 6
    FP stack items); it's not clear to me why; on lxf I can use k=7 (and
    it uses the 387 stack, too).

    Actually, after writing that, I found out the reasons for the FP stack overflows, and in the published versions and the results I use k=7 on
    all systems. It's really easy to leave an FP stack item on the FP
    stack while calling another word, and that's not so good if you do it
    while calling sum128:-).
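
    The pitfall in a nutshell (an illustrative, hypothetical caller;
    assuming sum128 has the stack effect ( addr -- ) ( F: -- r ) and may
    itself need all 8 x87 cells):

    : two-blocks ( addr -- ) ( F: -- r )
       dup sum128                  \ F: left-sum, parked across the next call
       128 floats + sum128         \ sum128 now has one FP cell less to use
       f+ ;

    The fix is to park the left sum in an FVARIABLE (or on a software
    stack) before calling sum128 again.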

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Jul 16 21:12:13 2025
    From Newsgroup: comp.lang.forth

    Well, that is strange ...

    Results with the current iForth are quite different:

    FORTH> bench ( see file quoted above + usual iForth timing words )
    \ 7963 times
    \ naive-sum : 0.999 seconds elapsed. ( 4968257259 )
    \ unrolled-sum : 1.004 seconds elapsed. ( 4968257259 )
    \ recursive-sum : 0.443 seconds elapsed. ( 4968257259 )
    \ shift-reduce-sum : 2.324 seconds elapsed. ( 4968257259 ) ok

    So here recursive-sum is by far the fastest, and shift-reduce-sum
    is not horribly slow. The slowdown in shift-reduce-sum comes from the
    2nd loop using the external stack.

    -marcel

    PS: Because of recent user requests a development snapshot was
    made available at the usual place.

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jul 17 12:41:45 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    Well, that is strange ...

    Results with the current iForth are quite different:

    FORTH> bench ( see file quoted above + usual iForth timing words )
    \ 7963 times
    \ naive-sum : 0.999 seconds elapsed. ( 4968257259 )
    \ unrolled-sum : 1.004 seconds elapsed. ( 4968257259 )
    \ recursive-sum : 0.443 seconds elapsed. ( 4968257259 )
    \ shift-reduce-sum : 2.324 seconds elapsed. ( 4968257259 ) ok

    Assuming that you were also using a Ryzen 5800X (or another Zen3-based
    CPU) running at 4.8GHz, accounting for the different number of
    iterations, and that basically all the elapsed time is due to user
    cycles of our benchmark, I defined:

    : scale s>f 4.8e9 f/ 10000e f/ 7963e f* ;

    The output should be the approximate number of seconds. Here's what I
    get from the cycles:u numbers for iForth 5.1-mini given in the earlier
    postings:

    \ ------------ input ---------- | output
    6_482_017_334 scale 7 5 3 f.rdp 1.07534 ok
    6_452_716_125 scale 7 5 3 f.rdp 1.07048 ok
    2_949_273_264 scale 7 5 3 f.rdp 0.48927 ok
    14_634_786_781 scale 7 5 3 f.rdp 2.42785 ok

    The resulting numbers are not very different from those you show. My measurements include iForth's startup overhead, which may be one
    explanation why they are a little higher.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Fri Jul 18 05:25:21 2025
    From Newsgroup: comp.lang.forth

    On Thu, 17 Jul 2025 12:41:45 +0000, Anton Ertl wrote:

    mhx@iae.nl (mhx) writes:
    Well, that is strange ...
    [..]
    The output should be the approximate number of seconds. Here's what I
    get from the cycles:u numbers for iForth 5.1-mini given in the earlier postings:

    \ ------------ input ---------- | output
    6_482_017_334 scale 7 5 3 f.rdp 1.07534 ok
    6_452_716_125 scale 7 5 3 f.rdp 1.07048 ok
    2_949_273_264 scale 7 5 3 f.rdp 0.48927 ok
    14_634_786_781 scale 7 5 3 f.rdp 2.42785 ok

    The resulting numbers are not very different from those you show. My measurements include iForth's startup overhead, which may be one
    explanation why they are a little higher.

    You are right, of course. I was confused by the original posting's
    second table (which showed #instructions but was labeled #cycles).

    ( For the record, I used #7963 iterations of the code to get
    approximately 1 second runtime for the naive implementation. )

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2