• Re: Parsing timestamps?

    From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Sun Jul 6 07:50:13 2025
    From Newsgroup: comp.lang.forth

    Recently someone told me about Christianity - how it wasn't meant to be
    easy - how it was supposed to be, among other things, a denial of the senses.

    "...I wished to be quite fair then, and I wish to be quite fair now;
    and I did not conclude that the attack on Christianity was all wrong.
    I only concluded that if Christianity was wrong, it was very
    wrong indeed. Such hostile horrors might be combined in one thing,
    but that thing must be very strange and solitary. There are men
    who are misers, and also spendthrifts; but they are rare. There are
    men sensual and also ascetic; but they are rare. But if this mass
    of mad contradictions really existed, quakerish and bloodthirsty,
    too gorgeous and too thread-bare, austere, yet pandering preposterously
    to the lust of the eye, the enemy of women and their foolish refuge,
    a solemn pessimist and a silly optimist, if this evil existed,
    then there was in this evil something quite supreme and unique.
    For I found in my rationalist teachers no explanation of such
    exceptional corruption. Christianity (theoretically speaking)
    was in their eyes only one of the ordinary myths and errors of mortals.
    THEY gave me no key to this twisted and unnatural badness.
    Such a paradox of evil rose to the stature of the supernatural.
    It was, indeed, almost as supernatural as the infallibility of the Pope.
    An historic institution, which never went right, is really quite
    as much of a miracle as an institution that cannot go wrong.
    The only explanation which immediately occurred to my mind was that Christianity did not come from heaven, but from hell. Really, if Jesus
    of Nazareth was not Christ, He must have been Antichrist.

    And then in a quiet hour a strange thought struck me like a still thunderbolt. There had suddenly come into my mind another explanation.
    Suppose we heard an unknown man spoken of by many men. Suppose we
    were puzzled to hear that some men said he was too tall and some
    too short; some objected to his fatness, some lamented his leanness;
    some thought him too dark, and some too fair. One explanation (as
    has been already admitted) would be that he might be an odd shape.
    But there is another explanation. He might be the right shape.
    Outrageously tall men might feel him to be short. Very short men
    might feel him to be tall. Old bucks who are growing stout might
    consider him insufficiently filled out; old beaux who were growing
    thin might feel that he expanded beyond the narrow lines of elegance.
    Perhaps Swedes (who have pale hair like tow) called him a dark man,
    while negroes considered him distinctly blonde. Perhaps (in short)
    this extraordinary thing is really the ordinary thing; at least
    the normal thing, the centre. Perhaps, after all, it is Christianity
    that is sane and all its critics that are mad -- in various ways.
    I tested this idea by asking myself whether there was about any
    of the accusers anything morbid that might explain the accusation.
    I was startled to find that this key fitted a lock. For instance,
    it was certainly odd that the modern world charged Christianity
    at once with bodily austerity and with artistic pomp. But then
    it was also odd, very odd, that the modern world itself combined
    extreme bodily luxury with an extreme absence of artistic pomp.
    The modern man thought Becket's robes too rich and his meals too poor.
    But then the modern man was really exceptional in history; no man before
    ever ate such elaborate dinners in such ugly clothes. The modern man
    found the church too simple exactly where modern life is too complex;
    he found the church too gorgeous exactly where modern life is too dingy.
    The man who disliked the plain fasts and feasts was mad on entrees.
    The man who disliked vestments wore a pair of preposterous trousers.
    And surely if there was any insanity involved in the matter at all it
    was in the trousers, not in the simply falling robe. If there was any
    insanity at all, it was in the extravagant entrees, not in the bread
    and wine.

    I went over all the cases, and I found the key fitted so far."

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Mon Jul 7 00:05:14 2025
    From Newsgroup: comp.lang.forth

    On Sat, 5 Jul 2025 14:24:37 +0000, minforth wrote:

    On 05.07.2025 at 14:41, minforth wrote:
    On 05.07.2025 at 14:21, albert@spenarnc.xs4all.nl wrote:
    I investigated the instruction set, and I found no way to detect
    if the 8 registers stack is full.
    This would offer the possibility to spill registers to memory only
    if it is needed.


    IIRC, signaling and handling FP-stack overflow is not an easy task.
    At worst, the computer would crash.
    IOW, spilling makes sense.

    A deep dive into the manual

    ... the C1 condition code flag is used for a variety of functions.
    When both the IE and SF flags in the x87 FPU status word are set,
    indicating a stack overflow or underflow exception (#IS), the C1
    flag distinguishes between overflow (C1=1) and underflow (C1=0).

    This definitely does not work (I tried it). That manual is fabulating.

    iForth has its FP stack in memory. However, inside colon definitions
    the compiler tracks the hardware stack. Only when inlining is not
    possible, or would lead to excessive size, HW stack items are
    flushed/reloaded to/from memory. Anyway, a software stack is necessary
    when calling C libraries or the OS.

    The mental cost of writing the FP compiler was insane, but I
    found it justified.

    BTW, the transputer had the same problem (and solution).

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Fri Jul 11 08:57:09 2025
    From Newsgroup: comp.lang.forth

    On Fri, 11 Jul 2025 7:55:43 +0000, Paul Rubin wrote:

    dxf <dxforth@gmail.com> writes:
    But was it the case by the mid/late 70's - or did certain individuals see
    an opportunity to influence the burgeoning microprocessor market? Notions
    of single and double precision already existed in software floating point -

    Hardware floating point also had single and double precision. The
    really awful 1960s systems were gone by the mid 70s. But there were a
    lot of competing formats, ranging from bad to mostly-ok. VAX floating
    point was mostly ok, DEC wanted IEEE to adopt it, Kahan was ok with
    that, but Intel thought "go for the best possible". Kahan's
    retrospectives on this stuff are good reading:

    What is there not to like about the FPU? It provides 80 bits, which
    is in itself a useful additional format, and should never have problems
    with single- and double-precision edge cases. Plus it does all the
    trigonometric and transcendental stuff with reasonable precision
    out of the box. The instruction set is very regular and quite Forth-like.
    The only problem is that some languages and companies find it necessary
    to boycott FPU use.

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Jul 11 10:22:54 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    What is there not to like about the FPU? It provides 80 bits, which
    is in itself a useful additional format, and should never have problems
    with single- and double-precision edge cases.

    If you want to do double precision, using the 387 stack has the
    double-rounding problem <https://en.wikipedia.org/wiki/Rounding#Double_rounding>. Even if you
    limit the mantissa to 53 bits, you still get double rounding when you
    deal with numbers that are denormal numbers in binary64
    representation. Java wanted to give the same results, bit for bit, on
    all hardware, and ran afoul of this until they could switch to SSE2.
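
    A decimal analogue of double rounding: rounding 9.46 to one decimal
    first gives 9.5, and rounding 9.5 to an integer (round-half-to-even)
    gives 10, whereas rounding 9.46 directly to an integer gives 9. The
    387's wider intermediate format can produce the same kind of
    discrepancy in binary.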

    The only problem is that some languages and companies find it necessary
    to boycott FPU use.

    The rest of the industry has standardized on binary64 and binary32,
    and they prefer bit-equivalent results for ease of testing. So as
    soon as SSE2 gave that to them, they flocked to SSE2.

    Another nudge towards binary64 (and binary32) is autovectorization.
    You don't want to get different results depending on whether the
    compiler manages to auto-vectorize a program (and use SSE2 parallel
    (rather than scalar) instructions, AVX, or AVX-512) or not. So you
    also use SSE2 when it fails to auto-vectorize.

    OTOH, e.g., on gcc you can ask for -mfpmath=387, for -mfpmath=sse, or
    for -mfpmath=both; or if you define a variable as "long double", it
    will store an 80-bit FP value, and computations involving this
    variable will be done on the 387.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Mon Jul 14 07:21:45 2025
    From Newsgroup: comp.lang.forth

    On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:

    [..] if your implementation performs the same
    bit-exact operations for computing a transcendental function on two
    IEEE 754 compliant platforms, the result will be bit-identical (if it
    is a number). So just use the same implementations of transcendental
    functions, and your results will be bit-identical; concerning the
    NaNs, if you find a difference, check if the involved values are NaNs.

    When e.g. summing the elements of a DP vector, it is hard to see why
    that couldn't be done on the FPU stack (with 80 bits) before (possibly)
    storing the result to a DP variable in memory. I am not sure that Forth
    users would be able to resist that approach.
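
    A minimal sketch of that tempting version (hypothetical word name, not
    taken from iForth or any other system): the running sum stays on the
    FP stack and is only rounded to DP when it is finally stored.

    : fpu-sum ( addr u -- ) ( F: -- r )
       0e                               \ running sum lives on the FP stack
       floats over + swap               \ limit start
       ?DO  I f@ f+  1 floats +LOOP ;

    Whether the additions really happen in 80 bits depends, of course, on
    F+ being implemented with the x87 rather than with scalar SSE2.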

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Jul 14 07:50:04 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:

    [..] if your implementation performs the same
    bit-exact operations for computing a transcendental function on two
    IEEE 754 compliant platforms, the result will be bit-identical (if it
    is a number). So just use the same implementations of transcendental
    functions, and your results will be bit-identical; concerning the
    NaNs, if you find a difference, check if the involved values are NaNs.

    When e.g. summing the elements of a DP vector, it is hard to see why
    that couldn't be done on the FPU stack (with 80 bits) before (possibly)
    storing the result to a DP variable in memory. I am not sure that Forth
    users would be able to resist that approach.

    The question is: What properties do you want your computation to have?

    1) Bit-identical result to a naively-coded IEEE 754 DP computation?

    2) A more accurate result? How much more accuracy?

    3) More performance?

    If you want 1), there is little alternative to actually performing the operations sequentially, using scalar SSE2 operations.

    If you can live without 1), there's a wide range of options:

    A) Perform the naive summation, but using 80-bit addition. This will
    produce higher accuracy, but limit performance to typically 4
    cycles or so per addition (as does the naive SSE2 approach),
    because the latency of the floating-point addition is 4 cycles or
    so (depending on the actual processor).

    B) Perform vectorized summation using SIMD instructions (e.g.,
    AVX-512), with enough parallel additions (beyond the vector size)
    that either the load unit throughput, the FPU throughput, or the
    instruction issue rate will limit the performance. Reduce the n
    intermediate results to one intermediate result in the end. If I
    give the naive loop to gcc -O3 and allow it to pretend that
    floating-point addition is associative, it produces such a
    computation automatically. The result will typically be a little
    more accurate than the result of 1), because the length of the
    addition chains is length(vector)/lanes+ld(lanes) rather than
    length(vector).

    C) Perform tree addition

    a) Using 80-bit addition. This will be faster than sequential
    addition because in many cases several additions can run in
    parallel. It will also be quite accurate because it uses 80-bit
    addition, and because the addition chains are reduced to
    ld(length(vector)).

    b) Using DP addition. This allows the use of SIMD instructions for
    increased performance (except near the root of the tree), but the
    accuracy is not as good as with 80-bit addition. It is still
    good because the length of the addition chains is only
    ld(length(vector)).

    D) Use Kahan summation (you must not allow the compiler to pretend
    that FP addition is associative, or this will not work) or one of
    its enhancements. This provides very high accuracy, but (in the case
    of the original Kahan summation) requires four FP operations for
    each summand, and each operation depends on the previous one. So
    you get the latency of 4 FP additions per iteration for a version
    that goes across the array sequentially. You can apply
    vectorization to eliminate the effect of these latencies, but you
    will still see the increased resource consumption. If the vector
    resides in a distant cache or in main memory, memory bandwidth may
    limit performance more than the lack of FPU resources, however.
    (A minimal Forth sketch of the original Kahan loop follows below.)

    E) Sort the vector, then start with the element closest to 0. At
    every step, add the element of the sign other than the current
    intermediate sum that is closest to 0. If there is no such element
    left, add the remaining elements in order, starting with the one
    closest to 0. This is pretty accurate, but slower than naive
    addition. At the current relative costs of sorting and FP
    operations, Kahan summation probably dominates this approach.


    So, as you can see, depending on your objectives there may be more
    attractive ways to add a vector than what you suggested. Your
    suggestion actually looks pretty unattractive, except if your
    objectives are "ease of implementation" and "more accuracy than the
    naive approach".
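
    For reference, a minimal Forth sketch of the original Kahan loop from
    option D (hypothetical word names, an FVARIABLE-based formulation, not
    tuned for speed); it shows the four FP operations per summand:

    fvariable ksum   fvariable kerr
    : kahan-sum ( addr u -- ) ( F: -- r )
       0e ksum f!  0e kerr f!
       floats over + swap ?DO
          I f@ kerr f@ f-                  \ y   = x[i] - err
          fdup ksum f@ f+                  \ t   = sum + y
          fdup ksum f@ f- frot f- kerr f!  \ err = (t - sum) - y
          ksum f!                          \ sum = t
       1 floats +LOOP
       ksum f@ ;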

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Mon Jul 14 18:13:34 2025
    From Newsgroup: comp.lang.forth

    On Mon, 14 Jul 2025 7:50:04 +0000, Anton Ertl wrote:

    mhx@iae.nl (mhx) writes:
    On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:
    [..]
    The question is: What properties do you want your computation to have?
    [..]
    2) A more accurate result? How much more accuracy?

    3) More performance?

    3) + 2). If the result is more accurate, the condition number of
    matrices should be better, resulting in fewer LU decomposition
    iterations. However, solving the system matrix normally takes
    less than 20% of the total runtime.

    I've never seen *anybody* worry about the numerical accuracy of
    final simulation results.

    [..]
    C) Perform tree addition

    a) Using 80-bit addition. This will be faster than sequential
    addition because in many cases several additions can run in
    parallel. It will also be quite accurate because it uses 80-bit
    addition, and because the addition chains are reduced to
    ld(length(vector)).

    This looks very interesting. I can find Kahan and Neumaier, but
    "tree addition" didn't turn up (there is a suspicious-looking
    reliability paper about the approach, which is surely not what
    you meant). Or is pairwise addition what I should look for?

    So, as you can see, depending on your objectives there may be more
    attractive ways to add a vector than what you suggested. Your
    suggestion actually looks pretty unattractive, except if your
    objectives are "ease of implementation" and "more accuracy than the
    naive approach".

    Sure, "ease of implementation" is high on my list too. Life is too
    short.

    Thank you for your wonderful and very useful suggestions.

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Jul 16 11:25:04 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    On Mon, 14 Jul 2025 7:50:04 +0000, Anton Ertl wrote:
    C) Perform tree addition

    a) Using 80-bit addition. This will be faster than sequential
    addition because in many cases several additions can run in
    parallel. It will also be quite accurate because it uses 80-bit
    addition, and because the addition chains are reduced to
    ld(length(vector)).

    This looks very interesting. I can find Kahan and Neumaier, but
    "tree addition" didn't turn up (there is a suspicious-looking
    reliability paper about the approach, which is surely not what
    you meant). Or is pairwise addition what I should look for?

    Yes, "tree addition" is not a common term, and Wikipedia calls it
    pairwise addition. Except that unlike suggeseted in <https://en.wikipedia.org/wiki/Pairwise_summation> I would not switch to
    a sequential approach for small n, for both accuracy and performance.
    In any case the idea is to turn the evaluation tree from a degenerate
    tree into a balanced tree. E.g., if you add up a, b, c, and d, then
    the naive evaluation

    a b f+ c f+ d f+

    has the evaluation tree

        a   b
         \ /
          f+  c
           \  /
            f+  d
             \  /
              f+

    with the three F+ each depending on the previous one, and also
    increasing the rounding errors. If you balance the tree

        a   b   c   d
         \ /     \ /
          f+      f+
            \    /
             \  /
              f+

    corresponding to

    a b f+ c d f+ f+

    the first two f+ can run in parallel (increasing performance), and the
    rounding errors tend to be less.

    So how to implement this for an arbitrary N? We had an extensive
    discussion of a similar problem in the thread on the subject "balanced
    REDUCE: a challenge for the brave", and you can find that discussion
    at <https://comp.lang.forth.narkive.com/GIg9V9HK/balanced-reduce-a-challenge-for-the-brave>

    But I decided to use a recursive approach (recursive-sum, REC) that
    uses the largest 2^k<n as the left child and the rest as the right
    child, and as base cases for the recursion use a straight-line
    balanced-tree evaluation for 2^k with k<=7 (and combine these for n
    that are not 2^k). For systems with tiny FP stacks, I added the
    option to save intermediate results on a software stack in the
    recursive word. Concerning the straight-line code, it turned out that
    the highest k I could use on sf64 and vfx64 is 5 (corresponding to 6
    FP stack items); it's not clear to me why; on lxf I can use k=7 (and
    it uses the 387 stack, too).
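
    To make the recursive idea concrete, here is a minimal sketch; this is
    not the code in pairwise-sum.4th (the word name, the halving split,
    and the trivial base case are simplifications):

    : pairwise-sum ( addr u -- ) ( F: -- r )
       dup 2 u< IF                      \ 0 or 1 element: trivial case
          IF f@ ELSE drop 0e THEN EXIT
       THEN
       dup 2/ >r                        \ left child: u/2 elements
       over r@ recurse                  \ F: left-sum
       swap r@ floats + swap r> -       \ right child: addr + u/2 floats, u - u/2
       recurse f+ ;                     \ F: left-sum + right-sum

    Note that the left sum stays on the FP stack across the second
    RECURSE, so the FP depth grows like ld(n); that is exactly what the
    2^k straight-line base cases and the optional software stack are
    there to avoid.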

    I also coded the shift-reduce-sum algorithm (shift-reduce-sum, SR)
    described in <https://en.wikipedia.org/wiki/Pairwise_summation> in
    Forth, because it can make use of Forth's features (such as the FP
    stack) where the C code has to hand-code it. It uses the FP stack
    beyond 8 elements if there are more than 128 elements in the array, so
    it does not work for the benchmark (with 100_000 elements in the
    array) on lxf, sf64, and vfx64. As you will see, this is no loss.

    I also coded the naive, sequential approach (naive-sum, NAI).

    One might argue that the straight-line stuff in REC puts REC at an
    advantage, so I also produced an unrolled version of the naive code
    (unrolled-sum, UNR) that uses straight-line sequences for adding up to
    2^7 elements to the intermediate result.

    You can find a file containing all these versions, compatibility
    configurations for various Forth systems, and testing and benchmarking
    code and data, on

    https://www.complang.tuwien.ac.at/forth/programs/pairwise-sum.4th

    I did not do any accuracy measurements, but I did performance
    measurements on a Ryzen 5800X:

    cycles:u
        gforth-fast          iforth             lxf      SwiftForth             VFX
      3_057_979_501   6_482_017_334   6_087_130_593   6_021_777_424   6_034_560_441  NAI
      6_601_284_920   6_452_716_125   7_001_806_497   6_606_674_147   6_713_703_069  UNR
      3_787_327_724   2_949_273_264   1_641_710_689   7_437_654_901   1_298_257_315  REC
      9_150_679_812  14_634_786_781                                                  SR

    cycles:u
        gforth-fast          iforth             lxf      SwiftForth             VFX
     13_113_842_702   6_264_132_870   9_011_308_923  11_011_828_048   8_072_637_768  NAI
      6_802_702_884   2_553_418_501   4_238_099_417  11_277_658_203   3_244_590_981  UNR
      9_370_432_755   4_489_562_792   4_955_679_285  12_283_918_226   3_915_367_813  REC
     51_113_853_111  29_264_267_850                                                  SR

    The versions used are:
    Gforth 0.7.9_20250625
    iForth 5.1-mini
    lxf 1.7-172-983
    SwiftForth x64-Linux 4.0.0-RC89
    VFX Forth 64 5.43 [build 0199] 2023-11-09

    The ":u" means that I measured what happened at the user-level, not at
    the kernel-level.

    Each benchmark run performs 1G f@ and f+ operations, and the naive
    approach performs 1G iterations of the loop.

    The NAIve and UNRolled results show that performance in both is
    limited by the latency of the F+: 3 cycles for the DP SSE2 operation
    in Gforth-fast, 6 cycles for the 80-bit 387 fadd on the other systems.
    It's unclear to me why UNR is much slower on gforth-fast compared to
    NAI.
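
    As a back-of-the-envelope check: 1G dependent F+ at 6 cycles of
    latency comes to about 6G cycles, which is what the four 387-based
    systems show for NAI; at 3 cycles for the scalar SSE2 addition it
    comes to about 3G cycles, matching gforth-fast's 3.06G.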

    The RECursive balanced-tree sum is faster on iForth, lxf and VFX than
    the NAIve and UNRolled versions. It is slower on Gforth: My guess is
    that, despite all hardware advances, the lack of multi-state stack
    caching in Gforth means that the hardware of the Ryzen 5800X does not
    just see the real data flow, but a lot of additional dependences; or
    it may be related to whatever causes the slowdown for UNRolled.

    The SR (shift-reduce) sum looks cute, but performs so many additional
    instructions, even on iForth, that it is uncompetitive. It's unclear
    to me what slows it down so much on iForth, however.

    I expect that vectorized implementations using AVX will be several
    times faster than the fastest scalar stuff we see here.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Jul 16 15:39:26 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I did not do any accuracy measurements, but I did performance
    measurements on a Ryzen 5800X:

    cycles:u
        gforth-fast          iforth             lxf      SwiftForth             VFX
      3_057_979_501   6_482_017_334   6_087_130_593   6_021_777_424   6_034_560_441  NAI
      6_601_284_920   6_452_716_125   7_001_806_497   6_606_674_147   6_713_703_069  UNR
      3_787_327_724   2_949_273_264   1_641_710_689   7_437_654_901   1_298_257_315  REC
      9_150_679_812  14_634_786_781                                                  SR

    cycles:u

    This second table is about instructions:u

        gforth-fast          iforth             lxf      SwiftForth             VFX
     13_113_842_702   6_264_132_870   9_011_308_923  11_011_828_048   8_072_637_768  NAI
      6_802_702_884   2_553_418_501   4_238_099_417  11_277_658_203   3_244_590_981  UNR
      9_370_432_755   4_489_562_792   4_955_679_285  12_283_918_226   3_915_367_813  REC
     51_113_853_111  29_264_267_850                                                  SR

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Jul 16 16:02:41 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    But I decided to use a recursive approach (recursive-sum, REC) that
    uses the largest 2^k<n as the left child and the rest as the right
    child, and as base cases for the recursion use a straight-line
    balanced-tree evaluation for 2^k with k<=7 (and combine these for n
    that are not 2^k). For systems with tiny FP stacks, I added the
    option to save intermediate results on a software stack in the
    recursive word. Concerning the straight-line code, it turned out that
    the highest k I could use on sf64 and vfx64 is 5 (corresponding to 6
    FP stack items); it's not clear to me why; on lxf I can use k=7 (and
    it uses the 387 stack, too).

    Actually, after writing that, I found out the reasons for the FP stack overflows, and in the published versions and the results I use k=7 on
    all systems. It's really easy to leave an FP stack item on the FP
    stack while calling another word, and that's not so good if you do it
    while calling sum128:-).
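
    The pitfall in a nutshell (an illustrative, hypothetical caller;
    assuming sum128 has the stack effect ( addr -- ) ( F: -- r ) and may
    itself need all 8 x87 cells):

    : two-blocks ( addr -- ) ( F: -- r )
       dup sum128                  \ F: left-sum, parked across the next call
       128 floats + sum128         \ sum128 now has one FP cell less to use
       f+ ;

    The fix is to park the left sum in an FVARIABLE (or on a software
    stack) before calling sum128 again.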

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Jul 16 21:12:13 2025
    From Newsgroup: comp.lang.forth

    Well, that is strange ...

    Results with the current iForth are quite different:

    FORTH> bench ( see file quoted above + usual iForth timing words )
    \ 7963 times
    \ naive-sum : 0.999 seconds elapsed. ( 4968257259 )
    \ unrolled-sum : 1.004 seconds elapsed. ( 4968257259 )
    \ recursive-sum : 0.443 seconds elapsed. ( 4968257259 )
    \ shift-reduce-sum : 2.324 seconds elapsed. ( 4968257259 ) ok

    So here recursive-sum is by far the fastest, and shift-reduce-sum
    is not horribly slow. The slowdown in shift-reduce-sum comes from the
    2nd loop using the external stack.

    -marcel

    PS: Because of recent user requests a development snapshot was
    made available at the usual place.

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Jul 17 12:41:45 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    Well, that is strange ...

    Results with the current iForth are quite different:

    FORTH> bench ( see file quoted above + usual iForth timing words )
    \ 7963 times
    \ naive-sum : 0.999 seconds elapsed. ( 4968257259 )
    \ unrolled-sum : 1.004 seconds elapsed. ( 4968257259 )
    \ recursive-sum : 0.443 seconds elapsed. ( 4968257259 )
    \ shift-reduce-sum : 2.324 seconds elapsed. ( 4968257259 ) ok

    Assuming that you were also using a Ryzen 5800X (or another Zen3-based
    CPU) running at 4.8GHz, accounting for the different number of
    iterations, and that basically all the elapsed time is due to user
    cycles of our benchmark, I defined:

    : scale s>f 4.8e9 f/ 10000e f/ 7963e f* ;

    The output should be the approximate number of seconds. Here's what I
    get from the cycles:u numbers for iForth 5.1-mini given in the earlier
    postings:

    \ ------------ input ---------- | output
    6_482_017_334 scale 7 5 3 f.rdp 1.07534 ok
    6_452_716_125 scale 7 5 3 f.rdp 1.07048 ok
    2_949_273_264 scale 7 5 3 f.rdp 0.48927 ok
    14_634_786_781 scale 7 5 3 f.rdp 2.42785 ok

    The resulting numbers are not very different from those you show. My measurements include iForth's startup overhead, which may be one
    explanation why they are a little higher.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Fri Jul 18 05:25:21 2025
    From Newsgroup: comp.lang.forth

    On Thu, 17 Jul 2025 12:41:45 +0000, Anton Ertl wrote:

    mhx@iae.nl (mhx) writes:
    Well, that is strange ...
    [..]
    The output should be the approximate number of seconds. Here's what I
    get from the cycles:u numbers for iForth 5.1-mini given in the earlier postings:

    \ ------------ input ---------- | output
    6_482_017_334 scale 7 5 3 f.rdp 1.07534 ok
    6_452_716_125 scale 7 5 3 f.rdp 1.07048 ok
    2_949_273_264 scale 7 5 3 f.rdp 0.48927 ok
    14_634_786_781 scale 7 5 3 f.rdp 2.42785 ok

    The resulting numbers are not very different from those you show. My measurements include iForth's startup overhead, which may be one
    explanation why they are a little higher.

    You are right, of course. I was confused by the original posting's
    second table (which showed #instructions but was labeled #cycles).

    ( For the record, I used #7963 iterations of the code to get
    approximately 1 second runtime for the naive implementation. )

    -marcel

    --
    --- Synchronet 3.21a-Linux NewsLink 1.2