In article <300ba9a1581bea9a01ab85d5d361e6eaeedbf23a@i2pn2.org>,
dxf <dxforth@gmail.com> wrote:
On 3/07/2025 10:51 pm, albert@spenarnc.xs4all.nl wrote:
...
I'd like to recall the FORTH2020 YouTube presentation by Wagner. It concerns
motions of aircraft: position, speed, pitch, roll and yaw, etc.
Terribly complicated, no LOCALs. There was a question whether LOCALs
could have made Wagner's code easier.
He stated the ideal (paraphrased by me) that "code is its own comment".
That was an interesting video even if more a rundown of his (long) history
as a professional forth programmer. Here's the link for anyone curious:
https://youtu.be/V9ES9UZHaag
He said he uses the hardware fp stack for speed. Is he really only
using 8 levels of stack?
8 levels are plenty as long as you refrain from recursion, which in
Wagner's context would not be even remotely useful.
In article <nnd$57e17bcd$463b2e07@d86e5bbc05746f06>,
Hans Bezemer <the.beez.speaks@gmail.com> wrote:
On 03-07-2025 01:59, Paul Rubin wrote:
Hans Bezemer <the.beez.speaks@gmail.com> writes:
1. Adding general locals is trivial. It takes just one single line of
Forth.
I don't see how to do it in one line, and trivial is a subjective term.
I'd say in any case that it's not too difficult, but one line seems
overoptimistic. Particularly, you need something like (LOCAL) in the
VM. The rest is just some extensions to the colon compiler. Your
mention of it taking 3-4 screens sounded within reason to me, and I
don't consider that to be a lot of code.
"Short" in my dictionary is. One. Single. Screen. No more. No less (pun
intended).
And this one is one single screen. Even with the dependencies.
https://youtu.be/FH4tWf9vPrA
Typical use:
variable a
variable b
: divide
local a
local b
b ! a ! a @ b @ / ;
Does recursion, the whole enchilada. One line.
Thanks to Fred Behringer - and Albert, who condensed it to a
single-line definition. Praise where praise is due.
Although 'local variables' like this are much preferred (superior),
LOCAL (value) is what is asked for.
If you don't have the awkward, forward-parsing TO already defined, you
are bound to do more work.
Puzzling because of a thread here not long ago in which scientific users
appear to suggest the opposite. Such concerns have apparently been around
a long time:
https://groups.google.com/g/comp.lang.forth/c/CApt6AiFkxo/m/wwZmc_Tr1PcJ
One interesting aspect is that VFX 5.x finally includes an FP package
by default, and it started by including an SSE2-based FP package which
supports a deep FP stack. However, MPE received customer complaints
about the lower number of significant digits in SSE2 (binary64)
vs. 387 (80-bit FP values), so they switched the default to the
387-based FP package that only has 8 FP stack items. Apparently no
MPE customer complains about that limitation.
OTOH, iForth-5.1-mini uses the 387 instructions, but stores FP stack
items in memory at least on call boundaries. Maybe Marcel Hendrix can
give some insight into what made him take this additional
implementation effort.
- anton--
I investigated the instruction set, and I found no way to detect
whether the 8-register stack is full.
That would offer the possibility of spilling registers to memory only
when it is needed.
albert@spenarnc.xs4all.nl wrote on 05.07.2025 at 14:21:
I investigated the instruction set, and I found no way to detect
if the 8 registers stack is full.
This would offer the possibility to spill registers to memory only
if it is needed.
IIRC signaling and handling fp-stack overflow is not an easy task.
At most, the computer would crash.
IOW, spilling makes sense.
dxf <dxforth@gmail.com> writes:
[8 stack items on the FP stack]
Puzzling because of a thread here not long ago in which scientific users
appear to suggest the opposite. Such concerns have apparently been around
a long time:
https://groups.google.com/g/comp.lang.forth/c/CApt6AiFkxo/m/wwZmc_Tr1PcJ
I have read through the thread. It's unclear to me which scientific
users you have in mind. My impression is that 8 stack items was
deemed sufficient by many, and preferable (on 387) for efficiency
reasons.
Certainly, of the two points this thread is about, there was a
Forth200x proposal for standardizing a separate FP stack, and this
proposal was accepted. There was no proposal for increasing the
minimum size of the FP stack; Forth-2012 still says:
|The size of a floating-point stack shall be at least 6 items.
One interesting aspect is that VFX 5.x finally includes an FP package
by default, and it started by including an SSE2-based FP package which
supports a deep FP stack. However, MPE received customer complaints
about the lower number of significant digits in SSE2 (binary64)
vs. 387 (80-bit FP values), so they switched the default to the
387-based FP package that only has 8 FP stack items. Apparently no
MPE customer complains about that limitation.
...
On 5/07/2025 6:49 pm, Anton Ertl wrote:
dxf <dxforth@gmail.com> writes:
[8 stack items on the FP stack]
Puzzling because of a thread here not long ago in which scientific users
appear to suggest the opposite. Such concerns have apparently been around
a long time:
https://groups.google.com/g/comp.lang.forth/c/CApt6AiFkxo/m/wwZmc_Tr1PcJ
I have read through the thread. It's unclear to me which scientific
users you have in mind. My impression is that 8 stack items was
deemed sufficient by many, and preferable (on 387) for efficiency
reasons.
AFAICS both Skip Carter (proponent) and Julian Noble were suggesting the
6-level minimum was inadequate. A similar sentiment was expressed here
only several months ago. AFAIK all major forths supporting x87 hardware
offer software stack options.
Certainly, of the two points this thread is about, there was a
Forth200x proposal for standardizing a separate FP stack, and this
proposal was accepted. There was no proposal for increasing the
minimum size of the FP stack; Forth-2012 still says:
|The size of a floating-point stack shall be at least 6 items.
Only because nothing further was heard. What became of the review
Elizabeth announced I've no idea.
One interesting aspect is that VFX 5.x finally includes an FP package
by default, and it started by including an SSE2-based FP package which
supports a deep FP stack. However, MPE received customer complaints
about the lower number of significant digits in SSE2 (binary64)
vs. 387 (80-bit FP values), so they switched the default to the
387-based FP package that only has 8 FP stack items. Apparently no
MPE customer complains about that limitation.
...
AFAIK the x87 hardware stack was always MPE's main and best supported FP
package. As for SSE2, it wouldn't exist if industry didn't consider
double precision adequate. My impression of MPE's SSE2 implementation
is that it's 'a work in progress'. The basic precision is there but
transcendentals appear to be limited to single precision. That'd be
the reason I'd stick with MPE's x87 package. The other reason is it's
now quite difficult and error-prone to switch FP packages, as it
involves rebuilding the system. The old scheme was simpler and
idiot-proof.
On 6 Jul 2025 at 04:52:37 CEST, "dxf" <dxforth@gmail.com> wrote:
On 5/07/2025 6:49 pm, Anton Ertl wrote:
...
One interesting aspect is that VFX 5.x finally includes an FP package
by default, and it started by including an SSE2-based FP package which
supports a deep FP stack. However, MPE received customer complaints
about the lower number of significant digits in SSE2 (binary64)
vs. 387 (80-bit FP values), so they switched the default to the
387-based FP package that only has 8 FP stack items. Apparently no
MPE customer complains about that limitation.
...
AFAIK x87 hardware stack was always MPE's main and best supported FP
package. As for SSE2 it wouldn't exist if industry didn't consider
double-precision adequate. My impression of MPE's SSE2 implementation
is that it's 'a work in progress'. The basic precision is there but
transcendentals appear to be limited to single-precision. That'd be
the reason I'd stick with MPE's x87 package. Other reason is it's now
quite difficult and error-prone to switch FP packages as it involves
rebuilding the system. The old scheme was simpler and idiot-proof.
You do not have to rebuild the system to switch. Just read the manual.
Recently someone told me about Christianity - how it wasn't meant to be...
easy - supposed to be, among other things, a denial of the senses.
dxf <dxforth@gmail.com> writes:
On 5/07/2025 6:49 pm, Anton Ertl wrote:
dxf <dxforth@gmail.com> writes:
[8 stack items on the FP stack]
Puzzling because of a thread here not long ago in which scientific users
appear to suggest the opposite. Such concerns have apparently been around
a long time:
https://groups.google.com/g/comp.lang.forth/c/CApt6AiFkxo/m/wwZmc_Tr1PcJ
I have read through the thread. It's unclear to me which scientific
users you have in mind. My impression is that 8 stack items was
deemed sufficient by many, and preferable (on 387) for efficiency
reasons.
AFAICS both Skip Carter (proponent) and Julian Noble were suggesting the
6-level minimum was inadequate.
Skip Carter did not post in this thread, but given that he proposed
the change, he probably found 6 to be too few; or maybe it was just a
phenomenon that we also see elsewhere as range anxiety. In any case,
he made no such proposal to Forth-200x, so apparently the need was not
pressing.
Julian Noble ignored the FP stack size issue in his first posting in
this thread, unlike the separate FP stack size issue, which he
supported. So it seems that he did not care about a larger FP stack
size. In the other posting he endorsed moving FP stack items to the
data stack, but he did not write why; for all we know he might have
wanted that as a first step for getting the mantissa, exponent and
sign of the FP value as integer (and the other direction for
synthesizing FP numbers from these parts).
AFAIK all major forths supporting x87 hardware
offer software stack options.
Certainly on SwiftForth-4.0 I find no such option; it apparently
proved unnecessary. The manual mentions fpconfig.f, but no such file
exists in a SwiftForth-4.0 directory in the versions I have installed.
There exists such a file on various SwiftForth-3.x versions, but on
most of our machines SwiftForth-3.x segfaults (I have not investigated
why; it used to work). Ok, so I found an old system where it does not
segfault, but trying to load FP on that system produced no joy:
...
...
If I want to switch from the default FP package to a different
package, I essentially have to take the same steps, I only have to add
two additional commands before including the FP package; the last
command for including the SSE implementation becomes:
vfx64 "integers remove-FP-pack include /nfs/nfstmp/anton/VfxForth64Lin-5.43/Lib/x64/FPSSE64S.fth"
(A special twist here is that the documentation says that the file is
called FPSSE64.fth (with only 2 S characters), so I needed a few more
locate invocations to find the right one).
If you find the former simple, why not the latter (apart from the
documentation mistake)?
In any case, in almost all cases I use the default FP pack, and here
the VFX-5 and SwiftForth-4 approach is unbeatable in simplicity.
Instead of performing the sequence of commands shown above, I just
start the Forth system, and FP words are ready.
- anton
On 6/07/2025 9:30 pm, Anton Ertl wrote:
dxf <dxforth@gmail.com> writes:
On 5/07/2025 6:49 pm, Anton Ertl wrote:
dxf <dxforth@gmail.com> writes:
[8 stack items on the FP stack]
Puzzling because of a thread here not long ago in which scientific users
appear to suggest the opposite. Such concerns have apparently been around
a long time:
https://groups.google.com/g/comp.lang.forth/c/CApt6AiFkxo/m/wwZmc_Tr1PcJ
I have read through the thread. It's unclear to me which scientific
users you have in mind. My impression is that 8 stack items was
deemed sufficient by many, and preferable (on 387) for efficiency
reasons.
AFAICS both Skip Carter (proponent) and Julian Noble were suggesting the
6-level minimum was inadequate.
Skip Carter did not post in this thread, but given that he proposed
the change, he probably found 6 to be too few; or maybe it was just a
phenomenon that we also see elsewhere as range anxiety. In any case,
he made no such proposal to Forth-200x, so apparently the need was not
pressing.
Julian Noble ignored the FP stack size issue in his first posting in
this thread, unlike the separate FP stack size issue, which he
supported. So it seems that he did not care about a larger FP stack
size. In the other posting he endorsed moving FP stack items to the
data stack, but he did not write why; for all we know he might have
wanted that as a first step for getting the mantissa, exponent and
sign of the FP value as integer (and the other direction for
synthesizing FP numbers from these parts).
He appears to dislike the idea of standard-imposed minimums (e.g.
Carter's suggestion of 16) but suggested:
a) the user can offload to memory if necessary from
fpu hardware;
b) an ANS FLOATING and FLOATING EXT wordset includes
the necessary hooks to extend the fp stack.
On 07-07-2025 05:48, dxf wrote:
...
He appears to dislike the idea of standard-imposed minimums (e.g.
Carter's suggestion of 16) but suggested:
  a) the user can offload to memory if necessary from
     fpu hardware;
  b) an ANS FLOATING and FLOATING EXT wordset includes
     the necessary hooks to extend the fp stack.
In 4tH, there are two (high-level) FP systems, with 6 predetermined
configurations. Configs 0-2 don't have an FP stack; they use the data
stack. Configs 3-5 have a separate FP stack and double the precision.
The standard FP stack size is 16; you can extend it by defining a
constant before including the FP libs.
I don't do parallelization, but I was still surprised by the good
results using FMA. In other words, increasing floating-point number
size is not always the way to go.
Anyhow, the first step is to select the best fp rounding method ....
dxf <dxforth@gmail.com> writes:
As for SSE2 it wouldn't exist if industry didn't consider
double-precision adequate.
SSE2 is/was first and foremost a vectorizing extension, and it has been
superseded quite a few times, indicating it was never all that
adequate.
I don't know whether any of its successors support extended
precision though.
W. Kahan was a big believer in extended precision (that's why the 8087
had it from the start). I believe IEEE specifies both 80 bit and 128
bit formats in addition to 64 bit.
I suspect IEEE simply standardized what had become common practice among
implementers.
By using 80 bits /internally/ Intel went a long way to
achieving IEEE's spec for double precision.
E.g. doing something as simple as changing the
sign of an fp number is a pain when NaNs are factored in.
The catch with SSE is there's nothing like FCHS or FABS,
so depending on how one implements them, results vary across
implementations.
"Industry" can manage well with 32-bit
floats or even smaller with non-standard number formats.
On 10 Jul 2025 at 02:18:50 CEST, "minforth" <minforth@gmx.net> wrote:
"Industry" can manage well with 32-bit
floats or even smaller with non-standard number formats.
My customers beg to differ, and some use 128 bit numbers for
their work. In a construction estimate for one runway for the
new Hong Kong airport, the cost difference between a 64 bit FP
calculation and the integer calculation was US$10 million.
This was for pile capping, which involves a large quantity of relatively
small differences.
dxf <dxforth@gmail.com> writes:
The catch with SSE is there's nothing like FCHS or FABS
so depending on how one implements them, results vary across implementations.
You can see in Gforth how to implement FNEGATE and FABS with SSE2:
see fnegate
Code fnegate
0x000055e6a78a8274: add $0x8,%rbx
0x000055e6a78a8278: xorpd 0x24d8f(%rip),%xmm15 # 0x55e6a78cd010
0x000055e6a78a8281: mov %r15,%r9
0x000055e6a78a8284: mov (%rbx),%rax
0x000055e6a78a8287: jmp *%rax
end-code
ok
0x55e6a78cd010 16 dump
55E6A78CD010: 00 00 00 00 00 00 00 80 - 00 00 00 00 00 00 00 00
ok
see fabs
Code fabs
0x000055e6a78a84fe: add $0x8,%rbx
0x000055e6a78a8502: andpd 0x24b15(%rip),%xmm15 # 0x55e6a78cd020
0x000055e6a78a850b: mov %r15,%r9
0x000055e6a78a850e: mov (%rbx),%rax
0x000055e6a78a8511: jmp *%rax
end-code
ok
0x55e6a78cd020 16 dump
55E6A78CD020: FF FF FF FF FF FF FF 7F - 00 00 00 00 00 00 00 00
The actual implementation is the xorpd instruction for FNEGATE and
the andpd instruction for FABS. The memory locations contain masks:
for FNEGATE only the sign bit is set, for FABS everything but the sign
bit is set.
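The same mask trick is easy to reproduce in software. Here is a minimal Python sketch (helper names are mine, not from any system discussed) that applies the xorpd/andpd masks shown in the dumps above to the bit pattern of a binary64 value; note that flipping or clearing the sign bit works on NaNs and infinities without any arithmetic:

```python
import math
import struct

SIGN_BIT = 0x8000_0000_0000_0000  # mask used by FNEGATE (xorpd)
ABS_MASK = 0x7FFF_FFFF_FFFF_FFFF  # mask used by FABS (andpd)

def bits(x: float) -> int:
    """Bit pattern of a binary64 value as an unsigned 64-bit integer."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def from_bits(b: int) -> float:
    """Inverse of bits()."""
    return struct.unpack('<d', struct.pack('<Q', b))[0]

def fnegate(x: float) -> float:
    # flip only the sign bit, as xorpd does
    return from_bits(bits(x) ^ SIGN_BIT)

def fabs_(x: float) -> float:
    # clear the sign bit, as andpd does
    return from_bits(bits(x) & ABS_MASK)

print(fnegate(1.5))                        # -1.5
print(fabs_(-2.0))                         # 2.0
print(math.isnan(fnegate(float('nan'))))   # True: sign flip never signals
```

Unlike an arithmetic negation (0.0 - x), the bit-mask version also maps +0.0 to -0.0 and leaves NaN payloads untouched.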
Sure you can implement FNEGATE and FABS in more complicated ways, but
you can also implement them in more complicated ways if you use the
387 instruction set. Here's an example of more complicated
implementations:
see fnegate
FNEGATE
( 004C4010 4833C0 ) XOR RAX, RAX
( 004C4013 F34D0F7EC8 ) MOVQ XMM9, XMM8
( 004C4018 664C0F6EC0 ) MOVQ XMM8, RAX
( 004C401D F2450F5CC1 ) SUBSD XMM8, XMM9
( 004C4022 C3 ) RET/NEXT
( 19 bytes, 5 instructions )
ok
see fabs
FABS
( 004C40B0 E8FBEFFFFF ) CALL 004C30B0 FS@
( 004C40B5 4885DB ) TEST RBX, RBX
( 004C40B8 488B5D00 ) MOV RBX, [RBP]
( 004C40BC 488D6D08 ) LEA RBP, [RBP+08]
( 004C40C0 0F8D05000000 ) JNL/GE 004C40CB
( 004C40C6 E845FFFFFF ) CALL 004C4010 FNEGATE
( 004C40CB C3 ) RET/NEXT
( 28 bytes, 7 instructions )
I believe IEEE specifies both 80 bit and 128 bit formats in addition
to 64 bit.
Not 80-bit format. binary128 and binary256 are specified.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I believe IEEE specifies both 80 bit and 128 bit formats in addition
to 64 bit.
Not 80-bit format. binary128 and binary256 are specified.
I see, 80 bits is considered double-extended. "The x87 and Motorola
68881 80-bit formats meet the requirements of the IEEE 754-1985 double
extended format,[12] as does the IEEE 754 128-bit binary format."
(https://en.wikipedia.org/wiki/Extended_precision)
Interestingly, Kahan's 1997 report on IEEE 754's status does say 80 bit
is specified. But it sounds like that omits some nuance.
https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
Paul Rubin wrote on 10.07.2025 at 21:33:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I believe IEEE specifies both 80 bit and 128 bit formats in addition
to 64 bit.
Not 80-bit format. binary128 and binary256 are specified.
I see, 80 bits is considered double-extended. "The x87 and Motorola
68881 80-bit formats meet the requirements of the IEEE 754-1985 double
extended format,[12] as does the IEEE 754 128-bit binary format."
(https://en.wikipedia.org/wiki/Extended_precision)
Interestingly, Kahan's 1997 report on IEEE 754's status does say 80 bit
is specified. But it sounds like that omits some nuance.
https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
Kahan was also overly critical of dynamic Unum/Posit formats.
Time has shown that he was partially wrong: https://spectrum.ieee.org/floating-point-numbers-posits-processor
minforth <minforth@gmx.net> writes:
Kahan was also overly critical of dynamic Unum/Posit formats.
Time has shown that he was partially wrong:
https://spectrum.ieee.org/floating-point-numbers-posits-processor
I don't feel qualified to draw a conclusion from this. I wonder what
the numerics community thinks, if there is any consensus. I remember
being dubious of posits when I first heard of them, though Kahan
probably influenced that. I do know that IEEE 754 took a lot of trouble
to avoid undesirable behaviours that never would have occurred to most
of us. No idea how well posits do at that. I guess though, given the
continued attention they get, they must be more interesting than I had
thought.
I saw one of the posit articles criticizing IEEE 754 because IEEE 754 addition is not always associative. But that is inherent in how
floating point arithmetic works, and I don't see how posit addition can
avoid it. Let a = 1e100, b = -1e100, and c=1. So mathematically,
a+b+c=1. You should get that from (a+b)+c in your favorite floating
point format. But a+(b+c) will almost certainly be 0, without very high precision (300+ bits).
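The example above is easy to check numerically; a quick Python sketch of the a, b, c case:

```python
a = 1e100
b = -1e100
c = 1.0

# (a+b) cancels exactly, so the 1.0 survives:
print((a + b) + c)   # 1.0

# but b+c rounds to b (1.0 is far below half an ulp of 1e100),
# so the 1.0 is absorbed and lost:
print(a + (b + c))   # 0.0
```

Any fixed-width floating-point format, posits included, absorbs a term that is far below the rounding granularity of its partner, which is why addition cannot be associative in general.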
When someone begins with a line like this, it rarely ends well:
"Twenty years ago anarchy threatened floating-point arithmetic."
One floating-point to rule them all.
dxf <dxforth@gmail.com> writes:
When someone begins with a line like this, it rarely ends well:
"Twenty years ago anarchy threatened floating-point arithmetic."
One floating-point to rule them all.
This gives a good perspective on posits:
https://people.eecs.berkeley.edu/~demmel/ma221_Fall20/Dinechin_etal_2019.pdf
Floating point arithmetic in the 1960s (before my time) was really in a terrible state. Kahan has written about it. Apparently IBM 360
floating point arithmetic had to be redesigned after the fact, because
the original version had such weird anomalies.
But was it the case by the mid/late 70's - or certain individuals saw an
opportunity to influence the burgeoning microprocessor market? Notions of
single and double precision already existed in software floating point -
I have looked at a (IIRC) slide deck by Kahan where he shows examples
where the alternative by Gustafson (I don't remember which one he
looked at in that slide deck) fails and traditional FP numbers work.
I guess though, given the continued attention they get, they must be
more interesting than I had thought.
I saw one of the posit articles criticizing IEEE 754 because IEEE 754
addition is not always associative. But that is inherent in how
floating point arithmetic works, and I don't see how posit addition can
avoid it.
On 11/07/2025 1:17 pm, Paul Rubin wrote:
This gives a good perspective on posits:
https://people.eecs.berkeley.edu/~demmel/ma221_Fall20/Dinechin_etal_2019.pdf
Floating point arithmetic in the 1960s (before my time) was really in a
terrible state. Kahan has written about it. Apparently IBM 360
floating point arithmetic had to be redesigned after the fact, because
the original version had such weird anomalies.
But was it the case by the mid/late 70's - or certain individuals saw an
opportunity to influence the burgeoning microprocessor market?
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I have looked at a (IIRC) slide deck by Kahan where he shows examples
where the altenarnative by Gustafson (don't remember which one he
looked at in that slide deck) fails and traditional FP numbers work.
Maybe this: http://people.eecs.berkeley.edu/~wkahan/UnumSORN.pdf
mhx@iae.nl (mhx) writes:
What is there not to like with the FPU? It provides 80 bits, which
is in itself a useful additional format, and should never have problems
with single and double-precision edge cases.
If you want to do double precision, using the 387 stack has the
double-rounding problem
<https://en.wikipedia.org/wiki/Rounding#Double_rounding>. Even if you
limit the mantissa to 53 bits, you still get double rounding when you
deal with numbers that are denormal numbers in binary64
representation. Java wanted to give the same results, bit for bit, on
all hardware, and ran afoul of this until they could switch to SSE2.
The only problem is that some languages and companies find it necessary
to boycott FPU use.
The rest of the industry has standardized on binary64 and binary32,
and they prefer bit-equivalent results for ease of testing. So as
soon as SSE2 gave that to them, they flocked to SSE2.
...
On 11/07/2025 8:22 pm, Anton Ertl wrote:
The rest of the industry has standardized on binary64 and binary32,
and they prefer bit-equivalent results for ease of testing. So as
soon as SSE2 gave that to them, they flocked to SSE2.
...
I wonder how much of this is academic or trend-inspired? AFAICS Forth
clients haven't flocked to it, else vendors would have SSE2 offerings at
the same level as their x87 packs.
On 13/07/2025 7:01 pm, Anton Ertl wrote:
...
For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system was
the only one with hardware FP for many years, so there probably was
little pressure from users for bit-identical results with, say, SPARC,
because they did not have a Forth system that ran on SPARC.
What do you mean by "bit-identical results"? Since SSE2 comes without
transcendentals (or basics such as FABS and FNEGATE) and implementers
are expected to supply their own, if anything, I expect results across
platforms and compilers to vary.
So just use the same implementations of transcendental functions, and
your results will be bit-identical.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
So just use the same implementations of transcendental functions, and
your results will be bit-identical.
Same implementations = same FP operations in the exact same order? That
seems hard to ensure, if the functions are implemented in a language
that leaves anything up to a compiler.
Also, in the early implementations (x87, 68881, NS320something(?)),
transcendentals were included in the coprocessor and the workings
weren't visible.
This looks very interesting. I can find Kahan and Neumaier, but
"tree addition" didn't turn up (there is a suspicious-looking
reliability paper about the approach which surely is not what
you meant). Or is pairwise addition what I should look for?
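For what it's worth, "tree addition" here presumably means the same idea as pairwise summation: split the array in half, sum each half recursively, and add the two partial sums, so that intermediate sums stay balanced in magnitude and the worst-case rounding error grows like O(log n) rather than O(n). A minimal Python sketch (the function name is mine):

```python
def pairwise_sum(xs, lo=0, hi=None):
    """Balanced-tree (pairwise) summation of a sequence of floats."""
    if hi is None:
        hi = len(xs)
    n = hi - lo
    if n == 0:
        return 0.0
    if n == 1:
        return xs[lo]
    mid = lo + n // 2
    # sum the two halves independently, then combine once
    return pairwise_sum(xs, lo, mid) + pairwise_sum(xs, mid, hi)

# all partial sums here are integers below 2**53, so the result is exact:
print(pairwise_sum([float(i) for i in range(1, 1001)]))  # 500500.0
```

The recursive structure is also what makes the REC variant discussed in this thread friendly to superscalar CPUs: the two half-sums form independent dependency chains.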
dxf <dxforth@gmail.com> writes:
On 13/07/2025 7:01 pm, Anton Ertl wrote:
...
For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system
was the only one with hardware FP for many years, so there
probably was little pressure from users for bit-identical results
with, say, SPARC, because they did not have a Forth system that
ran on SPARC.
What do you mean by "bit-identical results"? Since SSE2 comes
without transcendentals (or basics such as FABS and FNEGATE) and
implementers are expected to supply their own, if anything, I expect
results across platforms and compilers to vary.
There are operations for which IEEE 754 specifies the result to the
last bit (except that AFAIK the representation of NaNs is not
specified exactly), among them F+ F- F* F/ FSQRT, probably also
FNEGATE and FABS. It does not specify the exact result for
transcendental functions, but if your implementation performs the same bit-exact operations for computing a transcendental function on two
IEEE 754 compliant platforms, the result will be bit-identical (if it
is a number). So just use the same implementations of transcendental
functions, and your results will be bit-identical; concerning the
NaNs, if you find a difference, check if the involved values are NaNs.
- anton
Anton Ertl wrote on 16.07.2025 at 13:25:
I did not do any accuracy measurements, but I did performance
measurements
YMMV but "fast but wrong" would not be my goal. ;-)
minforth <minforth@gmx.net> writes:
Anton Ertl wrote on 16.07.2025 at 13:25:
I did not do any accuracy measurements, but I did performance
measurements
YMMV but "fast but wrong" would not be my goal. ;-)
I did test correctness with cases where roundoff errors do not play a
role.
As mentioned, the RECursive balanced-tree sum (which is also the
fastest on several systems, and in absolute terms) is expected to be
more accurate in those cases where roundoff errors do play a role. But
if you care about that, better design a test and run it yourself. It
will be interesting to see how you find out which result is more
accurate when they differ.
I have run this test now on my Ryzen 9950X for lxf, lxf64 and a
snapshot of gforth.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I did not do any accuracy measurements, but I did performance
measurements on a Ryzen 5800X:
cycles:u
gforth-fast    iforth         lxf            SwiftForth     VFX
3_057_979_501  6_482_017_334  6_087_130_593  6_021_777_424  6_034_560_441  NAI
6_601_284_920  6_452_716_125  7_001_806_497  6_606_674_147  6_713_703_069  UNR
3_787_327_724  2_949_273_264  1_641_710_689  7_437_654_901  1_298_257_315  REC
9_150_679_812 14_634_786_781                                               SR
This second table is about instructions:u
gforth-fast iforth lxf SwiftForth VFX
13_113_842_702 6_264_132_870 9_011_308_923 11_011_828_048 8_072_637_768 NAI
6_802_702_884 2_553_418_501 4_238_099_417 11_277_658_203 3_244_590_981 UNR
9_370_432_755 4_489_562_792 4_955_679_285 12_283_918_226 3_915_367_813 REC
51_113_853_111 29_264_267_850 SR
- anton
Ryzen 9950X
lxf64
5,010,566,495 NAI cycles:u
2,011,359,782 UNR cycles:u
646,926,001 REC cycles:u
3,589,863,082 SR cycles:u
lxf64
7,019,247,519 NAI instructions:u
4,128,689,843 UNR instructions:u
4,643,499,656 REC instructions:u
25,019,182,759 SR instructions:u
gforth-fast 20250219
2,048,316,578 NAI cycles:u
7,157,520,448 UNR cycles:u
3,589,638,677 REC cycles:u
17,199,889,916 SR cycles:u
gforth-fast 20250219
13,107,999,739 NAI instructions:u
6,789,041,049 UNR instructions:u
9,348,969,966 REC instructions:u
50,108,032,223 SR instructions:u
lxf
6,005,617,374 NAI cycles:u
6,004,157,635 UNR cycles:u
1,303,627,835 REC cycles:u
9,187,422,499 SR cycles:u
lxf
9,010,888,196 NAI instructions:u
4,237,679,129 UNR instructions:u
4,955,258,040 REC instructions:u
26,018,680,499 SR instructions:u
lxf uses the x87 built-in FP stack; lxf64 uses SSE4 and a large FP stack.
Meanwhile many years ago, comparative tests were carried out with a
couple of representative archived serial data (~50k samples).
Ultimately, Kahan summation
was the winner. It is slow, but there were no in-the-loop
requirements, so for a background task, Kahan was fast enough.
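Kahan (compensated) summation itself is only a few lines; a hedged
Python sketch of the standard algorithm (my code, not minforth's):

```python
import math

def kahan_sum(xs):
    # compensated (Kahan) summation: c carries the low-order bits lost
    # by each addition and feeds them back into the next term
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c        # apply the correction from the previous step
        t = s + y        # low-order bits of y can be lost here
        c = (t - s) - y  # algebraically zero; numerically, the lost bits
        s = t
    return s

xs = [0.1] * 1000
err_naive = abs(sum(xs) - math.fsum(xs))
err_kahan = abs(kahan_sum(xs) - math.fsum(xs))
print(err_kahan <= err_naive)  # True: compensation beats naive accumulation
```

The extra dependent operations per term are what makes it slow in the
loop, as noted above.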
minforth <minforth@gmx.net> writes:
Meanwhile many years ago, comparative tests were carried out with a
couple of representative archived serial data (~50k samples)
Representative of what? Serial: what series?
peter <peter.noreply@tin.it> writes:
Ryzen 9950X
lxf64
5,010,566,495 NAI cycles:u
2,011,359,782 UNR cycles:u
646,926,001 REC cycles:u
3,589,863,082 SR cycles:u
lxf64
7,019,247,519 NAI instructions:u
4,128,689,843 UNR instructions:u
4,643,499,656 REC instructions:u
25,019,182,759 SR instructions:u
gforth-fast 20250219
2,048,316,578 NAI cycles:u
7,157,520,448 UNR cycles:u
3,589,638,677 REC cycles:u
17,199,889,916 SR cycles:u
gforth-fast 20250219
13,107,999,739 NAI instructions:u
6,789,041,049 UNR instructions:u
9,348,969,966 REC instructions:u
50,108,032,223 SR instructions:u
lxf
6,005,617,374 NAI cycles:u
6,004,157,635 UNR cycles:u
1,303,627,835 REC cycles:u
9,187,422,499 SR cycles:u
lxf
9,010,888,196 NAI instructions:u
4,237,679,129 UNR instructions:u
4,955,258,040 REC instructions:u
26,018,680,499 SR instructions:u
lxf uses the x87 built-in FP stack; lxf64 uses SSE4 and a large FP stack.
Apparently the latency of ADDSD (SSE2) is down to 2 cycles on Zen5
(visible in lxf64 UNR and gforth-fast NAI) while the latency of FADD
(387) is still 6 cycles (lxf NAI and UNR). I have no explanation why
on lxf64 NAI performs so much worse than UNR, and in gforth-fast UNR
so much worse than NAI.
For REC the latency should not play a role. There lxf64 performs at
7.2 IPC and 1.55 F+/cycle, whereas lxf performs only at 3.8 IPC and
0.77 F+/cycle. My guess is that FADD can only be performed by one FPU,
that this FPU is connected to one dispatch port, and that other
instructions also need, or are at least assigned to, this dispatch port.
- anton
dxf <dxforth@gmail.com> writes:
On 13/07/2025 7:01 pm, Anton Ertl wrote:
...
For Forth, Inc. and MPE AFAIK their respective IA-32 Forth system was
the only one with hardware FP for many years, so there probably was
little pressure from users for bit-identical results with, say, SPARC,
because they did not have a Forth system that ran on SPARC.
What do you mean by "bit-identical results"? Since SSE2 comes without
transcendentals (or basics such as FABS and FNEGATE) and implementers
are expected to supply their own, if anything, I expect results across
platforms and compilers to vary.
There are operations for which IEEE 754 specifies the result to the
last bit (except that AFAIK the representation of NaNs is not
specified exactly), among them F+ F- F* F/ FSQRT, probably also
FNEGATE and FABS. It does not specify the exact result for
transcendental functions, but if your implementation performs the same
bit-exact operations for computing a transcendental function on two
IEEE 754 compliant platforms, the result will be bit-identical (if it
is a number). So just use the same implementations of transcendental
functions, and your results will be bit-identical; concerning the
NaNs, if you find a difference, check if the involved values are NaNs.
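One practical way to check such claims is to compare raw bit patterns
rather than printed values; a small Python illustration (the helper name
`bits` is mine):

```python
import math
import struct

def bits(x):
    # raw little-endian 64-bit pattern of a double, for bit-exact comparison
    return struct.pack('<d', x).hex()

# F+ F- F* F/ FSQRT are specified to the last bit (correctly rounded),
# so these patterns are reproducible on any compliant binary64 platform:
print(bits(0.1 + 0.2))
print(bits(math.sqrt(2.0)))

# bit-exact comparison also exposes last-bit differences that plain
# printing may hide:
print(bits(0.1 + 0.2) == bits(0.3))  # False: they differ in the last bit
```

The same trick (comparing the stored bit pattern, not the decimal
rendering) is how you would verify bit-identical results across two
Forth systems.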
So in mandating bit-identical results, not only in calculations but also
in input/output, IEEE 754 is all about giving the illusion of truth in
floating-point when, if anything, they should be warning users: don't be
fooled.
dxf <dxforth@gmail.com> writes:
So in mandating bit-identical results, not only in calculations but also
input/output
I don't think that IEEE 754 specifies I/O, but I could be wrong.
IEEE 754 is all about giving the illusion of truth in
floating-point when, if anything, they should be warning users don't be
fooled.
I don't think that IEEE 754 mentions truth. It does, however, specify
the inexact "exception" (actually a flag), which allows you to find
out if the results of the computations are exact or if some rounding
was involved.
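For illustration, Python's decimal module implements the same
status-flag model (from the decimal side of IEEE 754), so the inexact
flag is easy to poke at without C-level fenv access:

```python
from decimal import Decimal, getcontext, Inexact

ctx = getcontext()
ctx.clear_flags()
Decimal(1) / Decimal(4)          # 0.25: representable, no rounding needed
print(bool(ctx.flags[Inexact]))  # False: the result was exact
Decimal(1) / Decimal(3)          # 0.333...: rounded to context precision
print(bool(ctx.flags[Inexact]))  # True: some rounding was involved
```

The flag is sticky: once raised, it stays set until cleared, so a whole
computation can be checked for exactness after the fact.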
AFAICS IEEE 754 offers nothing particularly useful for the end-user.
Either one's fp application works - or it doesn't. IEEE hasn't
changed that.
IEEE's relevance is that it spurred Intel into making an FPU which in
turn made implementing fp easy.
dxf <dxforth@gmail.com> writes:
AFAICS IEEE 754 offers nothing particularly useful for the end-user.
Either one's fp application works - or it doesn't. IEEE hasn't
changed that.
The purpose of IEEE FP was to improve the numerical accuracy of
applications that used it as opposed to other formats.
IEEE's relevance is that it spurred Intel into making an FPU which in
turn made implementing fp easy.
Exactly the opposite: Intel decided that it wanted to make an FPU, and
it wanted the FPU to have the best FP arithmetic possible. So it
commissioned Kahan (a renowned FP expert) to design the FP format.
Kahan said "Why not use the VAX format? It is pretty good". Intel said
it didn't want pretty good, it wanted the best, so Kahan said "ok" and
designed the 8087 format.
The IEEE standardization process happened AFTER the 8087 was already in
progress. Other manufacturers signed onto it, some of them overcoming
initial resistance, after becoming convinced that it was the right
thing.
http://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html
mhx wrote:
On Sun, 6 Oct 2024 7:51:31 +0000, dxf wrote:
Is there an easier way of doing this? End goal is a double number representing centi-secs.
empty decimal
: SPLIT ( a u c -- a2 u2 a3 u3 ) >r 2dup r> scan 2swap 2 pick - ;
: >INT ( adr len -- u ) 0 0 2swap >number 2drop drop ;
: /T ( a u -- $hour $min $sec )
2 0 do [char] : split 2swap dup if 1 /string then loop
2 0 do dup 0= if 2rot 2rot then loop ;
: .T 2swap 2rot cr >int . ." hr " >int . ." min " >int . ." sec " ;
s" 1:2:3" /t .t
s" 02:03" /t .t
s" 03" /t .t
s" 23:59:59" /t .t
s" 0:00:03" /t .t
Why don't you use the fact that >NUMBER returns the given
string starting with the first unconverted character?
SPLIT should be redundant.
-marcel
: CHAR-NUMERIC? ( char -- flag ) 48 58 WITHIN ;
: SKIP-NON-NUMERIC ( adr u -- adr2 u2)
BEGIN
DUP IF OVER C@ CHAR-NUMERIC? NOT ELSE 0 THEN
WHILE
1 /STRING
REPEAT ;
: SCAN-NEXT-NUMBER ( n adr len -- n2 adr2 len2)
2>R 60 * 0. 2R> >NUMBER
2>R D>S + 2R> ;
: PARSE-TIME ( adr len -- seconds)
0 -ROT
BEGIN
SKIP-NON-NUMERIC
DUP
WHILE
SCAN-NEXT-NUMBER
REPEAT
2DROP ;
S" hello 1::36 world" PARSE-TIME CR .
96 ok
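For comparison, a Python restatement of the PARSE-TIME scheme above (my
code, not the poster's): each number found multiplies the running total
by 60 and is added in.

```python
def parse_time(s):
    # mirror of PARSE-TIME: skip non-digits, convert a digit run,
    # fold it in as total*60 + value
    total = 0
    i, n = 0, len(s)
    while i < n:
        while i < n and not s[i].isdigit():
            i += 1                 # SKIP-NON-NUMERIC
        if i == n:
            break
        j = i
        while j < n and s[j].isdigit():
            j += 1                 # >NUMBER stops at the first non-digit
        total = total * 60 + int(s[i:j])
        i = j
    return total

print(parse_time("hello 1::36 world"))  # 96, matching the Forth run above
```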
: get-number ( accum adr len -- accum' adr' len' )
{ adr len }
0. adr len >number { adr' len' }
len len' =
if
2drop adr len 1 /string
else
d>s swap 60 * +
adr' len'
then ;
: parse-time ( adr len -- seconds)
0 -rot
begin
dup
while
get-number
repeat
2drop ;
s" foo-bar" parse-time . 0
s" foo55bar" parse-time . 55
s" foo 1 bar 55 zoo" parse-time . 155
...
s" and9foo 1 bar 55 zoo" parse-time . 32515