On 05.07.2025 at 14:41, minforth wrote:
On 05.07.2025 at 14:21, albert@spenarnc.xs4all.nl wrote:
I investigated the instruction set, and I found no way to detect
whether the 8-register stack is full.
That would make it possible to spill registers to memory only
when it is actually needed.
IIRC, signaling and handling FP-stack overflow is not an easy task.
At worst, the computer would crash.
IOW, spilling makes sense.
A deep dive into the manual
... the C1 condition code flag is used for a variety of functions.
When both the IE and SF flags in the x87 FPU status word are set,
indicating a stack overflow or underflow exception (#IS), the C1
flag distinguishes between overflow (C1=1) and underflow (C1=0).
dxf <dxforth@gmail.com> writes:
But was that the case by the mid/late '70s - or did certain individuals
see an opportunity to influence the burgeoning microprocessor market?
Notions of single and double precision already existed in software
floating point -
Hardware floating point also had single and double precision. The
really awful 1960s systems were gone by the mid 70s. But there were a
lot of competing formats, ranging from bad to mostly-ok. VAX floating
point was mostly ok, DEC wanted IEEE to adopt it, Kahan was ok with
that, but Intel thought "go for the best possible". Kahan's
retrospectives on this stuff are good reading:
What is there not to like about the FPU? It provides 80 bits, which
is in itself a useful additional format, and should never have problems
with single and double-precision edge cases.
The only problem is that some languages and companies find it necessary
to boycott FPU use.
[..] if your implementation performs the same
bit-exact operations for computing a transcendental function on two
IEEE 754 compliant platforms, the result will be bit-identical (if it
is a number). So just use the same implementations of transcendental
functions, and your results will be bit-identical; concerning the
NaNs, if you find a difference, check if the involved values are NaNs.
On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:
[..] if your implementation performs the same
bit-exact operations for computing a transcendental function on two
IEEE 754 compliant platforms, the result will be bit-identical (if it
is a number). So just use the same implementations of transcendental
functions, and your results will be bit-identical; concerning the
NaNs, if you find a difference, check if the involved values are NaNs.
When e.g. summing the elements of a DP vector, it is hard to see why
that couldn't be done on the FPU stack (with 80 bits) before (possibly)
storing the result to a DP variable in memory. I am not sure that Forth
users would be able to resist that approach.
mhx@iae.nl (mhx) writes:[..]
On Mon, 14 Jul 2025 6:04:13 +0000, Anton Ertl wrote:
The question is: What properties do you want your computation to have?[..]
2) A more accurate result? How much more accuracy?
3) More performance?
C) Perform tree addition
a) Using 80-bit addition. This will be faster than sequential
addition because in many cases several additions can run in
parallel. It will also be quite accurate because it uses 80-bit
addition, and because the addition chains are reduced to
ld(length(vector)), i.e. the binary log of the vector length.
So, as you can see, depending on your objectives there may be more
attractive ways to add a vector than what you suggested. Your
suggestion actually looks pretty unattractive, except if your
objectives are "ease of implementation" and "more accuracy than the
naive approach".
On Mon, 14 Jul 2025 7:50:04 +0000, Anton Ertl wrote:
C) Perform tree addition
a) Using 80-bit addition. This will be faster than sequential
addition because in many cases several additions can run in
parallel. It will also be quite accurate because it uses 80-bit
addition, and because the addition chains are reduced to
ld(length(vector)).
This looks very interesting. I can find Kahan and Neumaier, but
"tree addition" didn't turn up. (There is a suspicious-looking
reliability paper about the approach, which is surely not what
you meant.) Or is it pairwise addition that I should look for?
I did not do any accuracy measurements, but I did performance
measurements on a Ryzen 5800X:
cycles:u
  gforth-fast        iforth           lxf    SwiftForth           VFX
3_057_979_501 6_482_017_334 6_087_130_593 6_021_777_424 6_034_560_441  NAI
6_601_284_920 6_452_716_125 7_001_806_497 6_606_674_147 6_713_703_069  UNR
3_787_327_724 2_949_273_264 1_641_710_689 7_437_654_901 1_298_257_315  REC
9_150_679_812 14_634_786_781                                           SR
cycles:u
   gforth-fast         iforth            lxf     SwiftForth            VFX
13_113_842_702  6_264_132_870  9_011_308_923 11_011_828_048  8_072_637_768  NAI
 6_802_702_884  2_553_418_501  4_238_099_417 11_277_658_203  3_244_590_981  UNR
 9_370_432_755  4_489_562_792  4_955_679_285 12_283_918_226  3_915_367_813  REC
51_113_853_111 29_264_267_850                                               SR
But I decided to use a recursive approach (recursive-sum, REC) that
uses the largest 2^k<n as the left child and the rest as the right
child, and as base cases for the recursion use a straight-line
balanced-tree evaluation for 2^k with k<=7 (and combine these for n
that are not 2^k). For systems with tiny FP stacks, I added the
option to save intermediate results on a software stack in the
recursive word. Concerning the straight-line code, it turned out that
the highest k I could use on sf64 and vfx64 is 5 (corresponding to 6
FP stack items); it's not clear to me why; on lxf I can use k=7 (and
it uses the 387 stack, too).
Well, that is strange ...
Results with the current iForth are quite different:
FORTH> bench ( see file quoted above + usual iForth timing words )
\ 7963 times
\ naive-sum : 0.999 seconds elapsed. ( 4968257259 )
\ unrolled-sum : 1.004 seconds elapsed. ( 4968257259 )
\ recursive-sum : 0.443 seconds elapsed. ( 4968257259 )
\ shift-reduce-sum : 2.324 seconds elapsed. ( 4968257259 ) ok
mhx@iae.nl (mhx) writes:[..]
Well, that is strange ...
The output should be the approximate number of seconds. Here's what I
get from the cycles:u numbers for iForth 5.1-mini given in the earlier
postings:
\ ------------ input ---------- | output
6_482_017_334 scale 7 5 3 f.rdp 1.07534 ok
6_452_716_125 scale 7 5 3 f.rdp 1.07048 ok
2_949_273_264 scale 7 5 3 f.rdp 0.48927 ok
14_634_786_781 scale 7 5 3 f.rdp 2.42785 ok
The resulting numbers are not very different from those you show.
My measurements include iForth's startup overhead, which may be one
explanation why they are a little higher.